How to use Named Entity Recognition in Text Analytics

The Text Analytics API takes unstructured text and returns a list of disambiguated entities, with links to more information on the web. The API supports both named entity recognition (NER) and entity linking.

Entity Linking and Named Entity Recognition

Entity linking is the ability to identify and disambiguate the identity of an entity found in text (for example, determining whether an occurrence of the word Mars refers to the planet or to the Roman god of war). This process requires a knowledge base in an appropriate language, against which the entities recognized in the text are linked. Entity Linking uses Wikipedia as this knowledge base.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is the ability to identify different entities in text and categorize them into pre-defined classes or types, such as person, location, event, product, and organization.

In Text Analytics Version 2.1, both entity linking and named entity recognition (NER) are available for several languages. See the language support article for more information.

Language support

Using entity linking in various languages requires a corresponding knowledge base in each language. For entity linking in Text Analytics, this means each language supported by the entities endpoint links to the Wikipedia corpus in that language. Because the size of the corpora varies between languages, the recall of entity linking is expected to vary as well.

Supported Types for Named Entity Recognition

Type          SubType      Example
Person        N/A*         "Jeff", "Bill Gates"
Location      N/A*         "Redmond, Washington", "Paris"
Organization  N/A*         "Microsoft"
Quantity      Number       "6", "six"
Quantity      Percentage   "50%", "fifty percent"
Quantity      Ordinal      "2nd", "second"
Quantity      NumberRange  "4 to 8"
Quantity      Age          "90 day old", "30 years old"
Quantity      Currency     "$10.99"
Quantity      Dimension    "10 miles", "40 cm"
Quantity      Temperature  "32 degrees"
DateTime      N/A*         "6:30PM February 4, 2012"
DateTime      Date         "May 2nd, 2017", "05/02/2017"
DateTime      Time         "8am", "8:00"
DateTime      DateRange    "May 2nd to May 5th"
DateTime      TimeRange    "6pm to 7pm"
DateTime      Duration     "1 minute and 45 seconds"
DateTime      Set          "every Tuesday"
DateTime      TimeZone
URL           N/A*         "https://www.bing.com"
Email         N/A*         "support@contoso.com"

* Depending on the input and extracted entities, certain entities may omit the SubType. All the supported entity types listed are available only for the English, Chinese-Simplified, French, German, and Spanish languages.

Request endpoints

Named Entity Recognition v2 uses a single endpoint for NER and entity linking requests:

https://<your-custom-subdomain>.cognitiveservices.azure.cn/text/analytics/v2.1/entities


Sending a REST API request

Preparation

You must have JSON documents in this format: ID, text, language.

Each document must be under 5,120 characters, and you can have up to 1,000 items (IDs) per collection. The collection is submitted in the body of the request.
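
As a rough illustration of the preparation step, the following Python sketch assembles a request body and checks it against the limits stated above. The function name `build_request_body` and the sequential ID scheme are hypothetical, and the top-level `documents` key is an assumption about the v2.1 request shape. Python's `len()` counts Unicode code points, which may not match how the service measures document length.

```python
import json

MAX_CHARS = 5120      # each document must be under 5,120 characters
MAX_DOCUMENTS = 1000  # at most 1,000 items (IDs) per collection


def build_request_body(texts, language="en"):
    """Return a JSON string holding one document per input text.

    IDs are assigned sequentially as "1", "2", ... (an illustrative choice).
    """
    if len(texts) > MAX_DOCUMENTS:
        raise ValueError(f"too many documents: {len(texts)} > {MAX_DOCUMENTS}")

    documents = []
    for index, text in enumerate(texts, start=1):
        if len(text) >= MAX_CHARS:
            raise ValueError(
                f"document {index} has {len(text)} characters; "
                f"it must be under {MAX_CHARS}"
            )
        documents.append({"id": str(index), "language": language, "text": text})

    # "documents" as the wrapper key is an assumption about the request shape.
    return json.dumps({"documents": documents}, ensure_ascii=False)
```

Checking the limits on the client side lets an oversized collection fail before a request is sent.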

Structure the request

Create a POST request. You can use Postman or the API testing console in the following links to quickly structure and send one.

Note

You can find the key and endpoint for your Text Analytics resource on the Azure portal. They are located on the resource's Quick start page, under Resource Management.
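
If you prefer to build the request in code rather than in Postman, a minimal Python sketch using only the standard library might look like the following. The helper names `build_entities_request` and `send` are hypothetical. The `Ocp-Apim-Subscription-Key` header and the `documents` wrapper key are assumptions about how the service expects the key and body to be passed, so verify them against your resource before relying on this.

```python
import json
import urllib.request

# Path taken from the endpoint shown in "Request endpoints".
ENTITIES_PATH = "/text/analytics/v2.1/entities"


def build_entities_request(endpoint, key, documents):
    """Build (but do not send) a POST request to the entities endpoint."""
    url = endpoint.rstrip("/") + ENTITIES_PATH
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the resource key.
            "Ocp-Apim-Subscription-Key": key,
        },
    )


def send(request):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To use it, pass the endpoint and key from the Azure portal, for example `send(build_entities_request("https://<your-custom-subdomain>.cognitiveservices.azure.cn", "<your-key>", documents))`, where `documents` is a list of dictionaries with `id`, `language`, and `text` entries.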

Post the request

Analysis is performed upon receipt of the request. See the data limits section in the overview for information on the size and number of requests you can send per minute and second.

The Text Analytics API is stateless. No data is stored in your account, and results are returned immediately in the response.

View results

All POST requests return a JSON formatted response with the IDs and detected properties.

Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system, and then import it into an application that allows you to sort, search, and manipulate the data.
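
One possible way to prepare the output for a spreadsheet or similar tool is to flatten it into CSV, with one row per matched span. The sketch below is illustrative only: the function name `entities_to_csv` and the column names are hypothetical, and the key casing (`Documents`, `Entities`, `Matches`) is assumed to follow the example output in this article, so adjust it if your responses differ.

```python
import csv
import io


def entities_to_csv(response):
    """Return CSV text with one row per entity match in the response."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(
        ["document_id", "name", "type", "subtype", "text", "offset", "length"]
    )
    for document in response.get("Documents", []):
        for entity in document.get("Entities", []):
            for match in entity.get("Matches", []):
                writer.writerow([
                    document["Id"],
                    entity["Name"],
                    entity.get("Type", ""),     # some entities carry no Type
                    entity.get("SubType", ""),  # SubType may be omitted
                    match["Text"],
                    match["Offset"],
                    match["Length"],
                ])
    return buffer.getvalue()
```

The returned string can be written to a local file and opened in any application that reads CSV.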

An example of the output for entity linking is shown next:

{
    "Documents": [
        {
            "Id": "1",
            "Entities": [
                {
                    "Name": "Jeff",
                    "Matches": [
                        {
                            "Text": "Jeff",
                            "Offset": 0,
                            "Length": 4
                        }
                    ],
                    "Type": "Person"
                },
                {
                    "Name": "three dozen",
                    "Matches": [
                        {
                            "Text": "three dozen",
                            "Offset": 12,
                            "Length": 11
                        }
                    ],
                    "Type": "Quantity",
                    "SubType": "Number"
                },
                {
                    "Name": "50",
                    "Matches": [
                        {
                            "Text": "50",
                            "Offset": 49,
                            "Length": 2
                        }
                    ],
                    "Type": "Quantity",
                    "SubType": "Number"
                },
                {
                    "Name": "50%",
                    "Matches": [
                        {
                            "Text": "50%",
                            "Offset": 49,
                            "Length": 3
                        }
                    ],
                    "Type": "Quantity",
                    "SubType": "Percentage"
                }
            ]
        },
        {
            "Id": "2",
            "Entities": [
                {
                    "Name": "Great Depression",
                    "Matches": [
                        {
                            "Text": "The Great Depression",
                            "Offset": 0,
                            "Length": 20
                        }
                    ],
                    "WikipediaLanguage": "en",
                    "WikipediaId": "Great Depression",
                    "WikipediaUrl": "https://en.wikipedia.org/wiki/Great_Depression",
                    "BingId": "d9364681-98ad-1a66-f869-a3f1c8ae8ef8"
                },
                {
                    "Name": "1929",
                    "Matches": [
                        {
                            "Text": "1929",
                            "Offset": 30,
                            "Length": 4
                        }
                    ],
                    "Type": "DateTime",
                    "SubType": "DateRange"
                },
                {
                    "Name": "By 1933",
                    "Matches": [
                        {
                            "Text": "By 1933",
                            "Offset": 36,
                            "Length": 7
                        }
                    ],
                    "Type": "DateTime",
                    "SubType": "DateRange"
                },
                {
                    "Name": "Gross domestic product",
                    "Matches": [
                        {
                            "Text": "GDP",
                            "Offset": 49,
                            "Length": 3
                        }
                    ],
                    "WikipediaLanguage": "en",
                    "WikipediaId": "Gross domestic product",
                    "WikipediaUrl": "https://en.wikipedia.org/wiki/Gross_domestic_product",
                    "BingId": "c859ed84-c0dd-e18f-394a-530cae5468a2"
                },
                {
                    "Name": "United States",
                    "Matches": [
                        {
                            "Text": "America",
                            "Offset": 56,
                            "Length": 7
                        }
                    ],
                    "WikipediaLanguage": "en",
                    "WikipediaId": "United States",
                    "WikipediaUrl": "https://en.wikipedia.org/wiki/United_States",
                    "BingId": "5232ed96-85b1-2edb-12c6-63e6c597a1de",
                    "Type": "Location"
                },
                {
                    "Name": "25",
                    "Matches": [
                        {
                            "Text": "25",
                            "Offset": 72,
                            "Length": 2
                        }
                    ],
                    "Type": "Quantity",
                    "SubType": "Number"
                },
                {
                    "Name": "25%",
                    "Matches": [
                        {
                            "Text": "25%",
                            "Offset": 72,
                            "Length": 3
                        }
                    ],
                    "Type": "Quantity",
                    "SubType": "Percentage"
                }
            ]
        }
    ],
    "Errors": []
}
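
In the example above, some entities carry Wikipedia links ("Great Depression"), some carry a Type ("25%"), and some carry both ("United States"). A short Python sketch for separating the two kinds follows. The function name `split_entities` is hypothetical, and treating the presence of `WikipediaUrl` as the marker of a linked entity is an assumption based on this sample alone.

```python
def split_entities(response):
    """Return (linked, typed) lists of (document_id, entity) pairs.

    An entity appears in `linked` if it has a WikipediaUrl and in `typed`
    if it has a Type; an entity with both appears in both lists.
    """
    linked, typed = [], []
    for document in response.get("Documents", []):
        for entity in document.get("Entities", []):
            pair = (document["Id"], entity)
            if "WikipediaUrl" in entity:  # assumed marker of entity linking
                linked.append(pair)
            if "Type" in entity:          # NER result
                typed.append(pair)
    return linked, typed
```

Because the same entity can appear in both lists, the two results are views of one response rather than a partition of it.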

Summary

In this article, you learned concepts and workflow for entity linking using Text Analytics in Cognitive Services. In summary:

  • Named Entity Recognition is available for selected languages in two versions.
  • JSON documents in the request body include an ID, text, and language code.
  • POST requests are sent to one or more endpoints, using a personalized access key and an endpoint that is valid for your subscription.
  • Response output, which consists of linked entities (including confidence scores, offsets, and web links for each document ID), can be used in any application.

Next steps