How to use Named Entity Recognition in Text Analytics

The Text Analytics API takes unstructured text and returns a list of disambiguated entities, with links to more information on the web. The API supports both named entity recognition (NER) for several entity categories and entity linking.

Entity Linking

Entity linking is the ability to identify and disambiguate the identity of an entity found in text (for example, determining whether an occurrence of the word "Mars" refers to the planet or to the Roman god of war). This process requires a knowledge base in an appropriate language, to which entities recognized in the text are linked. Entity Linking uses Wikipedia as this knowledge base.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is the ability to identify different entities in text and categorize them into pre-defined classes or types, such as person, location, event, product, and organization.

Personally Identifiable Information (PII)

The PII feature is part of NER. It can identify and redact sensitive entities in text that are associated with an individual person, such as a phone number, email address, mailing address, or passport number.

Named Entity Recognition features and versions

Feature                                                        NER v3.0   NER v3.1-preview.4
Methods for single and batch requests                          X          X
Expanded entity recognition across several categories          X          X
Separate endpoints for sending entity linking and NER requests X          X
Recognition of personal (PII) information entities             -          X
Redaction of PII                                               -          X

See language support for information.

Named Entity Recognition v3 provides expanded detection across multiple types. Currently, NER v3.0 can recognize entities in the general entity category.

Named Entity Recognition v3.1-preview.4 includes the detection capabilities of v3.0, and:

  • The ability to detect personal information (PII) using the v3.1-preview.4/entities/recognition/pii endpoint.

For more information, see the entity categories article and the request endpoints section below.

Sending a REST API request

Preparation

You must have JSON documents in this format: ID, text, language.

Each document must be under 5,120 characters, and you can have up to 1,000 items (IDs) per collection. The collection is submitted in the body of the request.
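
The limits above can be checked before anything is sent. The following is a minimal sketch, not an official client: the function name `build_request_bodies` and the sequential ID scheme are assumptions made for illustration. It also assumes that Python's `len()` (which counts Unicode code points) is an acceptable approximation of how the service counts characters.

```python
from typing import Dict, Iterator, List

MAX_CHARS_PER_DOCUMENT = 5120      # each document must be under this length
MAX_DOCUMENTS_PER_REQUEST = 1000   # maximum items (IDs) per collection


def build_request_bodies(texts: List[str], language: str = "en") -> Iterator[Dict]:
    """Yield request bodies, each holding at most 1,000 documents.

    Hypothetical helper: IDs are assigned sequentially starting at "1".
    """
    for start in range(0, len(texts), MAX_DOCUMENTS_PER_REQUEST):
        chunk = texts[start:start + MAX_DOCUMENTS_PER_REQUEST]
        documents = []
        for index, text in enumerate(chunk, start=start + 1):
            # len() counts code points; the service may count differently
            # for some scripts and emoji (assumption, see lead-in).
            if len(text) >= MAX_CHARS_PER_DOCUMENT:
                raise ValueError(
                    f"document {index} has {len(text)} characters; "
                    f"it must be under {MAX_CHARS_PER_DOCUMENT}"
                )
            documents.append({"id": str(index), "language": language, "text": text})
        yield {"documents": documents}
```

Each yielded dictionary can be serialized with `json.dumps` and submitted as the body of one request.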

Structure the request

Create a POST request. You can use Postman or the API testing console in the following links to quickly structure and send one.

Note

You can find the key and endpoint for your Text Analytics resource on the Azure portal. They are located on the resource's Quick start page, under Resource Management.

Request endpoints

Named Entity Recognition v3.1-preview.4 uses separate endpoints for NER, PII, and entity linking requests. Use the URL format below that matches your request.

Entity linking

  • https://<your-custom-subdomain>.cognitiveservices.azure.cn/text/analytics/v3.1-preview.4/entities/linking

Named Entity Recognition version 3.1-preview reference for Linking

Named Entity Recognition

  • General entities - https://<your-custom-subdomain>.cognitiveservices.azure.cn/text/analytics/v3.1-preview.4/entities/recognition/general

Named Entity Recognition version 3.1-preview reference for General

Personally Identifiable Information (PII)

  • Personal (PII) information - https://<your-custom-subdomain>.cognitiveservices.azure.cn/text/analytics/v3.1-preview.4/entities/recognition/pii

Starting in v3.1-preview.4, the JSON response includes a redactedText property, which contains the input text with each character of every detected PII entity replaced by an asterisk (*).

Named Entity Recognition version 3.1-preview reference for PII

The API will attempt to detect the listed entity categories for a given document language. If you want to specify which entities will be detected and returned, use the optional piiCategories parameter with the appropriate entity categories. This parameter also lets you detect entities that aren't enabled by default for your document language, for example, a French driver's license number that might occur in English text:

https://<your-custom-subdomain>.cognitiveservices.azure.cn/text/analytics/v3.1-preview.4/entities/recognition/pii?piiCategories=[FRDriversLicenseNumber]

Set a request header to include your Text Analytics API key. In the request body, provide the JSON documents you prepared.
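
As a rough illustration of how the endpoint, key header, and body fit together, the sketch below builds (but does not send) a POST request with Python's standard library. The helper name `build_request` and the `OPERATIONS` table are assumptions made for this example. The `Ocp-Apim-Subscription-Key` header is the key header that Azure Cognitive Services resources generally expect; confirm it against the API reference for your resource. The optional piiCategories query parameter is left out here.

```python
import json
import urllib.request
from typing import Dict

API_VERSION = "v3.1-preview.4"

# Path suffixes taken from the endpoint URLs listed above.
OPERATIONS = {
    "linking": "entities/linking",
    "general": "entities/recognition/general",
    "pii": "entities/recognition/pii",
}


def build_request(endpoint: str, key: str, operation: str, body: Dict) -> urllib.request.Request:
    """Build a POST request for one of the three operations (hypothetical helper)."""
    url = f"{endpoint.rstrip('/')}/text/analytics/{API_VERSION}/{OPERATIONS[operation]}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,   # your Text Analytics API key
        "Content-Type": "application/json",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the response with `json.load`. Replace the endpoint with your own `https://<your-custom-subdomain>.cognitiveservices.azure.cn` value.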

Example requests

Example synchronous NER request

The following JSON is an example of content you might send to the API. The request format is the same for both versions of the API.

{
  "documents": [
    {
        "id": "1",
        "language": "en",
        "text": "Our tour guide took us up the Space Needle during our trip to Seattle last week."
    }
  ]
}

Post the request

Analysis is performed upon receipt of the request. See the data limits article for information on the size and number of requests you can send per minute and per second.

The Text Analytics API is stateless. No data is stored in your account, and results are returned immediately in the response.

View results

All POST requests return a JSON-formatted response with the IDs and detected entity properties.

Output is returned immediately. You can stream the results to an application that accepts JSON, or save the output to a file on the local system and then import it into an application that lets you sort, search, and manipulate the data. Due to multilingual and emoji support, the response may contain text offsets. For more information, see how to process text offsets.

Example responses

Version 3 provides separate endpoints for general NER, PII, and entity linking. Example responses for each operation follow.

Synchronous example results

Example of a general NER response:

{
  "documents": [
    {
      "id": "1",
      "entities": [
        {
          "text": "tour guide",
          "category": "PersonType",
          "offset": 4,
          "length": 10,
          "confidenceScore": 0.45
        },
        {
          "text": "Space Needle",
          "category": "Location",
          "offset": 30,
          "length": 12,
          "confidenceScore": 0.38
        },
        {
          "text": "trip",
          "category": "Event",
          "offset": 54,
          "length": 4,
          "confidenceScore": 0.78
        },
        {
          "text": "Seattle",
          "category": "Location",
          "subcategory": "GPE",
          "offset": 62,
          "length": 7,
          "confidenceScore": 0.78
        },
        {
          "text": "last week",
          "category": "DateTime",
          "subcategory": "DateRange",
          "offset": 70,
          "length": 9,
          "confidenceScore": 0.8
        }
      ],
      "warnings": []
    }
  ],
  "errors": [],
  "modelVersion": "2020-04-01"
}
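
One common way to consume a response like this is to keep only the entities whose confidence score clears a threshold. The sketch below assumes the response has already been parsed into a Python dictionary; the function name `entities_above` and the abbreviated sample data are illustrative only.

```python
from typing import Dict, List, Tuple


def entities_above(response: Dict, min_confidence: float) -> List[Tuple[str, str, str]]:
    """Return (document id, entity text, category) for sufficiently confident entities."""
    results = []
    for document in response.get("documents", []):
        for entity in document.get("entities", []):
            if entity["confidenceScore"] >= min_confidence:
                results.append((document["id"], entity["text"], entity["category"]))
    return results


# Abbreviated copy of the general NER response shown above.
ner_response = {
    "documents": [
        {
            "id": "1",
            "entities": [
                {"text": "tour guide", "category": "PersonType", "confidenceScore": 0.45},
                {"text": "Space Needle", "category": "Location", "confidenceScore": 0.38},
                {"text": "trip", "category": "Event", "confidenceScore": 0.78},
                {"text": "Seattle", "category": "Location", "confidenceScore": 0.78},
                {"text": "last week", "category": "DateTime", "confidenceScore": 0.8},
            ],
        }
    ]
}

confident = entities_above(ner_response, 0.5)
```

With a threshold of 0.5, `confident` holds the entries for "trip", "Seattle", and "last week". The right threshold depends on your application.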

Example of a PII response:

{
  "documents": [
    {
      "redactedText": "You can even pre-order from their online menu at *************************, call ************ or send email to ***************************!",
      "id": "0",
      "entities": [
        {
          "text": "www.contososteakhouse.com",
          "category": "URL",
          "offset": 49,
          "length": 25,
          "confidenceScore": 0.8
        },
        {
          "text": "312-555-0176",
          "category": "Phone Number",
          "offset": 81,
          "length": 12,
          "confidenceScore": 0.8
        },
        {
          "text": "order@contososteakhouse.com",
          "category": "Email",
          "offset": 111,
          "length": 27,
          "confidenceScore": 0.8
        }
      ],
      "warnings": []
    }
  ],
  "errors": [],
  "modelVersion": "2020-07-01"
}
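
To show how redactedText relates to the offset and length of each entity, the sketch below reproduces the masking locally. This only illustrates the relationship and is not the service's implementation; the function name `redact` is hypothetical. It also assumes offsets index Python string positions (code points), which holds for ASCII text like this sample but may not hold for text containing emoji or some other scripts.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict], mask: str = "*") -> str:
    """Replace every character of each entity span with the mask character."""
    chars = list(text)
    for entity in entities:
        start = entity["offset"]
        end = start + entity["length"]
        # One mask character per original character, so the length is unchanged.
        chars[start:end] = mask * (end - start)
    return "".join(chars)


original = (
    "You can even pre-order from their online menu at "
    "www.contososteakhouse.com, call 312-555-0176 or send email to "
    "order@contososteakhouse.com!"
)
pii_entities = [
    {"text": "www.contososteakhouse.com", "offset": 49, "length": 25},
    {"text": "312-555-0176", "offset": 81, "length": 12},
    {"text": "order@contososteakhouse.com", "offset": 111, "length": 27},
]

redacted = redact(original, pii_entities)
```

For this sample, `redacted` matches the redactedText value in the response above. In practice you would use the redactedText returned by the service rather than recomputing it.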

Example of an entity linking response:

{
  "documents": [
    {
      "id": "1",
      "entities": [
        {
          "bingId": "f8dd5b08-206d-2554-6e4a-893f51f4de7e", 
          "name": "Space Needle",
          "matches": [
            {
              "text": "Space Needle",
              "offset": 30,
              "length": 12,
              "confidenceScore": 0.4
            }
          ],
          "language": "en",
          "id": "Space Needle",
          "url": "https://en.wikipedia.org/wiki/Space_Needle",
          "dataSource": "Wikipedia"
        },
        {
          "bingId": "5fbba6b8-85e1-4d41-9444-d9055436e473",
          "name": "Seattle",
          "matches": [
            {
              "text": "Seattle",
              "offset": 62,
              "length": 7,
              "confidenceScore": 0.25
            }
          ],
          "language": "en",
          "id": "Seattle",
          "url": "https://en.wikipedia.org/wiki/Seattle",
          "dataSource": "Wikipedia"
        }
      ],
      "warnings": []
    }
  ],
  "errors": [],
  "modelVersion": "2020-02-01"
}
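
A linked-entity response can be flattened into one row per match, pairing the matched text with the knowledge-base name and URL. The sketch below assumes the response is already parsed into a dictionary; the function name `linked_matches` and the abbreviated sample are illustrative only.

```python
from typing import Dict, List, Tuple


def linked_matches(response: Dict) -> List[Tuple[str, str, str, str]]:
    """Return (document id, matched text, entity name, URL) for every match."""
    rows = []
    for document in response.get("documents", []):
        for entity in document.get("entities", []):
            for match in entity.get("matches", []):
                rows.append((document["id"], match["text"], entity["name"], entity["url"]))
    return rows


# Abbreviated copy of the entity linking response shown above.
linking_response = {
    "documents": [
        {
            "id": "1",
            "entities": [
                {
                    "name": "Space Needle",
                    "matches": [{"text": "Space Needle", "offset": 30, "length": 12}],
                    "url": "https://en.wikipedia.org/wiki/Space_Needle",
                },
                {
                    "name": "Seattle",
                    "matches": [{"text": "Seattle", "offset": 62, "length": 7}],
                    "url": "https://en.wikipedia.org/wiki/Seattle",
                },
            ],
        }
    ]
}

rows = linked_matches(linking_response)
```

Each row can then be used to annotate the original text or to build links in an application.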

Summary

In this article, you learned concepts and workflow for entity linking using Text Analytics in Cognitive Services. In summary:

  • JSON documents in the request body include an ID, text, and language code.
  • POST requests are sent to one or more endpoints, using a personalized access key and an endpoint that is valid for your subscription.
  • Response output, which consists of linked entities (including confidence scores, offsets, and web links for each document ID), can be used in any application.

Next steps