Entity Recognition cognitive skill

The Entity Recognition skill extracts entities of different types from text. This skill uses the machine learning models provided by Text Analytics in Cognitive Services.

Note

As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to attach a billable Cognitive Services resource. Charges accrue when calling APIs in Cognitive Services, and for image extraction as part of the document-cracking stage in Azure Cognitive Search. There are no charges for text extraction from documents.

Execution of built-in skills is charged at the existing Cognitive Services pay-as-you-go price. Image extraction pricing is described on the Azure Cognitive Search pricing page.

@odata.type

Microsoft.Skills.Text.EntityRecognitionSkill

Data limits

The maximum size of a record should be 50,000 characters as measured by String.Length. If you need to break up your data before sending it to the Entity Recognition skill, consider using the Text Split skill.
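The limit above is measured by .NET String.Length, which counts UTF-16 code units rather than visible characters. The following is a minimal client-side sketch for checking and splitting text against that limit before it reaches the skill. The helper names `utf16_length` and `split_for_skill` are hypothetical, the split ignores word and sentence boundaries, and it is not a description of how the Text Split skill itself works.

```python
def utf16_length(text: str) -> int:
    """Length in UTF-16 code units, which is what .NET String.Length reports."""
    return len(text.encode("utf-16-le")) // 2


def split_for_skill(text: str, max_units: int = 50_000) -> list[str]:
    """Greedily split text into chunks of at most max_units UTF-16 code units.

    Assumes max_units >= 2 so that a single supplementary-plane character
    (two code units) always fits in a chunk.
    """
    chunks, current, used = [], [], 0
    for ch in text:
        # Characters outside the Basic Multilingual Plane take two code units.
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > max_units and current:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += units
    if current:
        chunks.append("".join(current))
    return chunks
```

Inside an indexing pipeline, the Text Split skill is the natural place to do this; a helper like the one above is mainly useful for validating records ahead of time.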

Skill parameters

Parameters are case-sensitive and are all optional.

Parameter name    Description
categories    Array of categories that should be extracted. Possible category types: "Person", "Location", "Organization", "Quantity", "Datetime", "URL", "Email". If no category is provided, all types are returned.
defaultLanguageCode    Language code of the input text. The following languages are supported: ar, cs, da, de, en, es, fi, fr, hu, it, ja, ko, nl, no, pl, pt-BR, pt-PT, ru, sv, tr, zh-hans. Not all entity categories are supported for all languages; see note below.
minimumPrecision    A value between 0 and 1. If the confidence score (in the namedEntities output) is lower than this value, the entity is not returned. The default is 0.
includeTypelessEntities    Set to true if you want to recognize well-known entities that don't fit the current categories. Recognized entities are returned in the entities complex output field. For example, "Windows 10" is a well-known entity (a product), but since "Products" is not a supported category, this entity would be included in the entities output field. The default is false.
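As a rough illustration of how the parameters fit together, the sketch below assembles a skill definition as a Python dictionary and rejects values that fall outside what the table describes. The function `build_entity_skill` and its argument names are hypothetical; only the JSON property names and the category list come from this page.

```python
import json

# Category names listed in the parameter table.
ALLOWED_CATEGORIES = {
    "Person", "Location", "Organization",
    "Quantity", "Datetime", "URL", "Email",
}


def build_entity_skill(categories=None, default_language_code=None,
                       minimum_precision=None, include_typeless_entities=None,
                       source="/document/content"):
    """Return a skill definition dict; parameters left as None are omitted."""
    skill = {"@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill"}
    if categories is not None:
        unknown = sorted(set(categories) - ALLOWED_CATEGORIES)
        if unknown:
            raise ValueError(f"Unsupported categories: {unknown}")
        skill["categories"] = list(categories)
    if default_language_code is not None:
        skill["defaultLanguageCode"] = default_language_code
    if minimum_precision is not None:
        if not 0 <= minimum_precision <= 1:
            raise ValueError("minimumPrecision must be between 0 and 1")
        skill["minimumPrecision"] = minimum_precision
    if include_typeless_entities is not None:
        skill["includeTypelessEntities"] = bool(include_typeless_entities)
    # Single text input and the complex "entities" output, as an example mapping.
    skill["inputs"] = [{"name": "text", "source": source}]
    skill["outputs"] = [{"name": "entities"}]
    return skill


if __name__ == "__main__":
    definition = build_entity_skill(["Person", "Email"], minimum_precision=0.5)
    print(json.dumps(definition, indent=2))
```

Because every parameter is optional, the sketch leaves out any property that was not supplied and lets the service apply its defaults.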

Skill inputs

Input name    Description
languageCode    Optional. Default is "en".
text    The text to analyze.

Skill outputs

Note

Not all entity categories are supported for all languages. The "Person", "Location", and "Organization" entity category types are supported for the full list of languages above. Only de, en, es, fr, and zh-hans support extraction of "Quantity", "Datetime", "URL", and "Email" types. For more information, see Language and region support for the Text Analytics API.
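The language and category rules in the note can be captured in a small lookup, which may be handy for validating a skillset before deploying it. This is a sketch built only from the lists on this page; the function name `supported_categories` is hypothetical, and the authoritative list remains the Text Analytics language support documentation.

```python
# Languages accepted by defaultLanguageCode, per the parameter table.
ALL_LANGUAGES = {
    "ar", "cs", "da", "de", "en", "es", "fi", "fr", "hu", "it", "ja",
    "ko", "nl", "no", "pl", "pt-BR", "pt-PT", "ru", "sv", "tr", "zh-hans",
}
# Languages that also support the extended categories, per the note.
EXTENDED_LANGUAGES = {"de", "en", "es", "fr", "zh-hans"}

BASIC_CATEGORIES = {"Person", "Location", "Organization"}
EXTENDED_CATEGORIES = {"Quantity", "Datetime", "URL", "Email"}


def supported_categories(language_code: str) -> set[str]:
    """Return the categories this page lists for a language (empty if unsupported)."""
    if language_code not in ALL_LANGUAGES:
        return set()
    if language_code in EXTENDED_LANGUAGES:
        return BASIC_CATEGORIES | EXTENDED_CATEGORIES
    return set(BASIC_CATEGORIES)
```

For example, `supported_categories("ja")` yields only the three basic categories, while `supported_categories("en")` yields all seven.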

Output name    Description
persons    An array of strings where each string represents the name of a person.
locations    An array of strings where each string represents a location.
organizations    An array of strings where each string represents an organization.
quantities    An array of strings where each string represents a quantity.
dateTimes    An array of strings where each string represents a date-time value as it appears in the text.
urls    An array of strings where each string represents a URL.
emails    An array of strings where each string represents an email address.
namedEntities    An array of complex types that contains the following fields:
  • category
  • value (the actual entity name)
  • offset (the location where it was found in the text)
  • confidence (a higher value means it is more likely to be a real entity)
entities    An array of complex types that contains rich information about the entities extracted from text, with the following fields:
  • name (the actual entity name, in a "normalized" form)
  • wikipediaId
  • wikipediaLanguage
  • wikipediaUrl (a link to the Wikipedia page for the entity)
  • bingId
  • type (the category of the entity recognized)
  • subType (available only for certain categories; this gives a more granular view of the entity type)
  • matches (a complex collection that contains the following fields):
    • text (the raw text for the entity)
    • offset (the location where it was found)
    • length (the length of the raw entity text)

Sample definition

  {
    "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
    "categories": [ "Person", "Email"],
    "defaultLanguageCode": "en",
    "includeTypelessEntities": true,
    "minimumPrecision": 0.5,
    "inputs": [
      {
        "name": "text",
        "source": "/document/content"
      }
    ],
    "outputs": [
      {
        "name": "persons",
        "targetName": "people"
      },
      {
        "name": "emails",
        "targetName": "contact"
      },
      {
        "name": "entities"
      }
    ]
  }

Sample input

{
    "values": [
      {
        "recordId": "1",
        "data":
           {
             "text": "Contoso corporation was founded by John Smith. They can be reached at contact@contoso.com",
             "languageCode": "en"
           }
      }
    ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data" : 
      {
        "persons": [ "John Smith"],
        "emails":["contact@contoso.com"],
        "namedEntities": 
        [
          {
            "category":"Person",
            "value": "John Smith",
            "offset": 35,
            "confidence": 0.98
          }
        ],
        "entities":  
        [
          {
            "name":"John Smith",
            "wikipediaId": null,
            "wikipediaLanguage": null,
            "wikipediaUrl": null,
            "bingId": null,
            "type": "Person",
            "subType": null,
            "matches": [{
                "text": "John Smith",
                "offset": 35,
                "length": 10
            }]
          },
          {
            "name": "contact@contoso.com",
            "wikipediaId": null,
            "wikipediaLanguage": null,
            "wikipediaUrl": null,
            "bingId": null,
            "type": "Email",
            "subType": null,
            "matches": [
            {
                "text": "contact@contoso.com",
                "offset": 70,
                "length": 19
            }]
          },
          {
            "name": "Contoso",
            "wikipediaId": "Contoso",
            "wikipediaLanguage": "en",
            "wikipediaUrl": "https://en.wikipedia.org/wiki/Contoso",
            "bingId": "349f014e-7a37-e619-0374-787ebb288113",
            "type": null,
            "subType": null,
            "matches": [
            {
                "text": "Contoso",
                "offset": 0,
                "length": 7
            }]
          }
        ]
      }
    }
  ]
}

Note that the offsets returned for entities in the output of this skill come directly from the Text Analytics API. If you use them to index into the original string, use the StringInfo class in .NET to extract the correct content.
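To make the offset and length fields concrete, the sketch below slices the sample input with the values from the sample output. It is written in Python, which slices by Unicode code point; that coincides with the text elements StringInfo works with only for simple text such as this ASCII sample. Text containing combining marks or similar sequences would need a grapheme-aware approach, so treat this as an illustration under that assumption. The helper name `slice_match` is hypothetical.

```python
# Text from the sample input.
SAMPLE_TEXT = ("Contoso corporation was founded by John Smith. "
               "They can be reached at contact@contoso.com")

# "matches" entries copied from the sample output.
SAMPLE_MATCHES = [
    {"text": "John Smith", "offset": 35, "length": 10},
    {"text": "contact@contoso.com", "offset": 70, "length": 19},
    {"text": "Contoso", "offset": 0, "length": 7},
]


def slice_match(text: str, match: dict) -> str:
    """Return the span of text described by a match's offset and length."""
    start = match["offset"]
    return text[start:start + match["length"]]


# Each slice should reproduce the raw entity text reported by the skill.
for m in SAMPLE_MATCHES:
    assert slice_match(SAMPLE_TEXT, m) == m["text"]
```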

Error cases

If the language code for the document is unsupported, an error is returned and no entities are extracted.

See also