Custom Entity Lookup cognitive skill (Preview)

Important

This skill is currently in public preview. Preview functionality is provided without a service level agreement and is not recommended for production workloads. There is currently no portal or .NET SDK support.

The Custom Entity Lookup skill looks for text from a custom, user-defined list of words and phrases. Using this list, it labels all documents with any matching entities. The skill also supports a degree of fuzzy matching, which can be applied to find matches that are similar but not exact.

This skill is not bound to a Cognitive Services API and can be used free of charge during the preview period. You should still attach a Cognitive Services resource, however, to override the daily enrichment limit. The daily limit applies to free access to Cognitive Services when accessed through Azure Cognitive Search.

@odata.type

Microsoft.Skills.Text.CustomEntityLookupSkill

Data limits

  • The maximum input record size supported is 256 MB. If you need to break up your data before sending it to the Custom Entity Lookup skill, consider using the Text Split skill.
  • The maximum entities definition table supported is 10 MB if it is provided using the entitiesDefinitionUri parameter.
  • If the entities are defined inline, using the inlineEntitiesDefinition parameter, the maximum supported size is 10 KB.

Skill parameters

Parameters are case-sensitive.

Parameter name Description
entitiesDefinitionUri Path to a JSON or CSV file containing all the target text to match against. This entity definition is read at the beginning of an indexer run; any updates to this file mid-run are not picked up until subsequent runs. This file must be accessible over HTTPS. See "Custom Entity Definition Format" below for the expected CSV or JSON schema.
inlineEntitiesDefinition Inline JSON entity definitions. This parameter supersedes the entitiesDefinitionUri parameter if present. No more than 10 KB of configuration may be provided inline. See "Custom Entity Definition Format" below for the expected JSON schema.
defaultLanguageCode (Optional) Language code of the input text, used to tokenize and delineate the input text. The following languages are supported: da, de, en, es, fi, fr, it, ko, pt. The default is English (en). If you pass a languagecode-countrycode format, only the languagecode part of the format is used.
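
The language-code rule in the last row can be illustrated with a small client-side sketch. This is not part of the skill; the function name, the lowercasing step, and the choice to raise an error for unsupported codes are assumptions made for illustration, and the service may behave differently for unsupported values.

```python
# Languages listed as supported for defaultLanguageCode.
SUPPORTED_LANGUAGES = {"da", "de", "en", "es", "fi", "fr", "it", "ko", "pt"}


def normalize_language_code(code=None):
    """Return the languagecode part of a code such as "en-US".

    Hypothetical helper: falls back to "en" when no code is given,
    mirroring the documented default.
    """
    if not code:
        return "en"
    # Keep only the part before the first hyphen (languagecode-countrycode).
    language = code.split("-", 1)[0].lower()
    if language not in SUPPORTED_LANGUAGES:
        # Assumption: reject unsupported codes on the client side.
        raise ValueError(f"unsupported language code: {code!r}")
    return language


print(normalize_language_code("pt-BR"))
print(normalize_language_code())
```

Running a check like this before building the skill definition can catch an unsupported language code early, before an indexer run.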

Skill inputs

Input name Description
text The text to analyze.
languageCode Optional. Default is "en".

Skill outputs

Output name Description
entities An array of objects that contain information about the matches that were found, and related metadata. Each of the entities identified may contain the following fields:
  • name: The top-level entity identified. The entity represents the "normalized" form.
  • id: A unique identifier for the entity as defined by the user in the "Custom Entity Definition Format".
  • description: Entity description as defined by the user in the "Custom Entity Definition Format".
  • type: Entity type as defined by the user in the "Custom Entity Definition Format".
  • subtype: Entity subtype as defined by the user in the "Custom Entity Definition Format".
  • matches: Collection that describes each of the matches for that entity in the source text. Each match has the following members:
    • text: The raw text match from the source document.
    • offset: The location where the match was found in the text.
    • length: The length of the matched text.
    • matchDistance: The number of characters by which this match differs from the original entity name or alias.
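
As a rough illustration of how the output fields fit together, the sketch below flattens an entities payload into one row per match. The helper name and the sample payload are hypothetical, and the slice at the end assumes that offset is a zero-based character index into the input text.

```python
def list_matches(skill_output):
    """Flatten skill output into (name, text, offset, length, matchDistance) tuples."""
    rows = []
    for record in skill_output.get("values", []):
        for entity in record.get("data", {}).get("entities", []):
            for match in entity.get("matches", []):
                rows.append((
                    entity["name"],
                    match["text"],
                    match["offset"],
                    match["length"],
                    match["matchDistance"],
                ))
    return rows


# Hypothetical input text and a hand-written payload in the documented shape.
source_text = "Bill Gates founded Microsoft."
sample_output = {
    "values": [{
        "recordId": "1",
        "data": {"entities": [
            {"name": "Bill Gates",
             "matches": [{"text": "Bill Gates", "offset": 0, "length": 10, "matchDistance": 0}]},
            {"name": "Microsoft",
             "matches": [{"text": "Microsoft", "offset": 19, "length": 9, "matchDistance": 0}]},
        ]},
    }]
}

for name, text, offset, length, distance in list_matches(sample_output):
    # Assumption: offset and length address characters of the source text.
    assert source_text[offset:offset + length] == text
    print(name, offset, length, distance)
```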

Custom Entity Definition Format

There are three ways to provide the list of custom entities to the Custom Entity Lookup skill. You can provide the list in a .CSV file, in a .JSON file, or as an inline definition that is part of the skill definition.

If the definition file is a .CSV or .JSON file, the path of the file needs to be provided in the entitiesDefinitionUri parameter. In this case, the file is downloaded once at the beginning of each indexer run. The file must remain accessible for as long as the indexer is intended to run. Also, the file must be UTF-8 encoded.

If the definition is provided inline, it should be provided as the content of the inlineEntitiesDefinition skill parameter.

CSV format

You can provide the definition of the custom entities to look for in a Comma-Separated Value (CSV) file by providing the path to the file and setting it in the entitiesDefinitionUri skill parameter. The path should point to an HTTPS location. The definition file can be up to 10 MB in size.

The CSV format is simple. Each line represents a unique entity, as shown below:

Bill Gates, BillG, William H. Gates
Microsoft, MSFT
Satya Nadella 

In this case, three entities can be returned (Bill Gates, Satya Nadella, Microsoft), and each one is identified if any of the terms on its line (aliases) is matched in the text. For instance, if the string "William H. Gates" is found in a document, a match for the "Bill Gates" entity is returned.
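
The relationship between a CSV line and its aliases can be sketched in a few lines of Python. This is only an illustration of the file layout, not the skill's parser: the function name is hypothetical, and treating the first term on each line as the entity name is an assumption drawn from the example above.

```python
import csv
import io


def load_alias_index(csv_text):
    """Map every term on a line to the first term on that line.

    Assumption: the first term of each line is the entity name and the
    remaining terms are its aliases.
    """
    index = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        terms = [term.strip() for term in row if term.strip()]
        if not terms:
            continue  # skip blank lines
        entity = terms[0]
        for term in terms:
            index[term] = entity
    return index


definition = """Bill Gates, BillG, William H. Gates
Microsoft, MSFT
Satya Nadella
"""

index = load_alias_index(definition)
print(index["William H. Gates"])
print(index["MSFT"])
```

An exact-match dictionary like this ignores case sensitivity and fuzzy matching, which the CSV format does not let you configure per term.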

JSON format

You can also provide the definition of the custom entities to look for in a JSON file. The JSON format gives you more flexibility because it allows you to define matching rules per term. For instance, you can specify the fuzzy matching distance (Damerau-Levenshtein distance) for each term, or whether the matching should be case-sensitive.

Just like with CSV files, you need to provide the path to the JSON file and set it in the entitiesDefinitionUri skill parameter. The path should point to an HTTPS location. The definition file can be up to 10 MB in size.

The most basic JSON custom entity list definition can be a list of entities to match:

[ 
    { 
        "name" : "Bill Gates"
    }, 
    { 
        "name" : "Microsoft"
    }, 
    { 
        "name" : "Satya Nadella"
    }
]

A more complex JSON definition can optionally provide the id, description, type, and subtype of each entity, as well as other aliases. If an alias term is matched, the entity is returned as well:

[ 
    { 
        "name" : "Bill Gates",
        "description" : "Microsoft founder." ,
        "aliases" : [ 
            { "text" : "William H. Gates", "caseSensitive" : false },
            { "text" : "BillG", "caseSensitive" : true }
        ]
    }, 
    { 
        "name" : "Xbox One", 
        "type": "Hardware",
        "subtype" : "Gaming Device",
        "id" : "4e36bf9d-5550-4396-8647-8e43d7564a76",
        "description" : "The Xbox One product"
    }, 
    { 
        "name" : "LinkedIn" , 
        "description" : "The LinkedIn company", 
        "id" : "differentIdentifyingScheme123", 
        "fuzzyEditDistance" : 0 
    }, 
    { 
        "name" : "Microsoft" , 
        "description" : "Microsoft Corporation", 
        "id" : "differentIdentifyingScheme987", 
        "defaultCaseSensitive" : false, 
        "defaultFuzzyEditDistance" : 1, 
        "aliases" : [ 
            { "text" : "MSFT", "caseSensitive" : true }
        ]
    } 
] 

The tables below describe in more detail the configuration parameters you can set when defining the entities to match:

Field name Description
name The top-level entity descriptor. Matches in the skill output are grouped by this name, and it should represent the "normalized" form of the text being found.
description (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field appears with every match of its entity in the skill output.
type (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field appears with every match of its entity in the skill output.
subtype (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field appears with every match of its entity in the skill output.
id (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field appears with every match of its entity in the skill output.
caseSensitive (Optional) Defaults to false. Boolean value denoting whether comparisons with the entity name should be sensitive to character casing. Sample case-insensitive matches of "Microsoft" could be: microsoft, microSoft, MICROSOFT.
fuzzyEditDistance (Optional) Defaults to 0. Maximum value of 5. Denotes the acceptable number of divergent characters that would still constitute a match with the entity name. The smallest possible fuzziness for any given match is returned. For instance, if the edit distance is set to 3, "Windows 10" would still match "Windows", "Windows10" and "windows 7". When case sensitivity is set to false, case differences do NOT count towards fuzziness tolerance, but otherwise do.
defaultCaseSensitive (Optional) Changes the default case sensitivity value for this entity. It can be used to change the default value of all aliases' caseSensitive values.
defaultFuzzyEditDistance (Optional) Changes the default fuzzy edit distance value for this entity. It can be used to change the default value of all aliases' fuzzyEditDistance values.
aliases (Optional) An array of complex objects that can be used to specify alternative spellings or synonyms of the root entity name.

Alias properties Description
text The alternative spelling or representation of some target entity name.
caseSensitive (Optional) Acts the same as the root entity "caseSensitive" parameter above, but applies to only this one alias.
fuzzyEditDistance (Optional) Acts the same as the root entity "fuzzyEditDistance" parameter above, but applies to only this one alias.
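
To make the fuzzyEditDistance and caseSensitive rows more concrete, here is a minimal sketch of an edit-distance check. It uses the optimal string alignment variant of Damerau-Levenshtein distance (insertions, deletions, substitutions, and adjacent transpositions); which exact variant the skill uses is not stated here, so treat the function names and the algorithm choice as assumptions.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" <-> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_fuzzy_match(candidate, entity_name, fuzzy_edit_distance=0, case_sensitive=False):
    """Hypothetical check: case differences only count when case_sensitive is True."""
    if not case_sensitive:
        candidate, entity_name = candidate.lower(), entity_name.lower()
    return osa_distance(candidate, entity_name) <= fuzzy_edit_distance


for text in ("Windows", "Windows10", "windows 7"):
    print(text, is_fuzzy_match(text, "Windows 10", fuzzy_edit_distance=3))
```

With an edit distance of 3, all three candidates from the table match "Windows 10". Lowering the limit to 2 with case_sensitive=True rejects "windows 7", because the lowercase "w" then counts as a third edit.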

Inline format

In some cases, it may be more convenient to provide the list of custom entities to match inline, directly in the skill definition. In that case, you can use a JSON format similar to the one described above, inlined in the skill definition. Only configurations that are less than 10 KB in size (serialized size) can be defined inline.
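
A quick way to estimate whether a definition fits the inline limit is to serialize it and measure the result. The sketch below assumes that 10 KB means 10 × 1024 bytes and that compact UTF-8 JSON is a fair approximation of the serialized size; both are assumptions, and the helper names are hypothetical.

```python
import json

# Assumption: the documented 10 KB limit is interpreted as 10 * 1024 bytes.
MAX_INLINE_BYTES = 10 * 1024


def inline_definition_size(entities):
    """Size in bytes of the entity list when serialized as compact UTF-8 JSON."""
    serialized = json.dumps(entities, separators=(",", ":"), ensure_ascii=False)
    return len(serialized.encode("utf-8"))


def fits_inline(entities):
    """True if the serialized definition is smaller than the inline limit."""
    return inline_definition_size(entities) < MAX_INLINE_BYTES


entities = [
    {"name": "Bill Gates", "aliases": [{"text": "BillG", "caseSensitive": True}]},
    {"name": "Xbox One", "type": "Hardware"},
]
print(inline_definition_size(entities), fits_inline(entities))
```

If the definition does not fit, the alternative is to store it in a file and reference it through entitiesDefinitionUri, which allows up to 10 MB.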

Sample definition

A sample skill definition using an inline format is shown below:

  {
    "@odata.type": "#Microsoft.Skills.Text.CustomEntityLookupSkill",
    "context": "/document",
    "inlineEntitiesDefinition": 
    [
      { 
        "name" : "Bill Gates",
        "description" : "Microsoft founder." ,
        "aliases" : [ 
            { "text" : "William H. Gates", "caseSensitive" : false },
            { "text" : "BillG", "caseSensitive" : true }
        ]
      }, 
      { 
        "name" : "Xbox One", 
        "type": "Hardware",
        "subtype" : "Gaming Device",
        "id" : "4e36bf9d-5550-4396-8647-8e43d7564a76",
        "description" : "The Xbox One product"
      }
    ],    
    "inputs": [
      {
        "name": "text",
        "source": "/document/content"
      }
    ],
    "outputs": [
      {
        "name": "entities",
        "targetName": "matchedEntities"
      }
    ]
  }

Alternatively, if you decide to provide a pointer to the entities definition file, a sample skill definition using the entitiesDefinitionUri format is shown below:

  {
    "@odata.type": "#Microsoft.Skills.Text.CustomEntityLookupSkill",
    "context": "/document",
    "entitiesDefinitionUri": "https://myblobhost.net/keyWordsConfig.csv",    
    "inputs": [
      {
        "name": "text",
        "source": "/document/content"
      }
    ],
    "outputs": [
      {
        "name": "entities",
        "targetName": "matchedEntities"
      }
    ]
  }

Sample input

{
    "values": [
      {
        "recordId": "1",
        "data":
           {
             "text": "The company, Microsoft, was founded by Bill Gates. Microsoft's gaming console is called Xbox",
             "languageCode": "en"
           }
      }
    ]
}

Sample output

  { 
    "values" : 
    [ 
      { 
        "recordId": "1", 
        "data" : { 
          "entities": [
            { 
              "name" : "Microsoft", 
              "description" : "This document refers to Microsoft the company", 
              "id" : "differentIdentifyingScheme987", 
              "matches" : [ 
                { 
                  "text" : "Microsoft", 
                  "offset" : 13, 
                  "length" : 9, 
                  "matchDistance" : 0 
                }, 
                { 
                  "text" : "Microsoft",
                  "offset" : 51, 
                  "length" : 9, 
                  "matchDistance" : 0
                }
              ] 
            },
            { 
              "name" : "Bill Gates",
              "description" : "William Henry Gates III, founder of Microsoft.", 
              "matches" : [
                { 
                  "text" : "Bill Gates",
                  "offset" : 39, 
                  "length" : 10,
                  "matchDistance" : 0 
                }
              ]
            }
          ] 
        } 
      } 
    ] 
  } 

Errors and warnings

Warning: Reached maximum capacity for matches, skipping all further duplicate matches.

This warning is emitted if the number of matches detected is greater than the maximum allowed. In this case, the skill stops including duplicate matches. If this is unacceptable to you, please file a support ticket so we can assist you with your individual use case.

See also