How to add a custom skill to an Azure Cognitive Search enrichment pipeline

An enrichment pipeline in Azure Cognitive Search can be assembled from built-in cognitive skills as well as custom skills that you create and add to the pipeline. In this article, learn how to create a custom skill that exposes an interface allowing it to be included in an AI enrichment pipeline.

Building a custom skill gives you a way to insert transformations unique to your content. A custom skill executes independently, applying whatever enrichment step you require. For example, you could define field-specific custom entities, build custom classification models to differentiate business and financial contracts and documents, or add a speech recognition skill to reach deeper into audio files for relevant content. For a step-by-step example, see Example: Creating a custom skill for AI enrichment.

Whatever custom capability you require, there is a simple and clear interface for connecting a custom skill to the rest of the enrichment pipeline. The only requirement for inclusion in a skillset is the ability to accept inputs and emit outputs in ways that are consumable within the skillset as a whole. The focus of this article is on the input and output formats that the enrichment pipeline requires.

Web API custom skill interface

By default, a custom Web API skill endpoint times out if it doesn't return a response within 30 seconds. The indexing pipeline is synchronous, so indexing produces a timeout error if no response is received in that window. You can configure the timeout to be up to 230 seconds by setting the timeout parameter:

        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "This skill has a 230 second timeout",
        "uri": "https://[your custom skill uri goes here]",
        "timeout": "PT230S",

Make sure the URI is secure (HTTPS).
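As an illustration of the two constraints above (a timeout of at most 230 seconds and an HTTPS URI), the following Python sketch assembles the same skill properties and rejects values outside those limits. The function names `timeout_literal` and `build_skill_fragment` are hypothetical helpers, not part of any Azure SDK, and the lower bound of one second is an assumption made for this sketch.

```python
MAX_TIMEOUT_SECONDS = 230     # upper limit stated in this article
DEFAULT_TIMEOUT_SECONDS = 30  # default window stated in this article


def timeout_literal(seconds: int = DEFAULT_TIMEOUT_SECONDS) -> str:
    """Format whole seconds as a duration string such as "PT230S"."""
    # Assumption: anything below one second is treated as invalid here.
    if not 0 < seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError(
            f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds"
        )
    return f"PT{seconds}S"


def build_skill_fragment(uri: str, seconds: int) -> dict:
    """Assemble the skill properties shown above; refuse non-HTTPS URIs."""
    if not uri.startswith("https://"):
        raise ValueError("the skill URI must use HTTPS")
    return {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": f"This skill has a {seconds} second timeout",
        "uri": uri,
        "timeout": timeout_literal(seconds),
    }
```

Validating these values before you submit the skillset definition is optional; it only surfaces a mistake earlier than the service would.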

Currently, the only mechanism for interacting with a custom skill is through a Web API interface. The Web API must meet the requirements described in this section.

1. Web API Input Format

The Web API must accept an array of records to be processed. Each record must contain a "property bag" that is the input provided to your Web API.

Suppose you want to create a simple enricher that identifies the first date mentioned in the text of a contract. In this example, the skill accepts a single input, contractText, which holds the contract text. The skill also has a single output, which is the date of the contract. To make the enricher more interesting, return this contractDate in the shape of a multi-part complex type.

Your Web API should be ready to receive a batch of input records. Each member of the values array represents the input for a particular record. Each record is required to have the following elements:

  • A recordId member that is the unique identifier for a particular record. When your enricher returns the results, it must provide this recordId so that the caller can match the record results to their input.

  • A data member, which is essentially a bag of input fields for each record.

To be more concrete, per the example above, your Web API should expect requests that look like this:

{
    "values": [
      {
        "recordId": "a1",
        "data":
           {
             "contractText": 
                "This is a contract that was issued on November 3, 2017 and that involves... "
           }
      },
      {
        "recordId": "b5",
        "data":
           {
             "contractText": 
                "In the City of Seattle, WA on February 5, 2018 there was a decision made..."
           }
      },
      {
        "recordId": "c3",
        "data":
           {
             "contractText": null
           }
      }
    ]
}

In reality, your service may get called with hundreds or thousands of records instead of only the three shown here.
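The input contract above can be checked with a few lines of code before any enrichment logic runs. The following Python sketch parses a request body and returns the (recordId, data) pairs it contains. The function name `parse_request` is hypothetical, and two choices here are assumptions rather than documented requirements: recordId is treated as a non-empty string (as in the sample), and a malformed batch is rejected as a whole.

```python
import json
from typing import Any, Dict, List, Tuple


def parse_request(body: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (recordId, data) pairs from a request body, checking its shape."""
    payload = json.loads(body)
    values = payload.get("values") if isinstance(payload, dict) else None
    if not isinstance(values, list):
        raise ValueError('request body must be an object with a "values" array')

    records: List[Tuple[str, Dict[str, Any]]] = []
    seen = set()
    for entry in values:
        if not isinstance(entry, dict):
            raise ValueError("each record must be a JSON object")
        record_id = entry.get("recordId")
        data = entry.get("data")
        if not isinstance(record_id, str) or not record_id:
            raise ValueError("each record needs a recordId")
        if record_id in seen:
            # recordId must identify a record uniquely within the batch.
            raise ValueError(f"duplicate recordId: {record_id}")
        if not isinstance(data, dict):
            raise ValueError(f"record {record_id} needs a data object")
        seen.add(record_id)
        records.append((record_id, data))
    return records
```

Note that a null input field, such as contractText in record c3, passes this check: the data bag is present, and deciding what to do with a missing value is left to the enrichment step.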

2. Web API Output Format

The format of the output is a set of records, each containing a recordId and a property bag:

{
  "values": 
  [
      {
        "recordId": "b5",
        "data" : 
        {
            "contractDate":  { "day" : 5, "month": 2, "year" : 2018 }
        }
      },
      {
        "recordId": "a1",
        "data" : {
            "contractDate": { "day" : 3, "month": 11, "year" : 2017 }                    
        }
      },
      {
        "recordId": "c3",
        "data" : 
        {
        },
        "errors": [ { "message": "contractText field required" } ],
        "warnings": [ {"message": "Date not found" }  ]
      }
    ]
}

This particular example has only one output, but you could output more than one property.
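To show how the input and output formats fit together, here is a minimal Python sketch of the date-extracting enricher used in this example. It is not the implementation behind the sample; `extract_first_date` and `process_batch` are hypothetical names, and the regular expression assumes dates written as "Month D, YYYY", which is the only form that appears in the sample text. Unlike the sample output, this sketch reports only an error (and no warning) when contractText is missing.

```python
import re
from typing import Any, Dict, Optional

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Assumption: dates look like "November 3, 2017".
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b"
)


def extract_first_date(text: str) -> Optional[Dict[str, int]]:
    """Return the first date in text as a day/month/year bag, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month_name, day, year = match.groups()
    return {
        "day": int(day),
        "month": MONTHS.index(month_name) + 1,
        "year": int(year),
    }


def process_batch(request: Dict[str, Any]) -> Dict[str, Any]:
    """Build one output record per input record, echoing each recordId."""
    results = []
    for record in request["values"]:
        result: Dict[str, Any] = {"recordId": record["recordId"], "data": {}}
        text = (record.get("data") or {}).get("contractText")
        if not isinstance(text, str):
            result["errors"] = [{"message": "contractText field required"}]
        else:
            date = extract_first_date(text)
            if date is None:
                result["warnings"] = [{"message": "Date not found"}]
            else:
                result["data"]["contractDate"] = date
        results.append(result)
    return {"values": results}
```

Every input record produces exactly one output record with the same recordId, so the order of the output does not need to match the order of the input.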

Errors and warnings

As shown in the previous example, you may return error and warning messages for each record.
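When testing a skill, it can help to see which records reported problems. The following Python sketch collects the error and warning messages from a response in the output format described above and groups them by recordId. The helper name `messages_by_record` is hypothetical, and it assumes each entry in errors and warnings is an object with a message property, as in the sample.

```python
from typing import Any, Dict, List


def messages_by_record(response: Dict[str, Any]) -> Dict[str, Dict[str, List[str]]]:
    """Map each recordId to the error and warning messages reported for it."""
    summary: Dict[str, Dict[str, List[str]]] = {}
    for record in response.get("values", []):
        summary[record["recordId"]] = {
            # A missing or null list is treated as "no messages".
            "errors": [e["message"] for e in record.get("errors") or []],
            "warnings": [w["message"] for w in record.get("warnings") or []],
        }
    return summary
```

Because the lookup is keyed on recordId, it works regardless of the order in which the skill returns its records.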

Consuming custom skills from a skillset

When you create a Web API enricher, you can describe HTTP headers and parameters as part of the request. The snippet below shows how request parameters and optional HTTP headers may be described as part of the skillset definition. HTTP headers are not required, but they let you add configuration options to your skill and set them from the skillset definition.

{
    "skills": [
      {
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "description": "This skill calls an Azure function, which in turn calls TA sentiment",
        "uri": "https://indexer-e2e-webskill.chinacloudsites.cn/api/DateExtractor?language=en",
        "context": "/document",
        "httpHeaders": {
            "DateExtractor-Api-Key": "foo"
        },
        "inputs": [
          {
            "name": "contractText",
            "source": "/document/content"
          }
        ],
        "outputs": [
          {
            "name": "contractDate",
            "targetName": "date"
          }
        ]
      }
  ]
}
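On the receiving side, the skill can read the settings that the definition above passes along: the language query parameter in the URI and the DateExtractor-Api-Key header. The following Python sketch shows one way to do that with the standard library. The function `read_skill_settings` is hypothetical, falling back to "en" when no language is given is an assumption, and how your web framework exposes the request URL and headers will differ.

```python
from typing import Dict, Mapping
from urllib.parse import parse_qs, urlsplit

API_KEY_HEADER = "DateExtractor-Api-Key"  # header name from the definition above


def read_skill_settings(
    request_url: str, headers: Mapping[str, str], expected_key: str
) -> Dict[str, object]:
    """Extract the language parameter and check the API key header."""
    query = parse_qs(urlsplit(request_url).query)
    # Assumption: default to "en" when the parameter is absent.
    language = query.get("language", ["en"])[0]
    # HTTP header names are case-insensitive, so compare them in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    authorized = lowered.get(API_KEY_HEADER.lower()) == expected_key
    return {"language": language, "authorized": authorized}
```

For example, calling it with the URI from the definition and a header set containing the key "foo" yields a language of "en" and an authorized value of True.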

Next steps

This article covered the interface requirements necessary for integrating a custom skill into a skillset. Click the following links to learn more about custom skills and skillset composition.