How to configure caching for incremental enrichment in Azure Cognitive Search

Important

Incremental enrichment is currently in public preview. This preview version is provided without a service level agreement, and it's not recommended for production workloads. The REST API version 2019-05-06-Preview provides this feature. There is no portal or .NET SDK support at this time.

This article shows you how to add caching to an enrichment pipeline so that you can incrementally modify steps without having to rebuild every time. By default, a skillset is stateless, and changing any part of its composition requires a full rerun of the indexer. With incremental enrichment, the indexer can determine which parts of the document tree need to be refreshed based on changes detected in the skillset or indexer definitions. Existing processed output is preserved and reused wherever possible.

Cached content is placed in Azure Storage using account information that you provide. The container, named ms-az-search-indexercache-<alphanumeric-string>, is created when you run the indexer. Treat it as an internal component managed by your search service; it must not be modified.

If you're not familiar with setting up indexers, start with the indexer overview and then continue on to skillsets to learn about enrichment pipelines. For more background on key concepts, see incremental enrichment.

Enable caching on an existing indexer

If you have an existing indexer that already has a skillset, follow the steps in this section to add caching. As a one-time operation, you will have to reset and rerun the indexer in full before incremental processing can take effect.

Tip

As a proof of concept, you can run through this portal quickstart to create the necessary objects, and then use Postman or the portal to make your updates. You might want to attach a billable Cognitive Services resource. Running the indexer multiple times will exhaust the free daily allocation before you can complete all of the steps.

Step 1: Get the indexer definition

Start with a valid, existing indexer that has these components: data source, skillset, index. Your indexer should be runnable.

Using an API client, construct a GET Indexer request to get the current configuration of the indexer. When you use the preview API version to GET the indexer, a cache property set to null is added to the definition.

GET https://[YOUR-SEARCH-SERVICE].search.azure.cn/indexers/[YOUR-INDEXER-NAME]?api-version=2019-05-06-Preview
Content-Type: application/json
api-key: [YOUR-ADMIN-KEY]

Copy the indexer definition from the response.
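
The GET request in this step could be assembled with Python's standard library, roughly as sketched below. The function name build_get_indexer_request and the service, indexer, and key values are illustrative placeholders, not part of the service API. The request is only constructed here, not sent.

```python
import urllib.request

API_VERSION = "2019-05-06-Preview"


def build_get_indexer_request(service_url, indexer_name, admin_key):
    """Build, without sending, a GET Indexer request for the preview API."""
    url = f"{service_url}/indexers/{indexer_name}?api-version={API_VERSION}"
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values; substitute your own service, indexer, and admin key.
request = build_get_indexer_request(
    "https://my-service.search.azure.cn", "my-indexer", "my-admin-key"
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and parse the response body with json.load. The parsed definition is the object to copy and edit in the next step.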

Step 2: Modify the cache property in the indexer definition

By default, the cache property is null. Use an API client to set the cache configuration (the portal does not support this particular update).

Modify the cache object to include the following required and optional properties:

  • The storageConnectionString property is required, and it must be set to an Azure storage connection string.
  • The enableReprocessing boolean property is optional (true by default) and indicates that incremental enrichment is enabled. When needed, you can set it to false to suspend incremental processing while other resource-intensive operations, such as indexing new documents, are underway, and then set it back to true later.

{
    "name": "<YOUR-INDEXER-NAME>",
    "targetIndexName": "<YOUR-INDEX-NAME>",
    "dataSourceName": "<YOUR-DATASOURCE-NAME>",
    "skillsetName": "<YOUR-SKILLSET-NAME>",
    "cache" : {
        "storageConnectionString" : "<YOUR-STORAGE-ACCOUNT-CONNECTION-STRING>",
        "enableReprocessing": true
    },
    "fieldMappings" : [],
    "outputFieldMappings": [],
    "parameters": {}
}
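
If the definition from Step 1 is held as a Python dictionary, the edit described above might be applied along these lines. This is a minimal sketch: the helper name with_cache and the sample values are assumptions made for illustration, while the property names come from the definition shown above.

```python
import copy
import json


def with_cache(indexer_definition, storage_connection_string, enable_reprocessing=True):
    """Return a copy of the indexer definition with the cache property set."""
    if not storage_connection_string:
        raise ValueError("storageConnectionString is required")
    updated = copy.deepcopy(indexer_definition)
    updated["cache"] = {
        "storageConnectionString": storage_connection_string,
        "enableReprocessing": enable_reprocessing,
    }
    return updated


# A definition as it might come back from GET Indexer, with cache set to null.
existing = {
    "name": "my-indexer",
    "targetIndexName": "my-index",
    "dataSourceName": "my-datasource",
    "skillsetName": "my-skillset",
    "cache": None,
}
updated = with_cache(existing, "<YOUR-STORAGE-ACCOUNT-CONNECTION-STRING>")
print(json.dumps(updated["cache"], indent=4))
```

Passing enable_reprocessing=False to the same helper produces a definition that keeps the cache configured but suspends incremental processing.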

Step 3: Reset the indexer

A reset of the indexer is required when setting up incremental enrichment for existing indexers, to ensure that all documents are in a consistent state. You can use the portal or an API client and the Reset Indexer REST API for this task.

POST https://[YOUR-SEARCH-SERVICE].search.azure.cn/indexers/[YOUR-INDEXER-NAME]/reset?api-version=2019-05-06-Preview
Content-Type: application/json
api-key: [YOUR-ADMIN-KEY]
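
A reset request like the one above could be built in Python as follows. This is a sketch under stated assumptions: build_indexer_action_request is a hypothetical helper, and its action argument is limited to the two body-less POST endpoints that appear in this article (reset and run). Nothing is sent over the network.

```python
import urllib.request


def build_indexer_action_request(service_url, indexer_name, admin_key, action):
    """Build, without sending, a POST request for an indexer action."""
    if action not in ("reset", "run"):
        raise ValueError(f"unsupported action: {action}")
    url = (
        f"{service_url}/indexers/{indexer_name}/{action}"
        "?api-version=2019-05-06-Preview"
    )
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    # The request carries no payload, so an empty body is attached.
    return urllib.request.Request(url, data=b"", headers=headers, method="POST")


# Placeholder values; substitute your own service, indexer, and admin key.
reset_request = build_indexer_action_request(
    "https://my-service.search.azure.cn", "my-indexer", "my-admin-key", "reset"
)
print(reset_request.get_method(), reset_request.full_url)
```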

Step 4: Save the updated definition

Update the indexer with a PUT request. The body of the request should contain the updated indexer definition that has the cache property. If you get a 400 error, check the indexer definition to make sure all requirements are met (data source, skillset, index).

PUT https://[YOUR-SEARCH-SERVICE].search.azure.cn/indexers/[YOUR-INDEXER-NAME]?api-version=2019-05-06-Preview
Content-Type: application/json
api-key: [YOUR-ADMIN-KEY]
{
    "name" : "<YOUR-INDEXER-NAME>",
    ...
    "cache": {
        "storageConnectionString": "<YOUR-STORAGE-ACCOUNT-CONNECTION-STRING>",
        "enableReprocessing": true
    }
}
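
One way to catch an incomplete definition before it results in a 400 is to check it locally and only then build the PUT request. The sketch below assumes the definition is a Python dictionary. The helper names and the list of checked keys are illustrative choices based on the requirements named in this step, not an exhaustive validation of the service schema.

```python
import json
import urllib.request

# Keys this article says the definition needs: data source, skillset, index, cache.
REQUIRED_KEYS = ("name", "dataSourceName", "targetIndexName", "skillsetName", "cache")


def missing_keys(indexer_definition):
    """List required keys that are absent or empty in the definition."""
    return [key for key in REQUIRED_KEYS if not indexer_definition.get(key)]


def build_update_indexer_request(service_url, admin_key, indexer_definition):
    """Build, without sending, a PUT request carrying the updated definition."""
    missing = missing_keys(indexer_definition)
    if missing:
        raise ValueError(f"indexer definition is missing: {', '.join(missing)}")
    url = (
        f"{service_url}/indexers/{indexer_definition['name']}"
        "?api-version=2019-05-06-Preview"
    )
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    body = json.dumps(indexer_definition).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


definition = {
    "name": "my-indexer",
    "targetIndexName": "my-index",
    "dataSourceName": "my-datasource",
    "skillsetName": "my-skillset",
    "cache": {
        "storageConnectionString": "<YOUR-STORAGE-ACCOUNT-CONNECTION-STRING>",
        "enableReprocessing": True,
    },
}
put_request = build_update_indexer_request(
    "https://my-service.search.azure.cn", "my-admin-key", definition
)
print(put_request.get_method(), put_request.full_url)
```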

If you now issue another GET request on the indexer, the response from the service will include an ID property in the cache object. This alphanumeric string is appended to the name of the container that holds all of the cached results and intermediate state for each document processed by this indexer, so the ID uniquely names the cache in Blob storage.

"cache": {
    "ID": "<ALPHA-NUMERIC STRING>",
    "enableReprocessing": true,
    "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=<YOUR-STORAGE-ACCOUNT>;AccountKey=<YOUR-STORAGE-KEY>;EndpointSuffix=core.chinacloudapi.cn"
}

Step 5: Run the indexer

To run the indexer, you can use the portal or the API. In the portal, from the indexers list, select the indexer and click Run. One advantage of using the portal is that you can monitor indexer status, note the duration of the job, and see how many documents are processed. Portal pages are refreshed every few minutes.

Alternatively, you can use REST to run the indexer:

POST https://[YOUR-SEARCH-SERVICE].search.azure.cn/indexers/[YOUR-INDEXER-NAME]/run?api-version=2019-05-06-Preview
Content-Type: application/json
api-key: [YOUR-ADMIN-KEY]

After the indexer runs, you can find the cache in Azure Blob storage. The container name is in the following format: ms-az-search-indexercache-<YOUR-CACHE-ID>
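
To locate the container programmatically, the name can be derived from the cache ID returned by GET Indexer. The following is a small sketch. The helper cache_container_name and the sample ID abc123 are made up for illustration, while the prefix and the ID property come from the format described above.

```python
CONTAINER_PREFIX = "ms-az-search-indexercache-"


def cache_container_name(indexer_definition):
    """Derive the cache container name from an indexer definition."""
    cache = indexer_definition.get("cache") or {}
    cache_id = cache.get("ID")
    if not cache_id:
        raise ValueError("indexer definition has no cache ID yet")
    return CONTAINER_PREFIX + cache_id


# "abc123" is a made-up ID; the service assigns the real one.
indexer = {"name": "my-indexer", "cache": {"ID": "abc123", "enableReprocessing": True}}
print(cache_container_name(indexer))
```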

Note

A reset and rerun of the indexer results in a full rebuild so that content can be cached. All cognitive enrichments will be rerun on all documents.

Step 6: Modify a skillset and confirm incremental enrichment

To modify a skillset, you can use the portal or the API. For example, if you are using text translation, a simple inline change from en to es or another language is sufficient for proof-of-concept testing of incremental enrichment.

Run the indexer again. Only the affected parts of the enriched document tree are updated. If you used the portal quickstart as a proof of concept and modified the text translation skill to 'es', you'll notice that only 8 documents are updated instead of the original 14. Image files unaffected by the translation process are reused from the cache.
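
The inline language change described in this step might look like the following when the skillset definition is handled as a Python dictionary. This sketch assumes the translation skill is identified by the type #Microsoft.Skills.Text.TranslationSkill and stores its target language in defaultToLanguageCode; verify both against your own skillset before relying on them. The helper name set_translation_target is hypothetical.

```python
import copy

# Assumed skill type and property name; check them against your skillset.
TRANSLATION_SKILL_TYPE = "#Microsoft.Skills.Text.TranslationSkill"


def set_translation_target(skillset, language_code):
    """Return an updated copy of the skillset and the number of skills changed."""
    updated = copy.deepcopy(skillset)
    changed = 0
    for skill in updated.get("skills", []):
        is_translation = skill.get("@odata.type") == TRANSLATION_SKILL_TYPE
        if is_translation and skill.get("defaultToLanguageCode") != language_code:
            skill["defaultToLanguageCode"] = language_code
            changed += 1
    return updated, changed


skillset = {
    "name": "my-skillset",
    "skills": [
        {"@odata.type": "#Microsoft.Skills.Vision.OcrSkill"},
        {"@odata.type": TRANSLATION_SKILL_TYPE, "defaultToLanguageCode": "en"},
    ],
}
updated_skillset, changed_count = set_translation_target(skillset, "es")
print(changed_count, updated_skillset["skills"][1]["defaultToLanguageCode"])
```

The updated dictionary would then be saved back to the service before the indexer is run again.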

Enable caching on new indexers

To set up incremental enrichment for a new indexer, include the cache property in the indexer definition payload when calling Create Indexer (2019-05-06-Preview).

{
    "name": "<YOUR-INDEXER-NAME>",
    "targetIndexName": "<YOUR-INDEX-NAME>",
    "dataSourceName": "<YOUR-DATASOURCE-NAME>",
    "skillsetName": "<YOUR-SKILLSET-NAME>",
    "cache" : {
        "storageConnectionString" : "<YOUR-STORAGE-ACCOUNT-CONNECTION-STRING>",
        "enableReprocessing": true
    },
    "fieldMappings" : [],
    "outputFieldMappings": [],
    "parameters": {}
}

Checking for cached output

The cache is created, used, and managed by the indexer, and its contents are not represented in a human-readable format. The best way to determine whether the cache is used is to run the indexer and compare before-and-after metrics for execution time and document counts.

For example, assume a skillset that starts with image analysis and Optical Character Recognition (OCR) of scanned documents, followed by downstream analysis of the resulting text. If you modify a downstream text skill, the indexer can retrieve all of the previously processed image and OCR content from the cache, updating and processing just the text-related changes indicated by your edits. You can expect to see fewer documents in the document count (for example, 8/8 as opposed to 14/14 in the original run), shorter execution times, and fewer charges on your bill.
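
A before-and-after comparison of this kind can be scripted once the metrics for two runs have been recorded. The sketch below assumes each run is summarized as a dictionary with itemsProcessed, startTime, and endTime fields, modeled on what an indexer status response typically reports. The field names, the helper names, and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime


def duration_seconds(run):
    """Elapsed seconds between a run's startTime and endTime (ISO 8601, UTC)."""
    start = datetime.fromisoformat(run["startTime"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(run["endTime"].replace("Z", "+00:00"))
    return (end - start).total_seconds()


def compare_runs(before, after):
    """Summarize how much less work the second run did than the first."""
    return {
        "documents_saved": before["itemsProcessed"] - after["itemsProcessed"],
        "seconds_saved": duration_seconds(before) - duration_seconds(after),
    }


# Made-up metrics echoing the 14-document and 8-document example above.
first_run = {
    "itemsProcessed": 14,
    "startTime": "2019-11-01T10:00:00Z",
    "endTime": "2019-11-01T10:05:00Z",
}
second_run = {
    "itemsProcessed": 8,
    "startTime": "2019-11-01T11:00:00Z",
    "endTime": "2019-11-01T11:01:30Z",
}
print(compare_runs(first_run, second_run))
```

Positive values for both figures are consistent with cached output being reused, although they are an indication rather than proof.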

Working with the cache

Once the cache is operational, indexers check the cache whenever Run Indexer is called, to see which parts of the existing output can be used.

The following table summarizes how various APIs relate to the cache:

API | Cache impact
Create Indexer (2019-05-06-Preview) | Creates and runs an indexer on first use, including creating a cache if your indexer definition specifies it.
Run Indexer | Executes an enrichment pipeline on demand. This API reads from the cache if it exists, or creates a cache if you added caching to an updated indexer definition. When you run an indexer that has caching enabled, the indexer omits steps if cached output can be used. You can use either the generally available or the preview version of this API.
Reset Indexer | Clears the indexer of any incremental indexing information. The next indexer run (either on demand or scheduled) is a full reprocessing from scratch, including rerunning all skills and rebuilding the cache. It is functionally equivalent to deleting the indexer and recreating it. You can use either the generally available or the preview version of this API.
Reset Skills (2019-05-06-Preview) | Specifies which skills to rerun on the next indexer run, even if you haven't modified any skills. The cache is updated accordingly. Outputs, such as a knowledge store or search index, are refreshed using reusable data from the cache plus new content per the updated skill.

For more information about controlling what happens to the cache, see Cache management.

Next steps

Incremental enrichment applies to indexers that contain skillsets. As a next step, visit the skillset documentation to understand concepts and composition.

Additionally, once you enable the cache, you will want to know about the parameters and APIs that factor into caching, including how to override or force particular behaviors. For more information, see the following links.