Incremental enrichment and caching in Azure Cognitive Search

Important

Incremental enrichment is currently in public preview. This preview version is provided without a service level agreement, and it's not recommended for production workloads. The REST API versions 2019-05-06-Preview and 2020-06-30-Preview provide this feature. There is no portal or .NET SDK support at this time.

Incremental enrichment is a feature that targets skillsets. It leverages Azure Storage to save the processing output emitted by an enrichment pipeline for reuse in future indexer runs. Wherever possible, the indexer reuses any cached output that is still valid.

Not only does incremental enrichment preserve your monetary investment in processing (in particular, OCR and image processing), but it also makes for a more efficient system. When structures and content are cached, an indexer can determine which skills have changed and run only those that have been modified, as well as any downstream dependent skills.

A workflow that uses incremental caching includes the following steps:

  1. Create or identify an Azure Storage account to store the cache.
  2. Enable incremental enrichment in the indexer.
  3. Create an indexer - plus a skillset - to invoke the pipeline. During processing, stages of enrichment are saved for each document in Blob storage for future use.
  4. Test your code, and after making changes, use Update Skillset to modify a definition.
  5. Run Indexer to invoke the pipeline, retrieving cached output for faster and more cost-effective processing (a sketch of this request follows the list).
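
A minimal sketch of the Run Indexer call in step 5, assuming the myIndexerName indexer from the configuration example below and the service endpoint used in the later request examples:

POST https://customerdemos.search.azure.cn/indexers/myIndexerName/run?api-version=2020-06-30-Preview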

For more information about steps and considerations when working with an existing indexer, see Set up incremental enrichment.

Indexer cache

Incremental enrichment adds a cache to the enrichment pipeline. The indexer caches the results from document cracking plus the outputs of each skill for every document. When a skillset is updated, only the skills that changed, and any downstream skills that depend on them, are rerun. The updated results are written to the cache, and the document is updated in the search index or the knowledge store.

Physically, the cache is stored in a blob container in your Azure Storage account. The cache also uses table storage for an internal record of processing updates. All indexers within a search service may share the same storage account for the indexer cache. Each indexer is assigned a unique and immutable cache identifier for the container it uses.

Cache configuration

You'll need to set the cache property on the indexer to start benefiting from incremental enrichment. The following example illustrates an indexer with caching enabled. Specific parts of this configuration are described in the following sections. For more information, see Set up incremental enrichment.

{
    "name": "myIndexerName",
    "targetIndexName": "myIndex",
    "dataSourceName": "myDatasource",
    "skillsetName": "mySkillset",
    "cache" : {
        "storageConnectionString" : "Your storage account connection string",
        "enableReprocessing": true
    },
    "fieldMappings" : [],
    "outputFieldMappings": [],
    "parameters": []
}

Setting this property on an existing indexer will require you to reset and rerun the indexer, which will result in all documents in your data source being processed again. This step is necessary to eliminate any documents enriched by previous versions of the skillset.
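
The reset and rerun can be issued as REST calls. A minimal sketch of the Reset Indexer request, assuming the myIndexerName indexer and the service endpoint used elsewhere in this article; follow it with the Run Indexer call shown earlier:

POST https://customerdemos.search.azure.cn/indexers/myIndexerName/reset?api-version=2020-06-30-Preview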

Cache management

The lifecycle of the cache is managed by the indexer. If the cache property on the indexer is set to null, or the connection string is changed, the existing cache is deleted on the next indexer run. The cache lifecycle is also tied to the indexer lifecycle. If an indexer is deleted, its associated cache is also deleted.
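
For example, one way to clear an existing cache is to update the indexer with the cache property set to null, which deletes the cache on the next run. A minimal sketch of such an update, assuming the indexer defined earlier:

PUT https://customerdemos.search.azure.cn/indexers/myIndexerName?api-version=2020-06-30-Preview

{
    "name": "myIndexerName",
    "targetIndexName": "myIndex",
    "dataSourceName": "myDatasource",
    "skillsetName": "mySkillset",
    "cache": null
}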

While incremental enrichment is designed to detect and respond to changes with no intervention on your part, there are parameters you can use to override the default behaviors:

  • Prioritize new documents
  • Bypass skillset checks
  • Bypass data source checks
  • Force skillset evaluation

Prioritize new documents

Set the enableReprocessing property to control processing over incoming documents already represented in the cache. When true (the default), documents already in the cache are reprocessed when you rerun the indexer, assuming your skill update affects those documents.

When false, existing documents are not reprocessed, effectively prioritizing new, incoming content over existing content. You should only set enableReprocessing to false on a temporary basis. To ensure consistency across the corpus, enableReprocessing should be true most of the time, ensuring that all documents, both new and existing, are valid per the current skillset definition.
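
A minimal sketch of a cache configuration that temporarily prioritizes new documents, reusing the placeholder connection string from the earlier indexer example:

"cache" : {
    "storageConnectionString" : "Your storage account connection string",
    "enableReprocessing": false
}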

Bypass skillset evaluation

Modifying a skillset and reprocessing of that skillset typically go hand in hand. However, some changes to a skillset should not result in reprocessing (for example, deploying a custom skill to a new location or with a new access key). Most likely, these are peripheral modifications that have no genuine impact on the substance of the skillset itself.

If you know that a change to the skillset is indeed superficial, you should override skillset evaluation by setting the disableCacheReprocessingChangeDetection parameter to true:

  1. Call Update Skillset and modify the skillset definition.
  2. Append the disableCacheReprocessingChangeDetection=true parameter on the request.
  3. Submit the change.

Setting this parameter ensures that only updates to the skillset definition are committed, and that the change isn't evaluated for its effects on the existing corpus.

The following example shows an Update Skillset request with the parameter:

PUT https://customerdemos.search.azure.cn/skillsets/callcenter-text-skillset?api-version=2020-06-30-Preview&disableCacheReprocessingChangeDetection=true

Bypass data source validation checks

Most changes to a data source definition will invalidate the cache. However, for scenarios where you know that a change should not invalidate the cache - such as changing a connection string or rotating the key on the storage account - append the ignoreResetRequirement parameter on the data source update. Setting this parameter to true allows the commit to go through without triggering a reset condition that would result in all objects being rebuilt and populated from scratch.

PUT https://customerdemos.search.azure.cn/datasources/callcenter-ds?api-version=2020-06-30-Preview&ignoreResetRequirement=true

Force skillset evaluation

The purpose of the cache is to avoid unnecessary processing, but suppose you make a change to a skill that the indexer doesn't detect (for example, changing something in external code, such as a custom skill).

In this case, you can use Reset Skills to force reprocessing of a particular skill, including any downstream skills that have a dependency on that skill's output. This API accepts a POST request with a list of skills that should be invalidated and marked for reprocessing. After Reset Skills, run the indexer to invoke the pipeline.
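
A minimal sketch of a Reset Skills request, assuming the preview resetskills endpoint and a hypothetical skill name ("#1") in the callcenter-text-skillset used in the other examples:

POST https://customerdemos.search.azure.cn/skillsets/callcenter-text-skillset/resetskills?api-version=2020-06-30-Preview

{
    "skillNames" : ["#1"]
}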

Change detection

Once you enable a cache, the indexer evaluates changes in your pipeline composition to determine which content can be reused and which needs reprocessing. This section enumerates changes that invalidate the cache outright, followed by changes that trigger incremental processing.

Changes that invalidate the cache

An invalidating change is one where the entire cache is no longer valid. An example of an invalidating change is one where your data source is updated. Here is the complete list of changes that would invalidate your cache:

  • Change to your data source type
  • Change to data source container
  • Data source credentials
  • Data source change detection policy
  • Data source delete detection policy
  • Indexer field mappings
  • Indexer parameters
    • Parsing Mode
    • Excluded File Name Extensions
    • Indexed File Name Extensions
    • Index storage metadata only for oversized documents
    • Delimited text headers
    • Delimited text delimiter
    • Document Root
    • Image Action (changes to how images are extracted)

Changes that trigger incremental processing

Incremental processing evaluates your skillset definition and determines which skills to rerun, selectively updating the affected portions of the document tree. Here is the complete list of changes resulting in incremental enrichment:

  • A skill in the skillset has a different type (the odata type of the skill is updated)
  • Skill-specific parameters are updated, for example the URL, defaults, or other parameters
  • Skill outputs change, that is, the skill returns additional or different outputs
  • Skill updates result in different ancestry, that is, skill chaining (skill inputs) has changed
  • Any upstream skill is invalidated, that is, a skill that provides an input to this skill is updated
  • Updates to the knowledge store projection location, resulting in reprojecting documents
  • Changes to the knowledge store projections, resulting in reprojecting documents
  • Output field mappings changed on an indexer, resulting in reprojecting documents to the index

API reference

REST API version 2020-06-30-Preview provides incremental enrichment through additional properties on indexers. Skillsets and data sources can use the generally available version. In addition to the reference documentation, see Configure caching for incremental enrichment for details on how to call the APIs.
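
Because the cache property is exposed through the preview API versions, a Get Indexer request issued with a preview version can be used to check the cache configuration. A minimal sketch, assuming the indexer and service names used earlier:

GET https://customerdemos.search.azure.cn/indexers/myIndexerName?api-version=2020-06-30-Preview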

Next steps

Incremental enrichment is a powerful feature that extends change tracking to skillsets and AI enrichment. Incremental enrichment enables reuse of existing processed content as you iterate over skillset design.

As a next step, enable caching on an existing indexer or add a cache when defining a new indexer.