Tips for AI enrichment in Azure Cognitive Search

This article contains a list of tips and tricks to keep you moving as you get started with AI enrichment capabilities in Azure Cognitive Search.

If you have not done so already, step through the Tutorial: Learn how to call AI enrichment APIs for practice in applying AI enrichments to a blob data source.

Tip 1: Start with a small dataset

The best way to find issues quickly is to increase the speed at which you can fix them, and the best way to reduce indexing time is to reduce the number of documents to be indexed.

Start by creating a data source with just a handful of documents or records. Your document sample should be a good representation of the variety of documents that will be indexed.

Run your document sample through the end-to-end pipeline and check that the results meet your needs. Once you are satisfied with the results, you can add more files to your data source.

Tip 2: Make sure your data source credentials are correct

The data source connection is not validated until you define an indexer that uses it. If you see errors indicating that the indexer cannot get to the data, make sure that:

  • Your connection string is correct. Especially when you are creating SAS tokens, make sure to use the format expected by Azure Cognitive Search. See the How to specify credentials section to learn about the different formats supported.
  • The container name in the indexer is correct.
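As a point of reference, a blob data source definition follows the general shape below. The names and placeholder values here are illustrative, not from this article; SAS-based connection strings must likewise follow the format described in the How to specify credentials section:

  {
    "name": "my-blob-datasource",
    "type": "azureblob",
    "credentials": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;"
    },
    "container": { "name": "my-container" }
  }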

Tip 3: See what works even if there are some failures

Sometimes a small failure stops an indexer in its tracks. That is fine if you plan to fix issues one by one. However, you might want to ignore a particular type of error, allowing the indexer to continue so that you can see which flows are actually working.

In that case, you may want to tell the indexer to ignore errors. Do that by setting maxFailedItems and maxFailedItemsPerBatch to -1 as part of the indexer definition:

  {
    // rest of your indexer definition
    "parameters": {
      "maxFailedItems": -1,
      "maxFailedItemsPerBatch": -1
    }
  }

Tip 4: Looking at enriched documents under the hood

Enriched documents are temporary structures created during enrichment, and then deleted when processing is complete.

To capture a snapshot of the enriched document created during indexing, add a field called enriched to your index. The indexer automatically dumps a string representation of all the enrichments for that document into that field.

The enriched field will contain a string that is a logical representation of the in-memory enriched document in JSON. The field value is a valid JSON document, however. Quotes are escaped, so you'll need to replace \" with " in order to view the document as formatted JSON.
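The un-escaping step can be scripted. This is a minimal sketch rather than an official tool, and the sample string is made up for illustration:

```python
import json

def unescape_enriched(raw: str) -> dict:
    """Replace the escaped \\" sequences with plain quotes and parse
    the result, so the enriched snapshot can be inspected as JSON."""
    return json.loads(raw.replace('\\"', '"'))

# Hypothetical value of an 'enriched' field, with escaped quotes.
sample = '{\\"text\\": \\"hello\\", \\"language\\": \\"en\\"}'
doc = unescape_enriched(sample)
```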

The enriched field is intended for debugging purposes only, to help you understand the logical shape of the content that expressions are evaluated against. You should not depend on this field for indexing purposes.

Add an enriched field as part of your index definition for debugging purposes:

Request Body Syntax

  "fields": [
    // other fields go here
    {
      "name": "enriched",
      "type": "Edm.String",
      "searchable": false,
      "sortable": false,
      "filterable": false,
      "facetable": false
    }
  ]

Tip 5: Expected content fails to appear

Missing content could be the result of documents getting dropped during indexing. Free and Basic tiers have low limits on document size, and any file exceeding the limit is dropped during indexing. You can check for dropped documents in the Azure portal: in the search service dashboard, double-click the Indexers tile and review the ratio of documents successfully indexed. If it is not 100%, you can click the ratio to get more detail.
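Outside the portal, the same ratio can be computed from the response of the Get Indexer Status REST API. This sketch assumes the execution result carries itemsProcessed and itemsFailed counters:

```python
def success_ratio(execution_result: dict) -> float:
    """Fraction of items successfully indexed in one indexer run.

    The dict is assumed to have the shape of an indexer execution
    result, with 'itemsProcessed' / 'itemsFailed' counters."""
    processed = execution_result.get("itemsProcessed", 0)
    failed = execution_result.get("itemsFailed", 0)
    if processed == 0:
        return 1.0  # nothing was attempted, so nothing was dropped
    return (processed - failed) / processed

ratio = success_ratio({"itemsProcessed": 10, "itemsFailed": 2})  # 0.8
```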

If the problem is related to file size, you might see an error like this: "The blob '<file-name>' has the size of <file-size> bytes, which exceeds the maximum size for document extraction for your current service tier." For more information on indexer limits, see Service limits.

A second reason for content failing to appear might be input/output mapping errors. For example, an output target name is "People" but the index field name is the lower-case "people". The system could return 201 success messages for the entire pipeline, so you think indexing succeeded, when in fact a field is empty.
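A quick way to catch this class of mistake before indexing is to compare output target names against index field names case-insensitively. This is an illustrative helper, not part of any Azure SDK:

```python
def find_case_mismatches(output_names, index_field_names):
    """Return (output_name, index_field) pairs that differ only in case.

    Such pairs can silently produce empty index fields even though
    the pipeline reports success."""
    exact = set(index_field_names)
    by_lower = {f.lower(): f for f in index_field_names}
    return [
        (name, by_lower[name.lower()])
        for name in output_names
        if name not in exact and name.lower() in by_lower
    ]

# "People" vs. "people" would be flagged; "keyPhrases" matches exactly.
mismatches = find_case_mismatches(["People", "keyPhrases"], ["people", "keyPhrases"])
```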

Tip 6: Extend processing beyond maximum run time (24-hour window)

Image analysis is computationally intensive for even simple cases, so when images are especially large or complex, processing times can exceed the maximum time allowed.

Maximum run time varies by tier: several minutes on the Free tier, 24 hours of indexing on billable tiers. If on-demand processing fails to complete within the 24-hour window, switch to a schedule to have the indexer pick up processing where it left off.

For scheduled indexers, indexing resumes on schedule at the last known good document. By using a recurring schedule, the indexer can work its way through the image backlog over a series of hours or days, until all unprocessed images are processed. For more information on schedule syntax, see Step 3: Create an indexer, or see How to schedule indexers for Azure Cognitive Search.
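For instance, a recurring schedule added to the indexer definition might look like the following (the two-hour interval and start time are illustrative values, not recommendations):

  "schedule": {
    "interval": "PT2H",
    "startTime": "2021-01-01T00:00:00Z"
  }

Each scheduled run then resumes at the last known good document until the backlog is cleared.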


If an indexer is set to a certain schedule but repeatedly fails on the same document each time it runs, the indexer will begin running on a less frequent interval (up to a maximum of at least once every 24 hours) until it successfully makes progress again. If you believe you have fixed whatever issue was causing the indexer to be stuck, you can perform an on-demand run of the indexer, and if that successfully makes progress, the indexer will return to its set schedule interval.

For portal-based indexing (as described in the quickstart), choosing the "run once" indexer option limits processing to 1 hour ("maxRunTime": "PT1H"). You might want to extend the processing window to something longer.

Tip 7: Increase indexing throughput

For parallel indexing, place your data into multiple containers, or into multiple virtual folders inside the same container, and then create multiple data source and indexer pairs. All the indexers can use the same skillset and write into the same target search index, so your search app doesn't need to be aware of this partitioning. For more information, see Indexing Large Datasets.
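The fan-out described above can be sketched in code. This illustrative helper (all names are hypothetical) builds the request bodies for n data source / indexer pairs sharing one skillset and one target index; actually creating them would still require calls to the REST API or an SDK:

```python
def build_partitioned_pairs(n: int, skillset_name: str, index_name: str):
    """Build n (data source, indexer) definition pairs, one per
    container partition, all writing to the same index through
    the same skillset."""
    pairs = []
    for i in range(n):
        datasource = {
            "name": f"blobs-part{i}",
            "type": "azureblob",
            "container": {"name": f"container-{i}"},
        }
        indexer = {
            "name": f"indexer-part{i}",
            "dataSourceName": datasource["name"],
            "skillsetName": skillset_name,
            "targetIndexName": index_name,
        }
        pairs.append((datasource, indexer))
    return pairs

pairs = build_partitioned_pairs(3, "my-skillset", "my-index")
```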

See also