Troubleshooting common indexer issues in Azure Cognitive Search

Indexers can run into a number of issues when indexing data into Azure Cognitive Search. The main categories of failure are connection errors, document processing errors, and index errors.

Connection errors


Indexers have limited support for accessing data sources and other resources that are secured by Azure network security mechanisms. Currently, indexers can only access data sources through the corresponding IP address range restriction mechanisms or, where applicable, NSG rules. Details for accessing each supported data source can be found below.

You can find the IP address of your search service by pinging its fully qualified domain name (for example, <your-search-service-name>.search.azure.cn).

You can find the IP address range of the AzureCognitiveSearch service tag either in the downloadable JSON files or through the Service Tag Discovery API. The IP address range is updated weekly.
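As a rough illustration of working with the downloadable JSON file, the sketch below pulls the address prefixes for one service tag and checks whether a given IP address falls inside them. It assumes the file has a top-level "values" list whose entries carry "name" and "properties.addressPrefixes"; verify this against the file you actually download. The function names and the sample addresses (taken from documentation-only ranges) are hypothetical.

```python
import ipaddress
from typing import List


def service_tag_prefixes(service_tags: dict, tag_name: str = "AzureCognitiveSearch") -> List[str]:
    """Return the address prefixes listed for one service tag.

    Assumes the layout {"values": [{"name": ..., "properties": {"addressPrefixes": [...]}}]}.
    """
    for entry in service_tags.get("values", []):
        if entry.get("name") == tag_name:
            return list(entry.get("properties", {}).get("addressPrefixes", []))
    return []


def ip_in_prefixes(ip: str, prefixes: List[str]) -> bool:
    """Return True if the IP address belongs to any of the CIDR prefixes."""
    address = ipaddress.ip_address(ip)
    for prefix in prefixes:
        network = ipaddress.ip_network(prefix, strict=False)
        # Only compare addresses and networks of the same IP version.
        if network.version == address.version and address in network:
            return True
    return False


# Made-up sample data; in practice, load the downloaded file with json.load().
sample_tags = {
    "values": [
        {"name": "AzureCognitiveSearch",
         "properties": {"addressPrefixes": ["203.0.113.0/24"]}},
    ]
}
prefixes = service_tag_prefixes(sample_tags)
print(ip_in_prefixes("203.0.113.7", prefixes))
```

Because the ranges are updated weekly, a check like this is only as current as the file it reads.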

Configure firewall rules

Azure Storage, Cosmos DB, and Azure SQL provide a configurable firewall. There's no specific error message when the firewall is enabled. Typically, firewall errors are generic and look like "The remote server returned an error: (403) Forbidden" or "Credentials provided in the connection string are invalid or have expired".

There are two options for allowing indexers to access these resources in such an instance:

  • Disable the firewall by allowing access from All Networks (if feasible).
  • Alternatively, allow access for the IP address of your search service and the IP address range of the AzureCognitiveSearch service tag in the firewall rules of your resource (IP address range restriction).

Details for configuring IP address range restrictions for each data source type can be found from the following links:

Limitation: As stated in the documentation above for Azure Storage, IP address range restrictions only work if your search service and your storage account are in different regions.

Azure Functions (which can be used as a Custom Web API skill) also support IP address restrictions. The list of IP addresses to configure is the IP address of your search service and the IP address range of the AzureCognitiveSearch service tag.

Details for accessing data in SQL Server on an Azure VM are outlined here.

Configure network security group (NSG) rules

When accessing data in a SQL managed instance, or when an Azure VM is used as the web service URI for a Custom Web API skill, customers need not be concerned with specific IP addresses.

In such cases, the Azure VM or the SQL managed instance can be configured to reside within a virtual network. A network security group can then be configured to filter the type of network traffic that can flow in and out of the virtual network subnets and network interfaces.

The AzureCognitiveSearch service tag can be used directly in the inbound NSG rules without needing to look up its IP address range.

More details for accessing data in a SQL managed instance are outlined here.

Cosmos DB "Indexing" isn't enabled

Azure Cognitive Search has an implicit dependency on Cosmos DB indexing. If you turn off automatic indexing in Cosmos DB, Azure Cognitive Search returns a successful state but fails to index container contents. For instructions on how to check settings and turn on indexing, see Manage indexing in Azure Cosmos DB.

Document processing errors

Unprocessable or unsupported documents

The blob indexer documents which document formats are explicitly supported. Sometimes a blob storage container contains unsupported documents. Other times there may be problematic documents. You can keep the indexer from stopping on these documents by changing configuration options:

PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false, "failOnUnprocessableDocument" : false } }
}
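If you build the request body programmatically, the two settings can be merged into an existing indexer definition before it is sent back with PUT. The following is a minimal sketch, assuming the indexer definition is already available as a Python dictionary; the helper name and the sample definition are hypothetical, and only the two configuration keys come from the request above.

```python
import copy
import json


def with_lenient_blob_settings(indexer_definition: dict) -> dict:
    """Return a copy of the definition that skips unsupported or unprocessable blobs."""
    updated = copy.deepcopy(indexer_definition)
    # "parameters" or "configuration" may be missing or null in a definition.
    parameters = updated.get("parameters") or {}
    configuration = parameters.get("configuration") or {}
    configuration["failOnUnsupportedContentType"] = False
    configuration["failOnUnprocessableDocument"] = False
    parameters["configuration"] = configuration
    updated["parameters"] = parameters
    return updated


# Hypothetical definition; a real one carries the other parts of the indexer definition.
definition = {"name": "my-indexer", "parameters": None}
body = json.dumps(with_lenient_blob_settings(definition), indent=2)
print(body)  # JSON text to use as the PUT request body
```

Working on a copy leaves the original definition untouched, and any configuration values already present are preserved.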

Missing document content

The blob indexer finds and extracts text from blobs in a container. Some problems with extracting text include:

  • The document only contains scanned images. PDF blobs that have non-text content, such as scanned images (JPGs), don't produce results in a standard blob indexing pipeline. If you have image content with text elements, you can use cognitive search to find and extract the text.
  • The blob indexer is configured to only index metadata. To extract content, the blob indexer must be configured to extract both content and metadata:
PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "dataToExtract" : "contentAndMetadata" } }
}

Index errors

Missing documents

Indexers find documents from a data source. Sometimes a document from the data source that should have been indexed appears to be missing from an index. There are several common reasons this may happen:

  • The document hasn't been indexed. Check the portal for a successful indexer run.
  • Check your change tracking value. If your high watermark value is a date set to a future time, the indexer skips any documents with a date earlier than that value. You can inspect your indexer's change tracking state using the 'initialTrackingState' and 'finalTrackingState' fields in the indexer status.
  • The document was updated after the indexer run. If your indexer is on a schedule, it will eventually rerun and pick up the document.
  • The query specified in the data source excludes the document. Indexers can't index documents that aren't part of the data source.
  • Field mappings or AI enrichment have changed the document, and it looks different than you expect.
  • Use the lookup document API to find your document.
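To check whether a specific document made it into the index, you can request it by key. The sketch below only builds the request URL; it assumes the Lookup Document URL shape of /indexes/[index name]/docs/[key] on the same host and API version used in the requests above. The function name and sample values are hypothetical, and the request itself still needs an api-key header.

```python
from urllib.parse import quote


def lookup_document_url(service_name: str, index_name: str, document_key: str,
                        api_version: str = "2020-06-30") -> str:
    """Build the GET URL for looking up one document by its key."""
    # Percent-encode the key so characters such as "/" or spaces stay in one path segment.
    encoded_key = quote(document_key, safe="")
    return (
        f"https://{service_name}.search.azure.cn"
        f"/indexes/{index_name}/docs/{encoded_key}"
        f"?api-version={api_version}"
    )


# Hypothetical service, index, and key names.
print(lookup_document_url("my-service", "my-index", "doc 1"))
```

A not-found response for a key you expect to exist suggests the document was never indexed, or that field mappings changed the key's value.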