How to configure a blob indexer in Azure Cognitive Search

This article shows you how to use Azure Cognitive Search to index text-based documents (such as PDFs, Microsoft Office documents, and several other common formats) stored in Azure Blob storage. First, it explains the basics of setting up and configuring a blob indexer. Then, it offers a deeper exploration of behaviors and scenarios you are likely to encounter.

Supported formats

The blob indexer can extract text from the document formats listed in the Content type-specific metadata properties section at the end of this article.

Set up blob indexing

You can set up an Azure Blob Storage indexer using:

  • the Azure portal
  • the Azure Cognitive Search REST API
  • the Azure Cognitive Search .NET SDK

Note

Some features (for example, field mappings) are not yet available in the portal, and have to be used programmatically.

Here, we demonstrate the flow using the REST API.

Step 1: Create a data source

A data source specifies which data to index, credentials needed to access the data, and policies to efficiently identify changes in the data (new, modified, or deleted rows). A data source can be used by multiple indexers in the same search service.

For blob indexing, the data source must have the following required properties:

  • name is the unique name of the data source within your search service.
  • type must be azureblob.
  • credentials provide the storage account connection string as the credentials.connectionString parameter. See How to specify credentials below for details.
  • container specifies a container in your storage account. By default, all blobs within the container are retrievable. If you only want to index blobs in a particular virtual directory, you can specify that directory using the optional query parameter.

To create a data source:

    POST https://[service name].search.azure.cn/datasources?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
        "container" : { "name" : "my-container", "query" : "<optional-virtual-directory-name>" }
    }

For more on the Create Datasource API, see Create Datasource.

How to specify credentials

You can provide the credentials for the blob container in one of these ways:

  • Managed identity connection string: ResourceId=/subscriptions/<your subscription ID>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>/;

  • Full access storage account connection string: DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key>

    You can get the connection string from the Azure portal by navigating to the storage account blade > Settings > Keys (for Classic storage accounts) or Settings > Access keys (for Azure Resource Manager storage accounts).

  • Storage account shared access signature (SAS) connection string: BlobEndpoint=https://<your account>.blob.chinacloudsites.cn/;SharedAccessSignature=?sv=2016-05-31&sig=<the signature>&spr=https&se=<the validity end time>&srt=co&ss=b&sp=rl

    The SAS should have the list and read permissions on containers and objects (blobs in this case).

  • Container shared access signature: ContainerSharedAccessUri=https://<your storage account>.blob.chinacloudsites.cn/<container name>?sv=2016-05-31&sr=c&sig=<the signature>&se=<the validity end time>&sp=rl

    The SAS should have the list and read permissions on the container.

For more information on storage shared access signatures, see Using Shared Access Signatures.
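For example, here's a minimal sketch of a data source that authenticates with a managed identity connection string instead of an account key (the name blob-datasource-msi is just an example, and this assumes the search service's identity has already been granted read access to the storage account):

    POST https://[service name].search.azure.cn/datasources?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "blob-datasource-msi",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "ResourceId=/subscriptions/<your subscription ID>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>/;" },
        "container" : { "name" : "my-container" }
    }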

Note

If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer will fail with an error message similar to "Credentials provided in the connection string are invalid or have expired."
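To rotate a SAS, you can update the existing data source in place with a renewed connection string; a minimal sketch using the Update Data Source call (the full definition is resubmitted):

    PUT https://[service name].search.azure.cn/datasources/blob-datasource?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "BlobEndpoint=https://<your account>.blob.chinacloudsites.cn/;SharedAccessSignature=?<your renewed SAS>" },
        "container" : { "name" : "my-container", "query" : "<optional-virtual-directory-name>" }
    }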

Step 2: Create an index

The index specifies the fields in a document, attributes, and other constructs that shape the search experience.

Here's how to create an index with a searchable content field to store the text extracted from blobs:

    POST https://[service name].search.azure.cn/indexes?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
          "name" : "my-target-index",
          "fields": [
            { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
            { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
          ]
    }

For more information, see Create Index (REST API).

Step 3: Create an indexer

An indexer connects a data source with a target search index, and provides a schedule to automate the data refresh.

Once the index and data source have been created, you're ready to create the indexer:

    POST https://[service name].search.azure.cn/indexers?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "name" : "blob-indexer",
      "dataSourceName" : "blob-datasource",
      "targetIndexName" : "my-target-index",
      "schedule" : { "interval" : "PT2H" }
    }

This indexer will run every two hours (schedule interval is set to "PT2H"). To run an indexer every 30 minutes, set the interval to "PT30M". The shortest supported interval is 5 minutes. The schedule is optional - if omitted, an indexer runs only once when it's created. However, you can run an indexer on-demand at any time.
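For example, here's a sketch of running the indexer on demand and then checking its execution history, which is also where warnings (such as truncation warnings) are reported:

    POST https://[service name].search.azure.cn/indexers/blob-indexer/run?api-version=2020-06-30
    api-key: [admin key]

    GET https://[service name].search.azure.cn/indexers/blob-indexer/status?api-version=2020-06-30
    api-key: [admin key]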

For more information, see Create Indexer (REST API). For more information about defining indexer schedules, see How to schedule indexers for Azure Cognitive Search.

How blobs are indexed

Depending on the indexer configuration, the blob indexer can index storage metadata only (useful when you only care about the metadata and don't need to index the content of blobs), storage and content metadata, or both metadata and textual content. By default, the indexer extracts both metadata and content.

Note

By default, blobs with structured content such as JSON or CSV are indexed as a single chunk of text. If you want to index JSON and CSV blobs in a structured way, see Indexing JSON blobs and Indexing CSV blobs for more information.

A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file will be returned in the normalized_images field.

  • The textual content of the document is extracted into a string field named content.

    Note

    Azure Cognitive Search limits how much text it extracts depending on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, 4 million for Standard, 8 million for Standard S2, and 16 million for Standard S3. A warning is included in the indexer status response for truncated documents.

  • User-specified metadata properties present on the blob, if any, are extracted verbatim. Note that this requires a field to be defined in the index with the same name as the metadata key of the blob. For example, if your blob has a metadata key of Sensitivity with value High, you should define a field named Sensitivity in your search index and it will be populated with the value High (see the index sketch after this list).

  • Standard blob metadata properties are extracted into the following fields:

    • metadata_storage_name (Edm.String) - the file name of the blob. For example, if you have a blob /my-container/my-folder/subfolder/resume.pdf, the value of this field is resume.pdf.

    • metadata_storage_path (Edm.String) - the full URI of the blob, including the storage account. For example: https://myaccount.blob.chinacloudsites.cn/my-container/my-folder/subfolder/resume.pdf

    • metadata_storage_content_type (Edm.String) - content type as specified by the code you used to upload the blob. For example, application/octet-stream.

    • metadata_storage_last_modified (Edm.DateTimeOffset) - last modified timestamp for the blob. Azure Cognitive Search uses this timestamp to identify changed blobs, to avoid reindexing everything after the initial indexing.

    • metadata_storage_size (Edm.Int64) - blob size in bytes.

    • metadata_storage_content_md5 (Edm.String) - MD5 hash of the blob content, if available.

    • metadata_storage_sas_token (Edm.String) - a temporary SAS token that can be used by custom skills to get access to the blob. This token should not be stored for later use as it might expire.

  • Metadata properties specific to each document format are extracted into the fields listed here.

You don't need to define fields for all of the above properties in your search index - just capture the properties you need for your application.
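For example, here's a sketch of an index that extends the one from step 2 with the Sensitivity metadata property mentioned above plus two of the standard storage properties; because the field names match the source property names, no field mappings are needed for them:

    POST https://[service name].search.azure.cn/indexes?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
          "name" : "my-target-index",
          "fields": [
            { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
            { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
            { "name": "Sensitivity", "type": "Edm.String", "searchable": false, "filterable": true },
            { "name": "metadata_storage_name", "type": "Edm.String", "searchable": true },
            { "name": "metadata_storage_size", "type": "Edm.Int64", "filterable": true, "sortable": true }
          ]
    }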

Note

Often, the field names in your existing index will be different from the field names generated during document extraction. You can use field mappings to map the property names provided by Azure Cognitive Search to the field names in your search index. An example of using field mappings appears below.

Defining document keys and field mappings

In Azure Cognitive Search, the document key uniquely identifies a document. Every search index must have exactly one key field of type Edm.String. The key field is required for each document that is being added to the index (it is actually the only required field).

You should carefully consider which extracted field should map to the key field for your index. The candidates are:

  • metadata_storage_name - this might be a convenient candidate, but note that 1) the names might not be unique, as you may have blobs with the same name in different folders, and 2) the name may contain characters that are invalid in document keys, such as dashes. You can deal with invalid characters by using the base64Encode field mapping function - if you do this, remember to encode document keys when passing them in API calls such as Lookup. (For example, in .NET you can use the UrlTokenEncode method for that purpose).

  • metadata_storage_path - using the full path ensures uniqueness, but the path definitely contains / characters that are invalid in a document key. As above, you have the option of encoding the keys using the base64Encode function.

  • A third option is to add a custom metadata property to the blobs. This option does, however, require that your blob upload process adds that metadata property to all blobs. Since the key is a required property, all blobs that don't have that property will fail to be indexed.

Important

If there is no explicit mapping for the key field in the index, Azure Cognitive Search automatically uses metadata_storage_path as the key and base-64 encodes key values (the second option above).

For this example, let's pick the metadata_storage_name field as the document key. Let's also assume your index has a key field named key and a field fileSize for storing the document size. To wire things up as desired, specify the following field mappings when creating or updating your indexer:

    "fieldMappings" : [
      { "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
      { "sourceFieldName" : "metadata_storage_size", "targetFieldName" : "fileSize" }
    ]

To bring this all together, here's how you can add field mappings and enable base-64 encoding of keys for an existing indexer:

    PUT https://[service name].search.azure.cn/indexers/blob-indexer?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "dataSourceName" : " blob-datasource ",
      "targetIndexName" : "my-target-index",
      "schedule" : { "interval" : "PT2H" },
      "fieldMappings" : [
        { "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
        { "sourceFieldName" : "metadata_storage_size", "targetFieldName" : "fileSize" }
      ]
    }

For more information, see Field mappings and transformations.

What if you need to encode a field to use it as a key, but you also want to search it?

There are times when you need to use an encoded version of a field like metadata_storage_path as the key, but you also need that field to be searchable (without encoding). In order to resolve this issue, you can map it into two fields: one that will be used for the key, and another one that will be used for search purposes. In the example below, the key field contains the encoded path, while the path field is not encoded and will be used as the searchable field in the index.

    PUT https://[service name].search.azure.cn/indexers/blob-indexer?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "dataSourceName" : " blob-datasource ",
      "targetIndexName" : "my-target-index",
      "schedule" : { "interval" : "PT2H" },
      "fieldMappings" : [
        { "sourceFieldName" : "metadata_storage_path", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
        { "sourceFieldName" : "metadata_storage_path", "targetFieldName" : "path" }
      ]
    }

Index by file type

You can control which blobs are indexed, and which are skipped.

Include blobs having specific file extensions

You can index only the blobs with the file name extensions you specify by using the indexedFileNameExtensions indexer configuration parameter. The value is a string containing a comma-separated list of file extensions (with a leading dot). For example, to index only the .PDF and .DOCX blobs, do this:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "indexedFileNameExtensions" : ".pdf,.docx" } }
    }

Exclude blobs having specific file extensions

You can exclude blobs with specific file name extensions from indexing by using the excludedFileNameExtensions configuration parameter. The value is a string containing a comma-separated list of file extensions (with a leading dot). For example, to index all blobs except those with the .PNG and .JPEG extensions, do this:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "excludedFileNameExtensions" : ".png,.jpeg" } }
    }

If both indexedFileNameExtensions and excludedFileNameExtensions parameters are present, Azure Cognitive Search first looks at indexedFileNameExtensions, then at excludedFileNameExtensions. This means that if the same file extension is present in both lists, it will be excluded from indexing.
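For example, a sketch combining both parameters: with this configuration, .pdf and .docx blobs are indexed, while .png blobs are excluded even though .png also appears in the indexed list:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "indexedFileNameExtensions" : ".pdf,.docx,.png", "excludedFileNameExtensions" : ".png" } }
    }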

Index parts of a blob

You can control which parts of the blobs are indexed using the dataToExtract configuration parameter. It can take the following values: storageMetadata (index only the standard blob properties and user-specified metadata), allMetadata (also extract metadata provided by the content type, but skip the textual content), and contentAndMetadata (extract all metadata and the textual content; this is the default).

For example, to index only the storage metadata, use:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "dataToExtract" : "storageMetadata" } }
    }

Using blob metadata to control how blobs are indexed

The configuration parameters described above apply to all blobs. Sometimes, you may want to control how individual blobs are indexed. You can do this by adding the following blob metadata properties and values:

  • AzureSearch_Skip set to "true" - instructs the blob indexer to completely skip the blob. Neither metadata nor content extraction is attempted. This is useful when a particular blob fails repeatedly and interrupts the indexing process.

  • AzureSearch_SkipContent set to "true" - the equivalent of the "dataToExtract" : "allMetadata" setting described above, scoped to a particular blob.
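These properties are ordinary blob metadata, so you can set them with any Blob storage tool. As a minimal sketch, a Set Blob Metadata request against the storage endpoint could look like this (authentication and version headers omitted; problem-file.pdf is a hypothetical blob name, and note that this operation replaces any existing metadata on the blob):

    PUT https://<your storage account>.blob.chinacloudsites.cn/my-container/problem-file.pdf?comp=metadata
    x-ms-meta-AzureSearch_Skip: true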

Index from multiple sources

You may want to "assemble" documents from multiple sources in your index. For example, you may want to merge text from blobs with other metadata stored in Cosmos DB. You can even use the push indexing API together with various indexers to build up search documents from multiple parts.

For this to work, all indexers and other components need to agree on the document key. For additional details on this topic, refer to Index multiple Azure data sources or this blog post, Combine documents with other data in Azure Cognitive Search.
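As a sketch, once the blob indexer has created a document, another process could push extra fields into that same document through the Index Documents API, addressing it by the shared key (here a hypothetical category field, with the key equal to the base64-encoded metadata_storage_path as in the field-mapping examples above):

    POST https://[service name].search.azure.cn/indexes/my-target-index/docs/index?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "value" : [
        {
          "@search.action" : "mergeOrUpload",
          "key" : "<base64-encoded metadata_storage_path>",
          "category" : "contracts"
        }
      ]
    }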

Index large datasets

Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to process the data in parallel. Here's how you can set this up:

  • Partition your data into multiple blob containers or virtual folders

  • Set up several Azure Cognitive Search data sources, one per container or folder. To point to a blob folder, use the query parameter:

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : "my-folder" }
    }
    
  • Create a corresponding indexer for each data source. All the indexers can point to the same target search index (see the sketch after this list).

  • One search unit in your service can run one indexer at any given time. Creating multiple indexers as described above is only useful if they actually run in parallel. To run multiple indexers in parallel, scale out your search service by creating an appropriate number of partitions and replicas. For example, if your search service has 6 search units (for example, 2 partitions x 3 replicas), then 6 indexers can run simultaneously, resulting in a six-fold increase in the indexing throughput. To learn more about scaling and capacity planning, see Adjust the capacity of an Azure Cognitive Search service.
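For example, a second folder could get its own data source and indexer (the names blob-datasource-2, my-folder-2, and blob-indexer-2 are hypothetical), with both indexers writing into the same my-target-index; a minimal sketch, with the Content-Type and api-key headers omitted for brevity:

    POST https://[service name].search.azure.cn/datasources?api-version=2020-06-30

    {
        "name" : "blob-datasource-2",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : "my-folder-2" }
    }

    POST https://[service name].search.azure.cn/indexers?api-version=2020-06-30

    {
        "name" : "blob-indexer-2",
        "dataSourceName" : "blob-datasource-2",
        "targetIndexName" : "my-target-index",
        "schedule" : { "interval" : "PT2H" }
    }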

Handle errors

By default, the blob indexer stops as soon as it encounters a blob with an unsupported content type (for example, an image). You can of course use the excludedFileNameExtensions parameter to skip certain content types. However, you may need to index blobs without knowing all the possible content types in advance. To continue indexing when an unsupported content type is encountered, set the failOnUnsupportedContentType configuration parameter to false:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
    }

For some blobs, Azure Cognitive Search is unable to determine the content type, or unable to process a document of an otherwise supported content type. To ignore this failure mode, set the failOnUnprocessableDocument configuration parameter to false:

      "parameters" : { "configuration" : { "failOnUnprocessableDocument" : false } }

Azure Cognitive Search limits the size of blobs that are indexed. These limits are documented in Service Limits in Azure Cognitive Search. Oversized blobs are treated as errors by default. However, you can still index the storage metadata of oversized blobs if you set the indexStorageMetadataOnlyForOversizedDocuments configuration parameter to true:

    "parameters" : { "configuration" : { "indexStorageMetadataOnlyForOversizedDocuments" : true } }

You can also continue indexing if errors happen at any point of processing, either while parsing blobs or while adding documents to an index. To ignore a specific number of errors, set the maxFailedItems and maxFailedItemsPerBatch configuration parameters to the desired values. For example:

    {
      ... other parts of indexer definition
      "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
    }

Content type-specific metadata properties

The following summarizes the processing done for each document format and describes the metadata properties extracted by Azure Cognitive Search.

Each entry lists the document format / content type, the extracted metadata, and the processing details:

  • HTML (text/html) - extracted metadata: metadata_content_encoding, metadata_content_type, metadata_language, metadata_description, metadata_keywords, metadata_title. Processing: strip HTML markup and extract text.

  • PDF (application/pdf) - extracted metadata: metadata_content_type, metadata_language, metadata_author, metadata_title. Processing: extract text, including embedded documents (excluding images).

  • DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) - extracted metadata: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count. Processing: extract text, including embedded documents.

  • DOC (application/msword) - extracted metadata: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count. Processing: extract text, including embedded documents.

  • DOCM (application/vnd.ms-word.document.macroenabled.12) - extracted metadata: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count. Processing: extract text, including embedded documents.

  • WORD XML (application/vnd.ms-word2006ml) - extracted metadata: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count. Processing: strip XML markup and extract text.

  • WORD 2003 XML (application/vnd.ms-wordml) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date. Processing: strip XML markup and extract text.

  • XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified. Processing: extract text, including embedded documents.

  • XLS (application/vnd.ms-excel) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified. Processing: extract text, including embedded documents.

  • XLSM (application/vnd.ms-excel.sheet.macroenabled.12) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified. Processing: extract text, including embedded documents.

  • PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title. Processing: extract text, including embedded documents.

  • PPT (application/vnd.ms-powerpoint) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title. Processing: extract text, including embedded documents.

  • PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title. Processing: extract text, including embedded documents.

  • MSG (application/vnd.ms-outlook) - extracted metadata: metadata_content_type, metadata_message_from, metadata_message_from_email, metadata_message_to, metadata_message_to_email, metadata_message_cc, metadata_message_cc_email, metadata_message_bcc, metadata_message_bcc_email, metadata_creation_date, metadata_last_modified, metadata_subject. Processing: extract text, including text extracted from attachments. metadata_message_to_email, metadata_message_cc_email, and metadata_message_bcc_email are string collections; the rest of the fields are strings.

  • ODT (application/vnd.oasis.opendocument.text) - extracted metadata: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count. Processing: extract text, including embedded documents.

  • ODS (application/vnd.oasis.opendocument.spreadsheet) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified. Processing: extract text, including embedded documents.

  • ODP (application/vnd.oasis.opendocument.presentation) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, title. Processing: extract text, including embedded documents.

  • ZIP (application/zip) - extracted metadata: metadata_content_type. Processing: extract text from all documents in the archive.

  • GZ (application/gzip) - extracted metadata: metadata_content_type. Processing: extract text from all documents in the archive.

  • EPUB (application/epub+zip) - extracted metadata: metadata_content_type, metadata_author, metadata_creation_date, metadata_title, metadata_description, metadata_language, metadata_keywords, metadata_identifier, metadata_publisher. Processing: extract text from all documents in the archive.

  • XML (application/xml) - extracted metadata: metadata_content_type, metadata_content_encoding. Processing: strip XML markup and extract text.

  • JSON (application/json) - extracted metadata: metadata_content_type, metadata_content_encoding. Processing: extract text. Note: if you need to extract multiple document fields from a JSON blob, see Indexing JSON blobs for details.

  • EML (message/rfc822) - extracted metadata: metadata_content_type, metadata_message_from, metadata_message_to, metadata_message_cc, metadata_creation_date, metadata_subject. Processing: extract text, including attachments.

  • RTF (application/rtf) - extracted metadata: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_page_count, metadata_word_count. Processing: extract text.

  • Plain text (text/plain) - extracted metadata: metadata_content_type, metadata_content_encoding. Processing: extract text.

See also