How to index documents in Azure Blob Storage with Azure Cognitive Search

This article shows how to use Azure Cognitive Search to index documents (such as PDFs, Microsoft Office documents, and several other common formats) stored in Azure Blob Storage. First, it covers the basics of setting up and configuring a blob indexer. Then, it explores behaviors and scenarios you are likely to encounter in more depth.

Supported document formats

The blob indexer can extract text from the following document formats:

Setting up blob indexing

You can set up an Azure Blob Storage indexer using:

Note

Some features (for example, field mappings) are not yet available in the portal and must be used programmatically.

Here, we demonstrate the flow using the REST API.

Step 1: Create a data source

A data source specifies which data to index, the credentials needed to access the data, and policies to efficiently identify changes in the data (new, modified, or deleted rows). A data source can be used by multiple indexers in the same search service.

For blob indexing, the data source must have the following required properties:

  • name is the unique name of the data source within your search service.
  • type must be azureblob.
  • credentials provides the storage account connection string as the credentials.connectionString parameter. See How to specify credentials below for details.
  • container specifies a container in your storage account. By default, all blobs within the container are retrievable. If you only want to index blobs in a particular virtual directory, you can specify that directory using the optional query parameter.

To create a data source:

    POST https://[service name].search.azure.cn/datasources?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
        "container" : { "name" : "my-container", "query" : "<optional-virtual-directory-name>" }
    }   

For more on the Create Datasource API, see Create Datasource.
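As a rough illustration, the request above could be issued from Python with the standard library; the service name, admin key, and connection string below are placeholders you would substitute with your own values:

```python
import json
from urllib import request

SERVICE = "my-search-service"  # placeholder: your search service name
ADMIN_KEY = "<admin key>"      # placeholder: your admin api-key

def build_datasource(name, connection_string, container, query=None):
    """Build the JSON body for the Create Datasource call."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container, "query": query},
    }

def create_datasource(body):
    """POST the data source definition to the search service."""
    req = request.Request(
        f"https://{SERVICE}.search.azure.cn/datasources?api-version=2020-06-30",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
        method="POST",
    )
    return request.urlopen(req)  # raises on non-2xx responses

body = build_datasource("blob-datasource", "<connection string>", "my-container")
```

Calling `create_datasource(body)` would send the same request shown above; the payload builder is separated out so the definition can be inspected or reused without touching the network.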

How to specify credentials

You can provide the credentials for the blob container in one of the following ways:

  • Full-access storage account connection string: DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key> You can get the connection string from the Azure portal by navigating to the storage account blade > Settings > Keys (for Classic storage accounts) or Settings > Access keys (for Azure Resource Manager storage accounts).
  • Storage account shared access signature (SAS) connection string: BlobEndpoint=https://<your account>.blob.chinacloudsites.cn/;SharedAccessSignature=?sv=2016-05-31&sig=<the signature>&spr=https&se=<the validity end time>&srt=co&ss=b&sp=rl The SAS should have list and read permissions on containers and objects (blobs in this case).
  • Container shared access signature: ContainerSharedAccessUri=https://<your storage account>.blob.chinacloudsites.cn/<container name>?sv=2016-05-31&sr=c&sig=<the signature>&se=<the validity end time>&sp=rl The SAS should have list and read permissions on the container.

For more information on storage shared access signatures, see Using Shared Access Signatures.
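All three connection string formats above are semicolon-delimited key=value pairs. A small helper (shown purely as an illustration, not part of any Azure SDK) can split them for inspection:

```python
def parse_connection_string(conn):
    """Split a semicolon-delimited connection string into a dict of key=value pairs."""
    parts = {}
    for segment in conn.split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        # partition on the FIRST '=' only: account keys contain '=' padding
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts

cs = "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=abc123==;"
info = parse_connection_string(cs)
```

Splitting on only the first `=` matters because base64 account keys often end in `=` padding characters.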

Note

If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer fails with an error message similar to Credentials provided in the connection string are invalid or have expired.
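One way to avoid surprise expirations is to inspect the se (signed expiry) parameter of the SAS before the indexer's next run. This sketch assumes the standard se=<ISO 8601 UTC timestamp> format used by storage SAS tokens:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

def sas_expires_within(sas_query, days, now=None):
    """Return True if the SAS 'se' expiry falls within the next `days` days."""
    now = now or datetime.now(timezone.utc)
    # parse_qs URL-decodes values, so '%3A' in the timestamp becomes ':'
    expiry_raw = parse_qs(sas_query.lstrip("?"))["se"][0]
    expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
    return (expiry - now).days < days

sas = "?sv=2016-05-31&sig=abc&se=2024-01-31T00%3A00%3A00Z&sp=rl"
```

A periodic job could call this and rotate the data source credentials whenever it returns True.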

Step 2: Create an index

The index defines the fields in a document, their attributes, and other constructs that shape the search experience.

Here's how to create an index with a searchable content field to store the text extracted from blobs:

    POST https://[service name].search.azure.cn/indexes?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
          "name" : "my-target-index",
          "fields": [
            { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
            { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
          ]
    }

For more on creating indexes, see Create Index.

Step 3: Create an indexer

An indexer connects a data source with a target search index and provides a schedule to automate the data refresh.

Once the index and data source have been created, you're ready to create the indexer:

    POST https://[service name].search.azure.cn/indexers?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "name" : "blob-indexer",
      "dataSourceName" : "blob-datasource",
      "targetIndexName" : "my-target-index",
      "schedule" : { "interval" : "PT2H" }
    }

This indexer runs every two hours (the schedule interval is set to "PT2H"). To run an indexer every 30 minutes, set the interval to "PT30M". The shortest supported interval is 5 minutes. The schedule is optional; if omitted, an indexer runs only once when it's created. However, you can run an indexer on demand at any time.
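The interval values are ISO 8601 durations. As a quick illustration (not an Azure SDK helper), here's how values like "PT2H" and "PT30M" translate into seconds, with the documented 5-minute floor applied:

```python
import re

def interval_seconds(duration):
    """Convert a simple ISO 8601 'PT...' duration (hours/minutes/seconds) to seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    total = hours * 3600 + minutes * 60 + seconds
    if total < 300:  # the service enforces a 5-minute minimum interval
        raise ValueError("shortest supported interval is 5 minutes (PT5M)")
    return total
```

This handles only the time components used by indexer schedules; full ISO 8601 durations (years, months, days) are out of scope for the sketch.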

For more details on the Create Indexer API, see Create Indexer.

For more information about defining indexer schedules, see How to schedule indexers for Azure Cognitive Search.

How Azure Cognitive Search indexes blobs

Depending on the indexer configuration, the blob indexer can index storage metadata only (useful when you only care about the metadata and don't need to index the content of blobs), storage and content metadata, or both metadata and textual content. By default, the indexer extracts both metadata and content.

Note

By default, blobs with structured content such as JSON or CSV are indexed as a single chunk of text. If you want to index JSON and CSV blobs in a structured way, see Indexing JSON blobs and Indexing CSV blobs for more information.

A compound or embedded document (such as a ZIP archive, a Word document with an embedded Outlook email containing attachments, or a .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of a .MSG file are returned in the normalized_images field.

  • The textual content of the document is extracted into a string field named content.

Note

Azure Cognitive Search limits how much text it extracts, depending on the pricing tier: 32,000 characters for the Free tier, 64,000 for Basic, 4 million for Standard, 8 million for Standard S2, and 16 million for Standard S3. A warning is included in the indexer status response for truncated documents.

  • User-specified metadata properties present on the blob, if any, are extracted verbatim. Note that this requires a field to be defined in the index with the same name as the blob's metadata key. For example, if your blob has a metadata key Sensitivity with the value High, you should define a field named Sensitivity in your search index, and it will be populated with the value High.

  • Standard blob metadata properties are extracted into the following fields:

    • metadata_storage_name (Edm.String) - the file name of the blob. For example, if you have a blob /my-container/my-folder/subfolder/resume.pdf, the value of this field is resume.pdf.
    • metadata_storage_path (Edm.String) - the full URI of the blob, including the storage account. For example: https://myaccount.blob.chinacloudsites.cn/my-container/my-folder/subfolder/resume.pdf
    • metadata_storage_content_type (Edm.String) - the content type as specified by the code you used to upload the blob. For example, application/octet-stream.
    • metadata_storage_last_modified (Edm.DateTimeOffset) - the last-modified timestamp for the blob. Azure Cognitive Search uses this timestamp to identify changed blobs and avoid reindexing everything after the initial indexing.
    • metadata_storage_size (Edm.Int64) - the blob size in bytes.
    • metadata_storage_content_md5 (Edm.String) - the MD5 hash of the blob content, if available.
    • metadata_storage_sas_token (Edm.String) - a temporary SAS token that custom skills can use to access the blob. This token should not be stored for later use, as it might expire.
  • Metadata properties specific to each document format are extracted into the fields listed here.

You don't need to define fields for all of the above properties in your search index - just capture the properties you need for your application.

Note

Often, the field names in your existing index will differ from the field names generated during document extraction. You can use field mappings to map the property names provided by Azure Cognitive Search to the field names in your search index. You will see an example of field mappings in use below.

Defining document keys and field mappings

In Azure Cognitive Search, the document key uniquely identifies a document. Every search index must have exactly one key field of type Edm.String. The key field is required for each document added to the index (in fact, it's the only required field).

You should carefully consider which extracted field to map to the key field for your index. The candidates are:

  • metadata_storage_name - this might be a convenient candidate, but note that 1) the names might not be unique, as you may have blobs with the same name in different folders, and 2) the name may contain characters that are invalid in document keys, such as dashes. You can deal with invalid characters by using the base64Encode field mapping function - if you do this, remember to encode document keys when passing them in API calls such as Lookup. (For example, in .NET you can use the UrlTokenEncode method for that purpose.)
  • metadata_storage_path - using the full path ensures uniqueness, but the path definitely contains / characters that are invalid in a document key. As above, you have the option of encoding the keys using the base64Encode function.
  • If none of the options above work for you, you can add a custom metadata property to the blobs. This option does, however, require your blob upload process to add that metadata property to all blobs. Since the key is a required property, any blobs without that property will fail to be indexed.

Important

If there is no explicit mapping for the key field in the index, Azure Cognitive Search automatically uses metadata_storage_path as the key and base64-encodes key values (the second option above).
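To see why the encoding matters, here's a rough Python sketch of URL-safe base64 encoding, the general approach behind the base64Encode mapping function. The exact padding convention the service uses may differ by API version, so treat this as an illustration rather than a drop-in equivalent:

```python
import base64

def encode_key(path):
    """URL-safe base64 without padding: '/' and '+' never appear in the result."""
    return base64.urlsafe_b64encode(path.encode("utf-8")).decode("ascii").rstrip("=")

def decode_key(key):
    """Invert encode_key by restoring the stripped '=' padding."""
    padded = key + "=" * (-len(key) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")

key = encode_key("https://myaccount.blob.chinacloudsites.cn/my-container/resume.pdf")
```

The point of the round trip is that the full path survives intact while the encoded form contains no characters that are invalid in a document key.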

For this example, let's pick the metadata_storage_name field as the document key. Let's also assume your index has a key field named key and a field fileSize for storing the document size. To wire things up as desired, specify the following field mappings when creating or updating your indexer:

    "fieldMappings" : [
      { "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
      { "sourceFieldName" : "metadata_storage_size", "targetFieldName" : "fileSize" }
    ]

To bring this all together, here's how you can add field mappings and enable base64 encoding of keys for an existing indexer:

    PUT https://[service name].search.azure.cn/indexers/blob-indexer?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "dataSourceName" : " blob-datasource ",
      "targetIndexName" : "my-target-index",
      "schedule" : { "interval" : "PT2H" },
      "fieldMappings" : [
        { "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
        { "sourceFieldName" : "metadata_storage_size", "targetFieldName" : "fileSize" }
      ]
    }

Note

To learn more about field mappings, see this article.

What if you need to encode a field to use it as a key, but you also want to search it?

There are times when you need to use an encoded version of a field like metadata_storage_path as the key, but you also need that field to be searchable (without encoding). To solve this, you can map it into two fields: one used for the key, and another used for search purposes. In the example below, the key field contains the encoded path, while the path field is not encoded and serves as the searchable field in the index.

    PUT https://[service name].search.azure.cn/indexers/blob-indexer?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "dataSourceName" : " blob-datasource ",
      "targetIndexName" : "my-target-index",
      "schedule" : { "interval" : "PT2H" },
      "fieldMappings" : [
        { "sourceFieldName" : "metadata_storage_path", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
        { "sourceFieldName" : "metadata_storage_path", "targetFieldName" : "path" }
      ]
    }

Controlling which blobs are indexed

You can control which blobs are indexed, and which are skipped.

Index only the blobs with specific file extensions

You can index only the blobs with the file name extensions you specify by using the indexedFileNameExtensions indexer configuration parameter. The value is a string containing a comma-separated list of file extensions (with a leading dot). For example, to index only the .PDF and .DOCX blobs, do this:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "indexedFileNameExtensions" : ".pdf,.docx" } }
    }

Exclude blobs with specific file extensions

You can exclude blobs with specific file name extensions from indexing by using the excludedFileNameExtensions configuration parameter. The value is a string containing a comma-separated list of file extensions (with a leading dot). For example, to index all blobs except those with the .PNG and .JPEG extensions, do this:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "excludedFileNameExtensions" : ".png,.jpeg" } }
    }

If both the indexedFileNameExtensions and excludedFileNameExtensions parameters are present, Azure Cognitive Search first looks at indexedFileNameExtensions, then at excludedFileNameExtensions. This means that if the same file extension is present in both lists, it is excluded from indexing.
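The precedence rule can be sketched as a small filter (an illustration of the documented behavior, not service code):

```python
def should_index(blob_name, indexed_exts=None, excluded_exts=None):
    """Apply indexedFileNameExtensions first, then excludedFileNameExtensions."""
    ext = "." + blob_name.rsplit(".", 1)[-1].lower() if "." in blob_name else ""
    if indexed_exts and ext not in {e.lower() for e in indexed_exts.split(",")}:
        return False  # extension is not in the include list
    if excluded_exts and ext in {e.lower() for e in excluded_exts.split(",")}:
        return False  # the exclude list wins when both lists contain the extension
    return True
```

Note the last case: a blob whose extension appears in both parameters is skipped, matching the precedence described above.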

Controlling which parts of the blob are indexed

You can control which parts of the blobs are indexed using the dataToExtract configuration parameter. It can take the following values:

For example, to index only the storage metadata, use:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "dataToExtract" : "storageMetadata" } }
    }

Using blob metadata to control how blobs are indexed

The configuration parameters described above apply to all blobs. Sometimes, you may want to control how individual blobs are indexed. You can do this by adding the following blob metadata properties and values:

  • AzureSearch_Skip, with the value "true" - instructs the blob indexer to skip the blob entirely. Neither metadata nor content extraction is attempted. This is useful when a particular blob fails repeatedly and interrupts the indexing process.
  • AzureSearch_SkipContent, with the value "true" - equivalent to the "dataToExtract" : "allMetadata" setting described above, scoped to a particular blob.

Dealing with errors

By default, the blob indexer stops as soon as it encounters a blob with an unsupported content type (for example, an image). You can, of course, use the excludedFileNameExtensions parameter to skip certain content types. However, you may need to index blobs without knowing all the possible content types in advance. To continue indexing when an unsupported content type is encountered, set the failOnUnsupportedContentType configuration parameter to false:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
    }

For some blobs, Azure Cognitive Search is unable to determine the content type, or unable to process a document of an otherwise supported content type. To ignore this failure mode, set the failOnUnprocessableDocument configuration parameter to false:

      "parameters" : { "configuration" : { "failOnUnprocessableDocument" : false } }

Azure Cognitive Search limits the size of blobs that are indexed. These limits are documented in Service limits in Azure Cognitive Search. Oversized blobs are treated as errors by default. However, you can still index the storage metadata of oversized blobs by setting the indexStorageMetadataOnlyForOversizedDocuments configuration parameter to true:

    "parameters" : { "configuration" : { "indexStorageMetadataOnlyForOversizedDocuments" : true } }

You can also continue indexing if errors happen at any point of processing, either while parsing blobs or while adding documents to an index. To ignore a specific number of errors, set the maxFailedItems and maxFailedItemsPerBatch configuration parameters to the desired values. For example:

    {
      ... other parts of indexer definition
      "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
    }

Incremental indexing and deletion detection

When you set up a blob indexer to run on a schedule, it reindexes only the changed blobs, as determined by each blob's LastModified timestamp.

Note

You don't have to specify a change detection policy - incremental indexing is enabled for you automatically.

To support deleting documents, use a "soft delete" approach. If you delete blobs outright, the corresponding documents will not be removed from the search index.

There are two ways to implement the soft delete approach. Both are described below.

Native blob soft delete (preview)

Important

Support for native blob soft delete is in preview. Preview functionality is provided without a service level agreement and is not recommended for production workloads. The REST API version 2020-06-30-Preview provides this feature. There is currently no portal or .NET SDK support.

Note

When using the native blob soft delete policy, the document keys for the documents in your index must be either a blob property or blob metadata.

In this method, you use the native blob soft delete feature offered by Azure Blob storage. If native blob soft delete is enabled on your storage account, your data source has a native soft delete policy set, and the indexer finds a blob that has been transitioned to a soft-deleted state, the indexer removes that document from the index. The native blob soft delete policy is not supported when indexing blobs from Azure Data Lake Storage Gen2.

Use the following steps:

  1. Enable native soft delete for Azure Blob storage. We recommend setting the retention policy to a value that's much higher than your indexer interval schedule. That way, if there's an issue running the indexer, or if you have a large number of documents to index, there's plenty of time for the indexer to eventually process the soft-deleted blobs. Azure Cognitive Search indexers will only delete a document from the index if they process the blob while it's in a soft-deleted state.

  2. Configure a native blob soft delete detection policy on the data source. An example is shown below. Since this feature is in preview, you must use the preview REST API.

  3. Run the indexer, or set the indexer to run on a schedule. When the indexer runs and processes the blob, the document is removed from the index.

    PUT https://[service name].search.azure.cn/datasources/blob-datasource?api-version=2020-06-30-Preview
    Content-Type: application/json
    api-key: [admin key]
    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : null },
        "dataDeletionDetectionPolicy" : {
            "@odata.type" :"#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
        }
    }
    

Reindexing undeleted blobs

If you delete a blob from Azure Blob storage with native soft delete enabled on your storage account, the blob transitions to a soft-deleted state, giving you the option to undelete it within the retention period. When an Azure Cognitive Search data source has a native blob soft delete policy and the indexer processes a soft-deleted blob, it removes that document from the index. If that blob is later undeleted, the indexer will not always reindex it. This is because the indexer determines which blobs to index based on the blob's LastModified timestamp. When a soft-deleted blob is undeleted, its LastModified timestamp is not updated, so if the indexer has already processed blobs with LastModified timestamps more recent than the undeleted blob's, it won't reindex the undeleted blob. To make sure that an undeleted blob is reindexed, you will need to update the blob's LastModified timestamp. One way to do this is by resaving the metadata of that blob. You don't need to change the metadata, but resaving it updates the blob's LastModified timestamp so that the indexer knows it needs to reindex this blob.
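The behavior follows from a simple high-water-mark check. This hypothetical sketch mirrors the logic described above (the service's actual implementation isn't public):

```python
from datetime import datetime

def blobs_to_reindex(blobs, high_water_mark):
    """Select blobs whose LastModified is newer than the last indexed timestamp."""
    return [name for name, last_modified in blobs if last_modified > high_water_mark]

blobs = [
    ("report.pdf", datetime(2021, 3, 2)),     # modified after the last run: picked up
    ("undeleted.pdf", datetime(2021, 1, 5)),  # undeleted, timestamp unchanged: skipped
]
selected = blobs_to_reindex(blobs, high_water_mark=datetime(2021, 2, 1))
```

The undeleted blob is skipped because its timestamp predates the high-water mark; resaving its metadata bumps LastModified past the mark, which is exactly why that workaround forces a reindex.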

Soft delete using custom metadata

In this method, you use a blob's metadata to indicate when a document should be removed from the search index.

Use the following steps:

  1. Add a custom metadata key-value pair to the blob to indicate to Azure Cognitive Search that it is logically deleted.
  2. Configure a soft delete column detection policy on the data source. An example is shown below.
  3. Once the indexer has processed the blob and deleted the document from the index, you can delete the blob from Azure Blob storage.

For example, the following policy considers a blob to be deleted if it has a metadata property IsDeleted with the value true:

    PUT https://[service name].search.azure.cn/datasources/blob-datasource?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : null },
        "dataDeletionDetectionPolicy" : {
            "@odata.type" :"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",     
            "softDeleteColumnName" : "IsDeleted",
            "softDeleteMarkerValue" : "true"
        }
    }
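In terms of the policy fields, the per-blob decision the indexer makes can be sketched like this (an illustration of the documented semantics, not service code):

```python
def is_soft_deleted(blob_metadata, policy):
    """Return True if the blob's metadata matches the soft delete marker."""
    column = policy["softDeleteColumnName"]
    marker = policy["softDeleteMarkerValue"]
    # a missing key means the blob is live; only an exact marker match deletes it
    return blob_metadata.get(column) == marker

policy = {"softDeleteColumnName": "IsDeleted", "softDeleteMarkerValue": "true"}
```

Blobs without the IsDeleted metadata key, or with any value other than the marker, stay in the index.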

Reindexing undeleted blobs

If you set a soft delete column detection policy on your data source, add the custom metadata to a blob with the marker value, and then run the indexer, the indexer removes that document from the index. If you would like to reindex that document, simply change the soft delete metadata value for that blob and rerun the indexer.

Indexing large datasets

Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to process the data in parallel. Here's how you can set this up:

  • Partition your data into multiple blob containers or virtual folders.

  • Set up several Azure Cognitive Search data sources, one per container or folder. To point to a blob folder, use the query parameter:

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : "my-folder" }
    }
    
  • Create a corresponding indexer for each data source. All the indexers can point to the same target search index.

  • One search unit in your service can run one indexer at any given time. Creating multiple indexers as described above is only useful if they actually run in parallel. To run multiple indexers in parallel, scale out your search service by creating an appropriate number of partitions and replicas. For example, if your search service has 6 search units (say, 2 partitions x 3 replicas), then 6 indexers can run simultaneously, resulting in a six-fold increase in indexing throughput. To learn more about scaling and capacity planning, see Scale resource levels for query and indexing workloads in Azure Cognitive Search.
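The fan-out described by these steps can be sketched with a small helper that generates one data source and one indexer definition per folder, all pointing at the same target index. Names such as `blob-datasource-0` and the `dataSourceName`/`targetIndexName` indexer fields follow the REST shapes used elsewhere in this article; each definition would be PUT to `/datasources/{name}` or `/indexers/{name}` with your preferred HTTP client:

```python
def partitioned_definitions(connection_string, container, folders, index_name):
    """Build (data source, indexer) definition pairs, one per virtual folder.
    All indexers target the same search index, so they can run in parallel."""
    pairs = []
    for i, folder in enumerate(folders):
        datasource = {
            "name": f"blob-datasource-{i}",
            "type": "azureblob",
            "credentials": {"connectionString": connection_string},
            # The query parameter scopes the data source to one virtual folder.
            "container": {"name": container, "query": folder},
        }
        indexer = {
            "name": f"blob-indexer-{i}",
            "dataSourceName": datasource["name"],
            "targetIndexName": index_name,  # shared by every indexer
        }
        pairs.append((datasource, indexer))
    return pairs
```

Remember that the definitions only help if the service has enough search units for the indexers to actually run concurrently.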

You may want to "assemble" documents in your index from multiple sources. For example, you may want to merge text from blobs with other metadata stored in Cosmos DB. You can even use the push indexing API together with various indexers to build up search documents from multiple parts.

For this to work, all indexers and other components need to agree on the document key. For additional details on this topic, see Index multiple Azure data sources. For a detailed walk-through, see this external article: Combine documents with other data in Azure Cognitive Search.
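For the push-API part of such a pipeline, extra fields are merged into existing search documents with the mergeOrUpload action, matched on the shared document key. A minimal payload builder (the key field name passed in, such as `id`, is whatever your index actually uses):

```python
def merge_batch(key_field, docs):
    """Build a push-API batch that merges extra fields into existing
    search documents, matched on the shared document key."""
    actions = []
    for doc in docs:
        action = {"@search.action": "mergeOrUpload"}
        action.update(doc)  # each document must carry the key field
        if key_field not in action:
            raise ValueError(f"document is missing key field '{key_field}'")
        actions.append(action)
    return {"value": actions}

# The batch would be POSTed to the documents endpoint, e.g.:
#   https://[service name].search.azure.cn/indexes/[index name]/docs/index?api-version=2020-06-30
```

Because mergeOrUpload is keyed on the document key, the blob indexer and the push client update the same search document as long as they agree on that key.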

Indexing plain text

If all your blobs contain plain text in the same encoding, you can significantly improve indexing performance by using text parsing mode. To use text parsing mode, set the parsingMode configuration property to text:

    PUT https://[service name].search.azure.cn/indexers/[indexer name]?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "parsingMode" : "text" } }
    }

By default, UTF-8 encoding is assumed. To specify a different encoding, use the encoding configuration property:

    {
      ... other parts of indexer definition
      "parameters" : { "configuration" : { "parsingMode" : "text", "encoding" : "windows-1252" } }
    }

Content type-specific metadata properties

The following summarizes the processing done for each document format and lists the metadata properties extracted by Azure Cognitive Search.

HTML (text/html)
  Metadata properties: metadata_content_encoding, metadata_content_type, metadata_language, metadata_description, metadata_keywords, metadata_title
  Processing: Strip HTML markup and extract text

PDF (application/pdf)
  Metadata properties: metadata_content_type, metadata_language, metadata_author, metadata_title
  Processing: Extract text, including embedded documents (excluding images)

DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

DOC (application/msword)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

DOCM (application/vnd.ms-word.document.macroenabled.12)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

WORD XML (application/vnd.ms-word2006ml)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Strip XML markup and extract text

WORD 2003 XML (application/vnd.ms-wordml)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date
  Processing: Strip XML markup and extract text

XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

XLS (application/vnd.ms-excel)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

XLSM (application/vnd.ms-excel.sheet.macroenabled.12)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title
  Processing: Extract text, including embedded documents

PPT (application/vnd.ms-powerpoint)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title
  Processing: Extract text, including embedded documents

PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title
  Processing: Extract text, including embedded documents

MSG (application/vnd.ms-outlook)
  Metadata properties: metadata_content_type, metadata_message_from, metadata_message_from_email, metadata_message_to, metadata_message_to_email, metadata_message_cc, metadata_message_cc_email, metadata_message_bcc, metadata_message_bcc_email, metadata_creation_date, metadata_last_modified, metadata_subject
  Processing: Extract text, including text extracted from attachments. metadata_message_to_email, metadata_message_cc_email, and metadata_message_bcc_email are string collections; the rest of the fields are strings.

ODT (application/vnd.oasis.opendocument.text)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

ODS (application/vnd.oasis.opendocument.spreadsheet)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

ODP (application/vnd.oasis.opendocument.presentation)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, title
  Processing: Extract text, including embedded documents

ZIP (application/zip)
  Metadata properties: metadata_content_type
  Processing: Extract text from all documents in the archive

GZ (application/gzip)
  Metadata properties: metadata_content_type
  Processing: Extract text from all documents in the archive

EPUB (application/epub+zip)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_title, metadata_description, metadata_language, metadata_keywords, metadata_identifier, metadata_publisher
  Processing: Extract text from all documents in the archive

XML (application/xml)
  Metadata properties: metadata_content_type, metadata_content_encoding
  Processing: Strip XML markup and extract text

JSON (application/json)
  Metadata properties: metadata_content_type, metadata_content_encoding
  Processing: Extract text. NOTE: If you need to extract multiple document fields from a JSON blob, see Indexing JSON blobs for details.

EML (message/rfc822)
  Metadata properties: metadata_content_type, metadata_message_from, metadata_message_to, metadata_message_cc, metadata_creation_date, metadata_subject
  Processing: Extract text, including attachments

RTF (application/rtf)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_page_count, metadata_word_count
  Processing: Extract text

Plain text (text/plain)
  Metadata properties: metadata_content_type, metadata_content_encoding
  Processing: Extract text

Help us make Azure Cognitive Search better

If you have feature requests or ideas for improvements, let us know on our UserVoice site.