Indexing blobs to produce multiple search documents

By default, a blob indexer treats the contents of a blob as a single search document. Certain parsingMode values support scenarios where an individual blob produces multiple search documents. The parsingMode values that allow an indexer to extract more than one search document from a blob are:

  • delimitedText
  • jsonArray
  • jsonLines

One-to-many document key

Each document in an Azure Cognitive Search index is uniquely identified by a document key.

When no parsing mode is specified and there is no explicit mapping for the key field in the index, Azure Cognitive Search automatically maps the metadata_storage_path property to the key. This mapping ensures that each blob appears as a distinct search document.

With any of the parsing modes listed above, one blob maps to many search documents, so a document key based solely on blob metadata is unsuitable. To overcome this constraint, Azure Cognitive Search can generate a "one-to-many" document key for each individual entity extracted from a blob. This property is named AzureSearch_DocumentKey and is added to each entity extracted from the blob. Its value is guaranteed to be unique for each entity across blobs, and the entities show up as separate search documents.

By default, when no explicit field mapping is specified for the key index field, AzureSearch_DocumentKey is mapped to it using the base64Encode field-mapping function.
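As a rough illustration of why an encoding step is applied, the following Python sketch encodes a key value with URL-safe Base64, which turns characters such as "/" and ":" into characters that are safe to use in a document key. The helper names and the sample raw key are hypothetical. The real format of AzureSearch_DocumentKey is internal to the service, and the service's base64Encode function may differ from this sketch in padding details.

```python
import base64

def base64_encode_key(raw_key: str) -> str:
    """URL-safe Base64 encoding of a raw key string (illustrative only)."""
    return base64.urlsafe_b64encode(raw_key.encode("utf-8")).decode("ascii")

def base64_decode_key(encoded_key: str) -> str:
    """Inverse of base64_encode_key."""
    return base64.urlsafe_b64decode(encoded_key.encode("ascii")).decode("utf-8")

# Hypothetical raw key: a blob URL plus an entity counter.
# This format is an assumption made for the example, not the service's format.
raw = "https://example.blob.core.windows.net/container/blob1.json;1"
encoded = base64_encode_key(raw)

print(encoded)
print(base64_decode_key(encoded) == raw)
```

The encoded value contains only letters, digits, "-", "_", and "=" padding, unlike the raw URL it was derived from.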

Example

Assume you have an index definition with the following fields:

  • id
  • temperature
  • pressure
  • timestamp

And your blob container has blobs with the following structure:

Blob1.json

    { "temperature": 100, "pressure": 100, "timestamp": "2019-02-13T00:00:00Z" }
    { "temperature" : 33, "pressure" : 30, "timestamp": "2019-02-14T00:00:00Z" }

Blob2.json

    { "temperature": 1, "pressure": 1, "timestamp": "2018-01-12T00:00:00Z" }
    { "temperature" : 120, "pressure" : 3, "timestamp": "2013-05-11T00:00:00Z" }

When you create an indexer and set parsingMode to jsonLines without specifying any explicit field mapping for the key field, the following mapping is applied implicitly:

    {
        "sourceFieldName" : "AzureSearch_DocumentKey",
        "targetFieldName": "id",
        "mappingFunction": { "name" : "base64Encode" }
    }
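For context, an indexer definition that relies on this implicit mapping might look like the sketch below, written as a Python dictionary that serializes to the JSON request body. The indexer, data source, and index names are hypothetical placeholders. The parsingMode setting sits under parameters.configuration, and no fieldMappings entry is given for the key field.

```python
import json

# Sketch of an indexer definition body. All three names are hypothetical.
indexer_definition = {
    "name": "sensor-readings-indexer",
    "dataSourceName": "sensor-blob-datasource",
    "targetIndexName": "sensor-readings",
    "parameters": {
        # One search document is produced per JSON line in each blob.
        "configuration": {"parsingMode": "jsonLines"}
    },
    # No "fieldMappings" for the key field "id" are listed here,
    # so the implicit AzureSearch_DocumentKey mapping applies.
}

print(json.dumps(indexer_definition, indent=2))
```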

This setup results in the Azure Cognitive Search index containing the following information (Base64-encoded ids shortened for brevity):

id                       temperature   pressure   timestamp
aHR0 ... YjEuanNvbjsx    100           100        2019-02-13T00:00:00Z
aHR0 ... YjEuanNvbjsy    33            30         2019-02-14T00:00:00Z
aHR0 ... YjIuanNvbjsx    1             1          2018-01-12T00:00:00Z
aHR0 ... YjIuanNvbjsy    120           3          2013-05-11T00:00:00Z
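The one-blob-to-many-documents behavior shown in the table above can be simulated locally. The following Python sketch splits each blob into JSON lines and assigns every extracted entity its own key. The key format used here (blob name plus line number, Base64-encoded) is an assumption made for illustration. The service's actual key generation is internal and should not be relied on.

```python
import base64
import json

# The two sample blobs from the example, one JSON object per line.
blobs = {
    "Blob1.json": (
        '{ "temperature": 100, "pressure": 100, "timestamp": "2019-02-13T00:00:00Z" }\n'
        '{ "temperature" : 33, "pressure" : 30, "timestamp": "2019-02-14T00:00:00Z" }\n'
    ),
    "Blob2.json": (
        '{ "temperature": 1, "pressure": 1, "timestamp": "2018-01-12T00:00:00Z" }\n'
        '{ "temperature" : 120, "pressure" : 3, "timestamp": "2013-05-11T00:00:00Z" }\n'
    ),
}

def extract_documents(blob_name: str, content: str) -> list:
    """Parse one blob in a jsonLines-like fashion: one document per non-empty line."""
    docs = []
    for line_number, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue
        entity = json.loads(line)
        # Hypothetical per-entity key, unique across blobs and lines.
        raw_key = f"{blob_name};{line_number}"
        entity["id"] = base64.urlsafe_b64encode(raw_key.encode("utf-8")).decode("ascii")
        docs.append(entity)
    return docs

# Model the index as a dictionary keyed by document key.
index = {}
for name, content in blobs.items():
    for doc in extract_documents(name, content):
        index[doc["id"]] = doc

print(len(index))
```

Because every entity receives a distinct key, all four readings survive as separate documents.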

Custom field mapping for index key field

Assuming the same index definition as in the previous example, suppose your blob container has blobs with the following structure:

Blob1.json

    recordid, temperature, pressure, timestamp
    1, 100, 100,"2019-02-13T00:00:00Z" 
    2, 33, 30,"2019-02-14T00:00:00Z" 

Blob2.json

    recordid, temperature, pressure, timestamp
    1, 1, 1,"2018-01-12T00:00:00Z" 
    2, 120, 3,"2013-05-11T00:00:00Z" 

When you create an indexer with the delimitedText parsingMode, it might feel natural to set up a field mapping to the key field as follows:

    {
        "sourceFieldName" : "recordid",
        "targetFieldName": "id"
    }

However, this mapping does not result in 4 documents showing up in the index, because the recordid field is not unique across blobs. We therefore recommend that you rely on the implicit field mapping from the AzureSearch_DocumentKey property to the key index field for "one-to-many" parsing modes.

If you do want to set up an explicit field mapping, make sure that the field named in sourceFieldName is distinct for each individual entity across all blobs.
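The collision can be reproduced locally. The following Python sketch parses the two delimited blobs from this example and stores rows in a dictionary keyed by recordid, which stands in for an index where documents with the same key overwrite one another. That overwrite behavior is an assumption of this model, and the function name is hypothetical.

```python
import csv
import io

# The two sample blobs from this example (trailing spaces omitted).
blobs = {
    "Blob1.json": (
        "recordid, temperature, pressure, timestamp\n"
        '1, 100, 100,"2019-02-13T00:00:00Z"\n'
        '2, 33, 30,"2019-02-14T00:00:00Z"\n'
    ),
    "Blob2.json": (
        "recordid, temperature, pressure, timestamp\n"
        '1, 1, 1,"2018-01-12T00:00:00Z"\n'
        '2, 120, 3,"2013-05-11T00:00:00Z"\n'
    ),
}

def index_by_field(all_blobs: dict, key_field: str) -> dict:
    """Model an index as a dict: rows sharing a key overwrite each other."""
    index = {}
    for content in all_blobs.values():
        reader = csv.DictReader(io.StringIO(content), skipinitialspace=True)
        for row in reader:
            index[row[key_field]] = row
    return index

by_recordid = index_by_field(blobs, "recordid")
total_rows = sum(len(c.splitlines()) - 1 for c in blobs.values())

print(total_rows)        # rows extracted from both blobs
print(len(by_recordid))  # documents left after key collisions
```

Four rows are extracted, but only two documents remain, because recordid values 1 and 2 occur in both blobs and the rows from Blob2.json replace those from Blob1.json.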

Note

The approach AzureSearch_DocumentKey uses to ensure uniqueness per extracted entity is subject to change, so you should not rely on its value for your application's needs.

Next steps

If you aren't already familiar with the basic structure and workflow of blob indexing, you should review Indexing Azure Blob Storage with Azure Cognitive Search first. For more information about parsing modes for different blob content types, review the following articles.