How to index CSV blobs using delimitedText parsing mode and Blob indexers in Azure Cognitive Search

By default, the Azure Cognitive Search blob indexer parses a delimited text blob as a single chunk of text. With blobs that contain CSV data, however, you often want to treat each line in the blob as a separate document. For example, given the following delimited text, you might want to parse it into two documents, each containing "id", "datePublished", and "tags" fields:

    id, datePublished, tags
    1, 2016-01-12, "azure-search,azure,cloud"
    2, 2016-07-07, "cloud,mobile"

In this article, you will learn how to parse CSV blobs with an Azure Cognitive Search blob indexer by setting the delimitedText parsing mode.
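To make the row-to-document mapping concrete, here is a minimal local sketch in Python that splits the sample text into one dictionary per data row. It only illustrates the intended result; it is not the indexer's parser. The names `SAMPLE_BLOB` and `rows_to_documents` are hypothetical, and using `skipinitialspace=True` to drop the blank after each comma is a choice of this sketch, not documented indexer behavior.

```python
import csv
import io

# The sample delimited text from this article.
SAMPLE_BLOB = '''id, datePublished, tags
1, 2016-01-12, "azure-search,azure,cloud"
2, 2016-07-07, "cloud,mobile"
'''


def rows_to_documents(text, delimiter=","):
    """Return one dict per data row, keyed by the names in the header row."""
    # skipinitialspace=True ignores the blank after each delimiter, so the
    # quoted tags value is read as a single field (assumption of this sketch).
    reader = csv.DictReader(
        io.StringIO(text), delimiter=delimiter, skipinitialspace=True
    )
    return [dict(row) for row in reader]


documents = rows_to_documents(SAMPLE_BLOB)
for doc in documents:
    print(doc)
```

Each dictionary corresponds to one search document with "id", "datePublished", and "tags" fields; the quoted tags value stays a single string even though it contains commas.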

Note

Follow the indexer configuration recommendations in One-to-many indexing to output multiple search documents from one Azure blob.

Setting up CSV indexing

To index CSV blobs, create or update an indexer definition with the delimitedText parsing mode on a Create Indexer request:

    {
      "name" : "my-csv-indexer",
      ... other indexer properties
      "parameters" : { "configuration" : { "parsingMode" : "delimitedText", "firstLineContainsHeaders" : true } }
    }

firstLineContainsHeaders indicates that the first (non-blank) line of each blob contains headers. If blobs don't contain an initial header line, specify the headers in the indexer configuration:

    "parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextHeaders" : "id,datePublished,tags" } }

You can customize the delimiter character using the delimitedTextDelimiter configuration setting. For example:

    "parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextDelimiter" : "|" } }
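The three settings above can be assembled programmatically. The following Python sketch builds the "parameters" object for each case; the helper name `delimited_text_parameters` and its arguments are hypothetical, and only the JSON property names come from this article.

```python
import json


def delimited_text_parameters(first_line_contains_headers=False,
                              headers=None, delimiter=None):
    """Build the indexer "parameters" object for delimitedText parsing."""
    configuration = {"parsingMode": "delimitedText"}
    if first_line_contains_headers:
        # The first non-blank line of each blob holds the column names.
        configuration["firstLineContainsHeaders"] = True
    elif headers:
        # No header line in the blobs: list the column names explicitly,
        # comma-separated as in the article's example.
        configuration["delimitedTextHeaders"] = ",".join(headers)
    if delimiter is not None:
        configuration["delimitedTextDelimiter"] = delimiter
    return {"configuration": configuration}


with_header_row = delimited_text_parameters(first_line_contains_headers=True)
explicit_headers = delimited_text_parameters(headers=["id", "datePublished", "tags"])
pipe_delimited = delimited_text_parameters(delimiter="|")

for params in (with_header_row, explicit_headers, pipe_delimited):
    print(json.dumps(params))
```

The helper treats a header row and an explicit header list as alternatives, mirroring how the article presents them.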

Note

Currently, only the UTF-8 encoding is supported. If you need support for other encodings, vote for it on UserVoice.

Important

When you use the delimited text parsing mode, Azure Cognitive Search assumes that all blobs in your data source are CSV. If you need to support a mix of CSV and non-CSV blobs in the same data source, vote for it on UserVoice.

Request examples

Putting this all together, here are complete example payloads.

Datasource:

    POST https://[service name].search.azure.cn/datasources?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
        "name" : "my-blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
        "container" : { "name" : "my-container", "query" : "<optional, my-folder>" }
    }   

Indexer:

    POST https://[service name].search.azure.cn/indexers?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin key]

    {
      "name" : "my-csv-indexer",
      "dataSourceName" : "my-blob-datasource",
      "targetIndexName" : "my-target-index",
      "parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextHeaders" : "id,datePublished,tags" } }
    }
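As a rough illustration, the two requests above could be prepared from Python with the standard library. This sketch only builds the request objects and prints their targets; it does not send anything. The helper `build_post`, the service name "my-service", and the placeholder key and connection string are hypothetical. The URL pattern, headers, and payloads follow the examples above.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"


def build_post(service_name, admin_key, collection, body):
    """Prepare a POST request for /datasources or /indexers (not sent here)."""
    url = (f"https://{service_name}.search.azure.cn/"
           f"{collection}?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="POST",
    )


datasource = {
    "name": "my-blob-datasource",
    "type": "azureblob",
    "credentials": {"connectionString": "<connection string>"},  # placeholder
    "container": {"name": "my-container"},
}
indexer = {
    "name": "my-csv-indexer",
    "dataSourceName": "my-blob-datasource",
    "targetIndexName": "my-target-index",
    "parameters": {"configuration": {
        "parsingMode": "delimitedText",
        "delimitedTextHeaders": "id,datePublished,tags",
    }},
}

requests_to_send = [
    build_post("my-service", "<admin key>", "datasources", datasource),
    build_post("my-service", "<admin key>", "indexers", indexer),
]
for req in requests_to_send:
    print(req.get_method(), req.full_url)
    # To send for real: urllib.request.urlopen(req), with a valid service and key.
```

The data source is created first because the indexer definition refers to it by name through "dataSourceName".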

Help us make Azure Cognitive Search better

If you have feature requests or ideas for improvements, provide your input on UserVoice.