Tutorial: Index JSON blobs from Azure Storage using REST

Azure Cognitive Search can index JSON documents and arrays in Azure Blob storage using an indexer that knows how to read semi-structured data. Semi-structured data contains tags or markings that separate content within the data. It sits between unstructured data, which must be fully indexed, and formally structured data that adheres to a data model, such as a relational database schema, and can be indexed field by field.

This tutorial uses Postman and the Search REST APIs to perform the following tasks:

  • Configure an Azure Cognitive Search data source for an Azure blob container
  • Create an Azure Cognitive Search index to contain searchable content
  • Configure and run an indexer to read the container and extract searchable content from Azure Blob storage
  • Search the index you just created

If you don't have an Azure subscription, create a trial account before you begin.

Prerequisites

Note

You can use the free service for this tutorial. A free search service limits you to three indexes, three indexers, and three data sources. This tutorial creates one of each. Before starting, make sure you have room on your service to accept the new resources.

Download files

Clinical-trials-json.zip contains the data used in this tutorial. Download and unzip this file to its own folder. The data originates from clinicaltrials.gov and was converted to JSON for this tutorial.
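
If you prefer to script the unzip step, the following is a minimal sketch using only the Python standard library. The function names and the local file path in the usage comment are illustrative assumptions, not part of the tutorial's tooling.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip_to_own_folder(zip_path: str) -> Path:
    """Extract an archive into a sibling folder named after the archive."""
    archive = Path(zip_path)
    # e.g. Clinical-trials-json.zip -> Clinical-trials-json/
    target = archive.with_suffix("")
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


def list_files(folder: Path) -> List[Path]:
    """Return every extracted file, sorted, ready to select for upload."""
    return sorted(p for p in folder.rglob("*") if p.is_file())


# Usage (assumes the archive was downloaded to the current directory):
# folder = unzip_to_own_folder("Clinical-trials-json.zip")
# print(len(list_files(folder)), "files extracted to", folder)
```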

1 - Create services

This tutorial uses Azure Cognitive Search for indexing and queries, and Azure Blob storage to provide the data.

If possible, create both in the same region and resource group for proximity and manageability. In practice, your Azure Storage account can be in any region.

Start with Azure Storage

  1. Sign in to the Azure portal and click + Create Resource.

  2. Search for storage account and select Microsoft's Storage Account offering.

    Screenshot: Create Storage account

  3. In the Basics tab, the following items are required. Accept the defaults for everything else.

    • Resource group. Select an existing one or create a new one, but use the same group for all services so that you can manage them collectively.

    • Storage account name. If you think you might have multiple resources of the same type, use the name to disambiguate by type and region, for example blobstoragechinaeast2.

    • Location. If possible, choose the same location used for Azure Cognitive Search and Cognitive Services. Using a single location avoids bandwidth charges.

    • Account Kind. Choose the default, StorageV2 (general purpose v2).

  4. Click Review + Create to create the service.

  5. Once it's created, click Go to the resource to open the Overview page.

  6. Click Blobs service.

  7. Create a Blob container to contain sample data. You can set the Public Access Level to any of its valid values.

  8. After the container is created, open it and select Upload on the command bar.

    Screenshot: Upload on command bar

  9. Navigate to the folder containing the sample files. Select all of them and then click Upload.

    Screenshot: Upload files

After the upload completes, the files should appear in their own subfolder inside the data container.

The next resource is Azure Cognitive Search, which you can create in the portal. You can use the Free tier to complete this walkthrough.

As with Azure Blob storage, take a moment to collect the access key. Later, when you begin structuring requests, you will need to provide the endpoint and the admin api-key used to authenticate each request.

Get a key and URL

REST calls require the service URL and an access key on every request. A search service is created with both, so if you added Azure Cognitive Search to your subscription, follow these steps to get the necessary information:

  1. Sign in to the Azure portal, and in your search service Overview page, get the URL. An example endpoint might look like https://mydemo.search.azure.cn.

  2. In Settings > Keys, get an admin key for full rights on the service. There are two interchangeable admin keys, provided for business continuity in case you need to roll one over. You can use either the primary or secondary key on requests for adding, modifying, and deleting objects.

Screenshot: Get an HTTP endpoint and access key

Every request sent to your service requires an api-key. Having a valid key establishes trust, on a per-request basis, between the application sending the request and the service that handles it.

2 - Set up Postman

Start Postman and set up an HTTP request. If you are unfamiliar with this tool, see Explore Azure Cognitive Search REST APIs using Postman.

The request methods for the calls in this tutorial are POST and GET. You'll make three API calls to your search service to create a data source, an index, and an indexer. The data source includes a pointer to your storage account and your JSON data. Your search service makes the connection when loading the data.

In Headers, set "Content-type" to application/json and set api-key to the admin api-key of your Azure Cognitive Search service. Once you set the headers, you can use them for every request in this exercise.

Screenshot: Postman request URL and header

URIs must specify an api-version, and each create call should return 201 Created. The generally available api-version for using JSON arrays is 2020-06-30.
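
The same request setup can be expressed outside Postman. The sketch below uses the Python standard library to assemble the URL and the two headers described above; `build_request` is a hypothetical helper name, and the service name and key in the usage comment are placeholders. It only builds the request object and does not send anything.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"


def build_request(service_name, path, admin_key, body=None, method="POST"):
    """Build (but do not send) a request to the search service."""
    url = (
        f"https://{service_name}.search.azure.cn/{path}"
        f"?api-version={API_VERSION}"
    )
    # The same two headers are reused for every call in the tutorial.
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Usage (placeholders; sending requires a real service and admin key):
# req = build_request("mydemo", "datasources", "<admin-key>", {"name": "..."})
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```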

3 - Create a data source

The Create Data Source API creates an Azure Cognitive Search object that specifies what data to index.

  1. Set the endpoint of this call to https://[service name].search.azure.cn/datasources?api-version=2020-06-30. Replace [service name] with the name of your search service.

  2. Copy the following JSON into the request body.

    {
        "name" : "clinical-trials-json-ds",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=[storage account name];AccountKey=[storage account key];" },
        "container" : { "name" : "[blob container name]"}
    }
    
  3. Replace the connection string with a valid string for your account.

  4. Replace "[blob container name]" with the container you created for the sample data.

  5. Send the request. The response should look like:

    {
        "@odata.context": "https://exampleurl.search.azure.cn/$metadata#datasources/$entity",
        "@odata.etag": "\"0x8D505FBC3856C9E\"",
        "name": "clinical-trials-json-ds",
        "description": null,
        "type": "azureblob",
        "subtype": null,
        "credentials": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=[mystorageaccounthere];AccountKey=[myaccountkeyhere];"
        },
        "container": {
            "name": "[mycontainernamehere]",
            "query": null
        },
        "dataChangeDetectionPolicy": null,
        "dataDeletionDetectionPolicy": null
    }
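
To avoid hand-editing the placeholders, the request body can be generated. The following is a minimal sketch that fills in the same JSON structure shown in step 2; the function names are hypothetical and the account values in the usage lines are made up for illustration.

```python
import json


def build_connection_string(account_name, account_key):
    """Assemble the storage connection string in the format used above."""
    return (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};AccountKey={account_key};"
    )


def build_datasource(name, account_name, account_key, container_name):
    """Return the Create Data Source request body as a dictionary."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {
            "connectionString": build_connection_string(account_name, account_key)
        },
        "container": {"name": container_name},
    }


# Placeholder values; substitute your own account, key, and container.
body = build_datasource(
    "clinical-trials-json-ds", "mystorageaccount", "<account-key>", "mycontainer"
)
print(json.dumps(body, indent=4))
```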
    

4 - Create an index

The second call is the Create Index API, which creates an Azure Cognitive Search index that stores all searchable data. An index specifies all the fields and their attributes.

  1. Set the endpoint of this call to https://[service name].search.azure.cn/indexes?api-version=2020-06-30. Replace [service name] with the name of your search service.

  2. Copy the following JSON into the request body.

    {
      "name": "clinical-trials-json-index",  
      "fields": [
      {"name": "FileName", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": false, "filterable": false, "sortable": true},
      {"name": "Description", "type": "Edm.String", "searchable": true, "retrievable": false, "facetable": false, "filterable": false, "sortable": false},
      {"name": "MinimumAge", "type": "Edm.Int32", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": true},
      {"name": "Title", "type": "Edm.String", "searchable": true, "retrievable": true, "facetable": false, "filterable": true, "sortable": true},
      {"name": "URL", "type": "Edm.String", "searchable": false, "retrievable": false, "facetable": false, "filterable": false, "sortable": false},
      {"name": "MyURL", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": false, "filterable": false, "sortable": false},
      {"name": "Gender", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
      {"name": "MaximumAge", "type": "Edm.Int32", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": true},
      {"name": "Summary", "type": "Edm.String", "searchable": true, "retrievable": true, "facetable": false, "filterable": false, "sortable": false},
      {"name": "NCTID", "type": "Edm.String", "key": true, "searchable": true, "retrievable": true, "facetable": false, "filterable": true, "sortable": true},
      {"name": "Phase", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
      {"name": "Date", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": false, "filterable": false, "sortable": true},
      {"name": "OverallStatus", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
      {"name": "OrgStudyId", "type": "Edm.String", "searchable": true, "retrievable": true, "facetable": false, "filterable": true, "sortable": false},
      {"name": "HealthyVolunteers", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
      {"name": "Keywords", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "facetable": true, "filterable": false, "sortable": false},
      {"name": "metadata_storage_last_modified", "type":"Edm.DateTimeOffset", "searchable": false, "retrievable": true, "filterable": true, "sortable": false},
      {"name": "metadata_storage_size", "type":"Edm.String", "searchable": false, "retrievable": true, "filterable": true, "sortable": false},
      {"name": "metadata_content_type", "type":"Edm.String", "searchable": true, "retrievable": true, "filterable": true, "sortable": false}
      ]
    }
    
  3. Send the request. The response should look like:

    {
        "@odata.context": "https://exampleurl.search.azure.cn/$metadata#indexes/$entity",
        "@odata.etag": "\"0x8D505FC00EDD5FA\"",
        "name": "clinical-trials-json-index",
        "fields": [
            {
                "name": "FileName",
                "type": "Edm.String",
                "searchable": false,
                "filterable": false,
                "retrievable": true,
                "sortable": true,
                "facetable": false,
                "key": false,
                "indexAnalyzer": null,
                "searchAnalyzer": null,
                "analyzer": null,
                "synonymMaps": []
            },
            {
                "name": "Description",
                "type": "Edm.String",
                "searchable": true,
                "filterable": false,
                "retrievable": false,
                "sortable": false,
                "facetable": false,
                "key": false,
                "indexAnalyzer": null,
                "searchAnalyzer": null,
                "analyzer": null,
                "synonymMaps": []
            },
            ...
        ]
    }
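
Because the field list is long and repetitive, it can help to sanity-check an index definition locally before sending it. The sketch below is an illustrative check, not the service's own validation: it relies on two rules, that an index has exactly one key field and that the key is of type Edm.String. The helper names `make_field` and `check_index` are hypothetical, and the defaults in `make_field` belong to this helper only, not to the service.

```python
from collections import Counter


def make_field(name, type_="Edm.String", **attrs):
    """Build a field definition; this helper's defaults are all False
    except retrievable, and any attribute can be overridden."""
    field = {
        "name": name,
        "type": type_,
        "searchable": False,
        "retrievable": True,
        "facetable": False,
        "filterable": False,
        "sortable": False,
    }
    field.update(attrs)
    return field


def check_index(index):
    """Return a list of problems found in an index definition (empty if none)."""
    problems = []
    fields = index.get("fields", [])
    counts = Counter(f["name"] for f in fields)
    for name, n in counts.items():
        if n > 1:
            problems.append(f"duplicate field name: {name}")
    keys = [f for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append(f"expected exactly one key field, found {len(keys)}")
    elif keys[0]["type"] != "Edm.String":
        problems.append("the key field must be of type Edm.String")
    return problems


# A shortened version of the tutorial's index, for illustration only.
index = {
    "name": "clinical-trials-json-index",
    "fields": [
        make_field("NCTID", key=True, searchable=True, filterable=True, sortable=True),
        make_field("Title", searchable=True, filterable=True, sortable=True),
        make_field("MinimumAge", "Edm.Int32", facetable=True, filterable=True, sortable=True),
    ],
}
print(check_index(index))
```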
    

5 - Create and run an indexer

An indexer connects to the data source, imports data into the target search index, and optionally provides a schedule to automate the data refresh. The REST API is Create Indexer.

  1. Set the URI for this call to https://[service name].search.azure.cn/indexers?api-version=2020-06-30. Replace [service name] with the name of your search service.

  2. Copy the following JSON into the request body.

    {
      "name" : "clinical-trials-json-indexer",
      "dataSourceName" : "clinical-trials-json-ds",
      "targetIndexName" : "clinical-trials-json-index",
      "parameters" : { "configuration" : { "parsingMode" : "jsonArray" } }
    }
    
  3. Send the request. The request is processed immediately. When the response comes back, you will have an index that is full-text searchable. The response should look like:

    {
        "@odata.context": "https://exampleurl.search.azure.cn/$metadata#indexers/$entity",
        "@odata.etag": "\"0x8D505FDE143D164\"",
        "name": "clinical-trials-json-indexer",
        "description": null,
        "dataSourceName": "clinical-trials-json-ds",
        "targetIndexName": "clinical-trials-json-index",
        "schedule": null,
        "parameters": {
            "batchSize": null,
            "maxFailedItems": null,
            "maxFailedItemsPerBatch": null,
            "base64EncodeKeys": null,
            "configuration": {
                "parsingMode": "jsonArray"
            }
        },
        "fieldMappings": [],
        "enrichers": [],
        "disabled": null
    }
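
To make the effect of "parsingMode": "jsonArray" concrete, here is a conceptual sketch of how a blob containing a JSON array maps to search documents: each array element becomes its own document. This illustrates the idea only and is not the indexer's actual implementation; the `json` mode is included for contrast, and the function name and sample blob are made up.

```python
import json


def documents_from_blob(blob_text, parsing_mode="jsonArray"):
    """Split one blob's text into the documents an indexer would emit."""
    parsed = json.loads(blob_text)
    if parsing_mode == "jsonArray":
        # One search document per element of the top-level array.
        if not isinstance(parsed, list):
            raise ValueError("jsonArray mode expects a top-level JSON array")
        return parsed
    if parsing_mode == "json":
        # The whole blob is treated as a single document.
        return [parsed]
    raise ValueError(f"unsupported parsing mode in this sketch: {parsing_mode}")


# Hypothetical blob content with two trial records.
blob = '[{"NCTID": "NCT00000102", "Gender": "Both"}, {"NCTID": "NCT00000105", "Gender": "Both"}]'
for doc in documents_from_blob(blob):
    print(doc["NCTID"])
```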
    

6 - Search your JSON files

You can start searching as soon as the first document is loaded.

  1. Change the verb to GET.

  2. Set the URI for this call to https://[service name].search.azure.cn/indexes/clinical-trials-json-index/docs?search=*&api-version=2020-06-30&$count=true. Replace [service name] with the name of your search service.

  3. Send the request. This is an empty (unspecified) full-text search query that returns all of the fields marked as retrievable in the index, along with a document count. The response should look like:

    {
        "@odata.context": "https://exampleurl.search.azure.cn/indexes('clinical-trials-json-index')/$metadata#docs(*)",
        "@odata.count": 100,
        "value": [
            {
                "@search.score": 1.0,
                "FileName": "NCT00000102.txt",
                "MinimumAge": 14,
                "Title": "Congenital Adrenal Hyperplasia: Calcium Channels as Therapeutic Targets",
                "MyURL": "https://azure.storagedemos.com/clinical-trials/NCT00000102.txt",
                "Gender": "Both",
                "MaximumAge": 35,
                "Summary": "This study will test the ability of extended release nifedipine (Procardia XL), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (CAH).",
                "NCTID": "NCT00000102",
                "Phase": "Phase 1/Phase 2",
                "Date": "ClinicalTrials.gov processed this data on October 25, 2016",
                "OverallStatus": "Completed",
                "OrgStudyId": "NCRR-M01RR01070-0506",
                "HealthyVolunteers": "No",
                "Keywords": [],
                "metadata_storage_last_modified": "2019-04-09T18:16:24Z",
                "metadata_storage_size": "33060",
                "metadata_content_type": null
            },
            . . . 
    
  4. Add the $select query parameter to limit the results to fewer fields: https://[service name].search.azure.cn/indexes/clinical-trials-json-index/docs?search=*&$select=Gender,metadata_storage_size&api-version=2020-06-30&$count=true. For this query, 100 documents match, but by default, Azure Cognitive Search only returns 50 in the results.

    Screenshot: Parameterized query

  5. A more complex query might include $filter=MinimumAge ge 30 and MaximumAge lt 75, which returns only results where the MinimumAge field is greater than or equal to 30 and the MaximumAge field is less than 75. Replace the $select expression with the $filter expression.

    Screenshot: Semi-structured search

You can also use logical operators (and, or, not) and comparison operators (eq, ne, gt, lt, ge, le). String comparisons are case-sensitive. For more information and examples, see Create a simple query.
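
The query URIs in the steps above can also be assembled programmatically, which takes care of encoding the spaces in a $filter expression. This is a minimal sketch using the Python standard library; `build_query_url` is a hypothetical helper and "mydemo" is a placeholder service name.

```python
from urllib.parse import quote, urlencode


def build_query_url(service_name, index_name, search="*", select=None,
                    filter_=None, count=True, api_version="2020-06-30"):
    """Return a GET URL for the Search Documents call."""
    params = {"search": search, "api-version": api_version}
    if select:
        params["$select"] = ",".join(select)
    if filter_:
        params["$filter"] = filter_
    if count:
        params["$count"] = "true"
    # Keep "$", "*" and "," readable; encode spaces as %20 rather than "+".
    query = urlencode(params, safe="$*,", quote_via=quote)
    return (
        f"https://{service_name}.search.azure.cn"
        f"/indexes/{index_name}/docs?{query}"
    )


# Step 4: restrict the returned fields.
print(build_query_url("mydemo", "clinical-trials-json-index",
                      select=["Gender", "metadata_storage_size"]))
# Step 5: filter on the two age fields.
print(build_query_url("mydemo", "clinical-trials-json-index",
                      filter_="MinimumAge ge 30 and MaximumAge lt 75"))
```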

Note

The $filter parameter only works with fields that were marked filterable when the index was created.
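
One way to catch this before sending a query is to compare the field names used in a filter against the index definition. The sketch below is a rough local check under simplifying assumptions: it finds field names with a regular expression rather than parsing the filter syntax, so a string literal that happens to equal a field name would be misreported. The function names and the shortened index are illustrative.

```python
import re


def filterable_fields(index):
    """Names of fields whose definition sets filterable to true."""
    return {f["name"] for f in index["fields"] if f.get("filterable")}


def unfilterable_in_filter(filter_expr, index):
    """Index fields referenced in the expression that are not filterable."""
    all_names = {f["name"] for f in index["fields"]}
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", filter_expr))
    used = tokens & all_names
    return sorted(used - filterable_fields(index))


# Shortened index definition using attribute values from the tutorial.
sample_index = {
    "fields": [
        {"name": "MinimumAge", "type": "Edm.Int32", "filterable": True},
        {"name": "MaximumAge", "type": "Edm.Int32", "filterable": True},
        {"name": "Summary", "type": "Edm.String", "filterable": False},
    ]
}
print(unfilterable_in_filter("MinimumAge ge 30 and MaximumAge lt 75", sample_index))
print(unfilterable_in_filter("Summary eq 'x'", sample_index))
```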

Reset and rerun

In the early experimental stages of development, the most practical approach for design iteration is to delete the objects from Azure Cognitive Search and allow your code to rebuild them. Resource names are unique. Deleting an object lets you recreate it using the same name.

You can use the portal to delete indexes, indexers, and data sources. Or use DELETE and provide the URL of each object. The following command deletes an indexer.

DELETE https://[YOUR-SERVICE-NAME].search.azure.cn/indexers/clinical-trials-json-indexer?api-version=2020-06-30

Status code 204 is returned on successful deletion.
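
To reset all three objects in one pass, the DELETE calls can be scripted. The following sketch builds one DELETE request per object created in this tutorial, following the URL pattern of the indexer example above. The helper name is hypothetical, the service name and key are placeholders, and the lines that actually send the requests are left commented out.

```python
import urllib.request

# Collection path and object name for each resource created in this tutorial.
TUTORIAL_OBJECTS = [
    ("indexers", "clinical-trials-json-indexer"),
    ("indexes", "clinical-trials-json-index"),
    ("datasources", "clinical-trials-json-ds"),
]


def build_delete_requests(service_name, admin_key, api_version="2020-06-30"):
    """Return one unsent DELETE request per tutorial object."""
    requests = []
    for collection, name in TUTORIAL_OBJECTS:
        url = (
            f"https://{service_name}.search.azure.cn/{collection}/{name}"
            f"?api-version={api_version}"
        )
        requests.append(
            urllib.request.Request(url, headers={"api-key": admin_key}, method="DELETE")
        )
    return requests


# Usage (placeholders; uncomment to run against a real service):
# for req in build_delete_requests("mydemo", "<admin-key>"):
#     with urllib.request.urlopen(req) as resp:
#         print(req.full_url, resp.status)
```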

Clean up resources

When you're working in your own subscription, it's a good idea to remove the resources that you no longer need at the end of a project. Resources left running can cost you money. You can delete resources individually or delete the resource group to delete the entire set of resources.

You can find and manage resources in the portal, using the All resources or Resource groups link in the left-navigation pane.

Next steps

Now that you're familiar with the basics of Azure Blob indexing, let's take a closer look at indexer configuration for JSON blobs in Azure Storage.