Tutorial: Use REST and AI to generate searchable content from Azure blobs

If you have unstructured text or images in Azure Blob storage, an AI enrichment pipeline can extract information and create new content that is useful for full-text search or knowledge mining scenarios. Although a pipeline can process images, this REST tutorial focuses on text, applying language detection and natural language processing to create new fields that you can leverage in queries, facets, and filters.

This tutorial uses Postman and the Search REST APIs to perform the following tasks:

  • Start with whole documents (unstructured text), such as PDF, HTML, DOCX, and PPTX in Azure Blob storage.
  • Define a pipeline that extracts text, detects language, recognizes entities, and detects key phrases.
  • Define an index to store the output (raw content, plus pipeline-generated name-value pairs).
  • Execute the pipeline to start transformations and analysis, and to create and load the index.
  • Explore results using full text search and a rich query syntax.

If you don't have an Azure subscription, open a trial account before you begin.

Prerequisites

Note

You can use the free service for this tutorial. A free search service limits you to three indexes, three indexers, and three data sources. This tutorial creates one of each. Before starting, make sure you have room on your service to accept the new resources.

Download files

  1. Open this OneDrive folder and, on the top-left corner, click Download to copy the files to your computer.

  2. Right-click the zip file and select Extract All. There are 14 files of various types. You'll use 7 for this exercise.

1 - Create services

This tutorial uses Azure Cognitive Search for indexing and queries, Cognitive Services on the backend for AI enrichment, and Azure Blob storage to provide the data. This tutorial stays under the free allocation of 20 transactions per indexer per day on Cognitive Services, so the only services you need to create are search and storage.

If possible, create both in the same region and resource group for proximity and manageability. In practice, your Azure Storage account can be in any region.

Start with Azure Storage

  1. Sign in to the Azure portal and click + Create Resource.

  2. Search for storage account and select Microsoft's Storage Account offering.

    Create Storage account

  3. In the Basics tab, the following items are required. Accept the defaults for everything else.

    • Resource group. Select an existing one or create a new one, but use the same group for all services so that you can manage them collectively.

    • Storage account name. If you think you might have multiple resources of the same type, use the name to disambiguate by type and region, for example blobstoragewestus.

    • Location. If possible, choose the same location used for Azure Cognitive Search and Cognitive Services. A single location avoids bandwidth charges.

    • Account Kind. Choose the default, StorageV2 (general purpose v2).

  4. Click Review + Create to create the service.

  5. Once it's created, click Go to resource to open the Overview page.

  6. Click Blobs service.

  7. Click + Container to create a container and name it cog-search-demo.

  8. Select cog-search-demo and then click Upload to open the folder where you saved the download files. Select all of the non-image files. You should have 7 files. Click OK to upload.

    Upload sample files

  9. Before you leave Azure Storage, get a connection string so that you can formulate a connection in Azure Cognitive Search.

    1. Browse back to the Overview page of your storage account (we used blobstoragewestus as an example).

    2. In the left navigation pane, select Access keys and copy one of the connection strings.

    The connection string looks similar to the following example:

    DefaultEndpointsProtocol=https;AccountName=cogsrchdemostorage;AccountKey=<your account key>;EndpointSuffix=core.chinacloudapi.cn
    
  10. Save the connection string to Notepad. You'll need it later when setting up the data source connection.
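The connection string is a semicolon-delimited list of key=value pairs. If you later script any of these steps, a minimal Python sketch for pulling out individual parts (partitioning on the first = only, so the = padding at the end of an account key survives):

```python
def parse_connection_string(conn_str):
    """Split an Azure Storage connection string into its key=value parts."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment:
            continue
        # partition (not split) so '=' padding inside the AccountKey survives
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts

conn = ("DefaultEndpointsProtocol=https;AccountName=cogsrchdemostorage;"
        "AccountKey=<your account key>;EndpointSuffix=core.chinacloudapi.cn")
parts = parse_connection_string(conn)
print(parts["AccountName"])  # cogsrchdemostorage
```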

Cognitive Services

AI enrichment is backed by Cognitive Services, including Text Analytics and Computer Vision for natural language and image processing. If your objective were to complete an actual prototype or project, you would at this point provision Cognitive Services (in the same region as Azure Cognitive Search) so that you can attach it to indexing operations.

For this exercise, however, you can skip resource provisioning because Azure Cognitive Search can connect to Cognitive Services behind the scenes and give you 20 free transactions per indexer run. Since this tutorial uses 7 transactions, the free allocation is sufficient. For larger projects, plan on provisioning Cognitive Services at the pay-as-you-go S0 tier. For more information, see Attach Cognitive Services.

The third component is Azure Cognitive Search, which you can create in the portal. You can use the Free tier to complete this walkthrough.

As with Azure Blob storage, take a moment to collect the access key. Further on, when you begin structuring requests, you will need to provide the endpoint and admin api-key used to authenticate each request.

  1. Sign in to the Azure portal and, on your search service Overview page, get the name of your search service. You can confirm your service name by reviewing the endpoint URL. If your endpoint URL were https://mydemo.search.azure.cn, your service name would be mydemo.

  2. In Settings > Keys, get an admin key for full rights on the service. There are two interchangeable admin keys, provided for business continuity in case you need to roll one over. You can use either the primary or secondary key on requests for adding, modifying, and deleting objects.

    Get the query key as well. It's a best practice to issue query requests with read-only access.

    Get the service name and the admin and query keys

All requests require an api-key in the header of every request sent to your service. A valid key establishes trust, on a per-request basis, between the application sending the request and the service that handles it.

2 - Set up Postman

Start Postman and set up an HTTP request. If you are unfamiliar with this tool, see Explore Azure Cognitive Search REST APIs using Postman.

The request methods used in this tutorial are POST, PUT, and GET. You'll use these methods to make four API calls to your search service: create a data source, a skillset, an index, and an indexer.

In Headers, set "Content-type" to application/json and set api-key to the admin api-key of your Azure Cognitive Search service. Once you set the headers, you can use them for every request in this exercise.
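If you later replay these calls outside Postman, the same two headers apply to every request. A sketch using Python's standard library; the service name and key below are placeholders, and the request is only constructed here, not sent:

```python
import urllib.request

SERVICE = "YOUR-SERVICE-NAME"   # placeholder: your search service name
API_KEY = "YOUR-ADMIN-API-KEY"  # placeholder: admin api-key from the portal

HEADERS = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}

def build_request(path, method="GET", body=None):
    """Construct (but do not send) a search service request with the shared headers."""
    url = f"https://{SERVICE}.search.azure.cn{path}?api-version=2019-05-06"
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=HEADERS, method=method)

req = build_request("/indexes")
print(req.full_url)
```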

Postman request URL and headers

3 - Create the pipeline

In Azure Cognitive Search, AI processing occurs during indexing (or data ingestion). This part of the walkthrough creates four objects: a data source, an index definition, a skillset, and an indexer.

Step 1: Create a data source

A data source object provides the connection string to the Blob container containing the files.

  1. Use POST and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service.

    https://[YOUR-SERVICE-NAME].search.azure.cn/datasources?api-version=2019-05-06
    
  2. In the request Body, copy the following JSON definition, replacing the connectionString with the actual connection string of your storage account.

    Remember to edit the container name as well. We suggested "cog-search-demo" for the container name in an earlier step.

    {
      "name" : "cog-search-demo-ds",
      "description" : "Demo files to demonstrate cognitive search capabilities.",
      "type" : "azureblob",
      "credentials" :
      { "connectionString" :
        "DefaultEndpointsProtocol=https;AccountName=<YOUR-STORAGE-ACCOUNT>;AccountKey=<YOUR-ACCOUNT-KEY>;"
      },
      "container" : { "name" : "<YOUR-BLOB-CONTAINER-NAME>" }
    }
    
  3. Send the request. You should see a status code of 201 confirming success.

If you got a 403 or 404 error, check the request construction: api-version=2019-05-06 should be on the endpoint, api-key should be in the Header after Content-Type, and its value must be valid for a search service. You might want to run the JSON document through an online JSON validator to make sure the syntax is correct.
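You can also validate the body locally instead of with an online tool; json.loads reports the line and column of the first syntax error it hits. The body below is the data source definition with hypothetical placeholder values:

```python
import json

body = """
{
  "name": "cog-search-demo-ds",
  "type": "azureblob",
  "credentials": { "connectionString": "<YOUR-CONNECTION-STRING>" },
  "container": { "name": "cog-search-demo" }
}
"""

try:
    parsed = json.loads(body)
    print("valid JSON, top-level keys:", sorted(parsed))
except json.JSONDecodeError as err:
    # lineno/colno pinpoint the first syntax error in the body
    print(f"syntax error at line {err.lineno}, column {err.colno}: {err.msg}")
```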

Step 2: Create a skillset

A skillset object is a set of enrichment steps applied to your content.

  1. Use PUT and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service.

    https://[YOUR-SERVICE-NAME].search.azure.cn/skillsets/cog-search-demo-ss?api-version=2019-05-06
    
  2. In the request Body, copy the JSON definition below. This skillset consists of the following built-in skills.

    • Entity Recognition: extracts the names of people, organizations, and locations from content in the blob container.
    • Language Detection: detects the content's language.
    • Text Split: breaks large content into smaller chunks before calling the key phrase extraction skill. Key phrase extraction accepts inputs of 50,000 characters or less. A few of the sample files need splitting to fit within this limit.
    • Key Phrase Extraction: pulls out the top key phrases.

    Each skill executes on the content of the document. During processing, Azure Cognitive Search cracks each document to read content from different file formats. Found text originating in the source file is placed into a generated content field, one for each document. As such, the input becomes "/document/content".

    For key phrase extraction, because we use the text splitter skill to break larger files into pages, the context for the key phrase extraction skill is "/document/pages/*" (for each page in the document) instead of "/document/content".

    {
      "description": "Extract entities, detect language and extract key-phrases",
      "skills":
      [
        {
          "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
          "categories": [ "Person", "Organization", "Location" ],
          "defaultLanguageCode": "en",
          "inputs": [
            { "name": "text", "source": "/document/content" }
          ],
          "outputs": [
            { "name": "persons", "targetName": "persons" },
            { "name": "organizations", "targetName": "organizations" },
            { "name": "locations", "targetName": "locations" }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
          "inputs": [
            { "name": "text", "source": "/document/content" }
          ],
          "outputs": [
            { "name": "languageCode", "targetName": "languageCode" }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "textSplitMode" : "pages",
          "maximumPageLength": 4000,
          "inputs": [
            { "name": "text", "source": "/document/content" },
            { "name": "languageCode", "source": "/document/languageCode" }
          ],
          "outputs": [
            { "name": "textItems", "targetName": "pages" }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
          "context": "/document/pages/*",
          "inputs": [
            { "name": "text", "source": "/document/pages/*" },
            { "name":"languageCode", "source": "/document/languageCode" }
          ],
          "outputs": [
            { "name": "keyPhrases", "targetName": "keyPhrases" }
          ]
        }
      ]
    }
    

    A graphical representation of the skillset is shown below.

    Understand a skillset

  3. Send the request. Postman should return a status code of 201 confirming success.
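To build intuition for what the Split skill does with a maximumPageLength of 4000, here is a deliberately naive sketch of page-mode splitting. The built-in skill tries to break at sentence boundaries; this simplified version just slices, so treat it as an illustration only:

```python
def split_into_pages(text, maximum_page_length=4000):
    """Naive page-mode split into fixed-size slices (the real skill respects sentence boundaries)."""
    return [text[i:i + maximum_page_length]
            for i in range(0, len(text), maximum_page_length)]

document = "x" * 10500  # stand-in for cracked document content
pages = split_into_pages(document)
print(len(pages), [len(p) for p in pages])  # 3 [4000, 4000, 2500]
```

Each resulting page then becomes one "/document/pages/*" input to the key phrase extraction skill.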

Note

Outputs can be mapped to an index, used as input to a downstream skill, or both, as is the case with language code. In the index, a language code is useful for filtering. As an input, language code is used by text analysis skills to inform the linguistic rules around word breaking. For more information about skillset fundamentals, see How to define a skillset.

Step 3: Create an index

An index provides the schema used to create the physical expression of your content in inverted indexes and other constructs in Azure Cognitive Search. The largest component of an index is the fields collection, where data type and attributes determine contents and behaviors in Azure Cognitive Search.

  1. Use PUT and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service, to name your index.

    https://[YOUR-SERVICE-NAME].search.azure.cn/indexes/cog-search-demo-idx?api-version=2019-05-06
    
  2. In the request Body, copy the following JSON definition. The content field stores the document itself. Additional fields for languageCode, keyPhrases, and organizations represent new information (fields and values) created by the skillset.

    {
      "fields": [
        {
          "name": "id",
          "type": "Edm.String",
          "key": true,
          "searchable": true,
          "filterable": false,
          "facetable": false,
          "sortable": true
        },
        {
          "name": "metadata_storage_name",
          "type": "Edm.String",
          "searchable": false,
          "filterable": false,
          "facetable": false,
          "sortable": false
        },
        {
          "name": "content",
          "type": "Edm.String",
          "sortable": false,
          "searchable": true,
          "filterable": false,
          "facetable": false
        },
        {
          "name": "languageCode",
          "type": "Edm.String",
          "searchable": true,
          "filterable": false,
          "facetable": false
        },
        {
          "name": "keyPhrases",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "filterable": false,
          "facetable": false
        },
        {
          "name": "persons",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "sortable": false,
          "filterable": true,
          "facetable": true
        },
        {
          "name": "organizations",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "sortable": false,
          "filterable": true,
          "facetable": true
        },
        {
          "name": "locations",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "sortable": false,
          "filterable": true,
          "facetable": true
        }
      ]
    }
    
  3. Send the request. Postman should return a status code of 201 confirming success.

Step 4: Create and run an indexer

An indexer drives the pipeline. The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure Cognitive Search is the event that puts the entire pipeline into motion.

  1. Use PUT and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service, to name your indexer.

    https://[YOUR-SERVICE-NAME].search.azure.cn/indexers/cog-search-demo-idxr?api-version=2019-05-06
    
  2. In the request Body, copy the JSON definition below. Notice the field mapping elements; these mappings are important because they define the data flow.

    The fieldMappings are processed before the skillset, sending content from the data source to target fields in an index. You'll use field mappings to send existing, unmodified content to the index. If field names and types are the same at both ends, no mapping is required.

    The outputFieldMappings are for fields created by skills, and thus processed after the skillset has run. The references to sourceFieldNames in outputFieldMappings don't exist until document cracking or enrichment creates them. The targetFieldName is a field in an index, defined in the index schema.

    {
      "name":"cog-search-demo-idxr",    
      "dataSourceName" : "cog-search-demo-ds",
      "targetIndexName" : "cog-search-demo-idx",
      "skillsetName" : "cog-search-demo-ss",
      "fieldMappings" : [
        {
          "sourceFieldName" : "metadata_storage_path",
          "targetFieldName" : "id",
          "mappingFunction" :
            { "name" : "base64Encode" }
        },
        {
          "sourceFieldName" : "metadata_storage_name",
          "targetFieldName" : "metadata_storage_name",
          "mappingFunction" :
            { "name" : "base64Encode" }
        },
        {
          "sourceFieldName" : "content",
          "targetFieldName" : "content"
        }
      ],
      "outputFieldMappings" :
      [
        {
          "sourceFieldName" : "/document/persons",
          "targetFieldName" : "persons"
        },
        {
          "sourceFieldName" : "/document/organizations",
          "targetFieldName" : "organizations"
        },
        {
          "sourceFieldName" : "/document/locations",
          "targetFieldName" : "locations"
        },
        {
          "sourceFieldName" : "/document/pages/*/keyPhrases/*",
          "targetFieldName" : "keyPhrases"
        },
        {
          "sourceFieldName": "/document/languageCode",
          "targetFieldName": "languageCode"
        }
      ],
      "parameters":
      {
        "maxFailedItems":-1,
        "maxFailedItemsPerBatch":-1,
        "configuration":
        {
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "default",
          "firstLineContainsHeaders": false,
          "delimitedTextDelimiter": ","
        }
      }
    }
    
  3. Send the request. Postman should return a status code of 201 confirming successful processing.

    Expect this step to take several minutes to complete. Even though the data set is small, analytical skills are computation-intensive.
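The base64Encode mapping function in the definition above exists because a blob's metadata_storage_path contains characters, such as slashes, that are invalid in a document key. A rough Python approximation, assuming URL-safe Base64 with padding trimmed; the exact padding behavior of the built-in function has varied across api-versions, so use this only to understand the idea:

```python
import base64

def encode_document_key(storage_path):
    """Approximate the base64Encode field mapping: URL-safe Base64 with '=' padding stripped."""
    return base64.urlsafe_b64encode(storage_path.encode("utf-8")).decode("ascii").rstrip("=")

# hypothetical blob path for illustration
key = encode_document_key(
    "https://cogsrchdemostorage.blob.core.chinacloudapi.cn/cog-search-demo/sample.pdf")
print(key)  # key-safe: no '/', '+', or '=' characters
```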

Note

Creating an indexer invokes the pipeline. If there are problems reaching the data, mapping inputs and outputs, or with the order of operations, they appear at this stage. To re-run the pipeline with code or script changes, you might need to drop objects first. For more information, see Reset and re-run.

About indexer parameters

The script sets "maxFailedItems" to -1, which instructs the indexing engine to ignore errors during data import. This is acceptable because there are so few documents in the demo data source. For a larger data source, you would set the value to greater than 0.

The "dataToExtract":"contentAndMetadata" statement tells the indexer to automatically extract the content from different file formats, as well as metadata related to each file.

When content is extracted, you can set imageAction to extract text from images found in the data source. The "imageAction":"generateNormalizedImages" configuration, combined with the OCR Skill and Text Merge Skill, tells the indexer to extract text from the images (for example, the word "stop" from a traffic stop sign) and embed it in the content field. This behavior applies both to images embedded in documents (think of an image inside a PDF) and to standalone images in the data source, such as a JPG file.

4 - Monitor indexing

Indexing and enrichment commence as soon as you submit the Create Indexer request. Depending on which cognitive skills you defined, indexing can take a while. To find out whether the indexer is still running, send the following request to check the indexer status.

  1. Use GET and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service.

    https://[YOUR-SERVICE-NAME].search.azure.cn/indexers/cog-search-demo-idxr/status?api-version=2019-05-06
    
  2. Review the response to learn whether the indexer is running, or to view error and warning information.

If you are using the Free tier, the following message is expected: "Could not extract content or metadata from your document. Truncated extracted text to '32768' characters." This message appears because blob indexing on the Free tier has a 32K limit on character extraction. You won't see this message for this data set on higher tiers.
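A short sketch of reading the status response programmatically. The dict below mirrors the documented response shape (status, lastResult, errors, warnings) but is an illustrative sample, not a captured payload:

```python
def summarize_indexer_status(status):
    """Summarize an indexer status payload: whether a run is active, plus last-run outcome."""
    last = status.get("lastResult") or {}
    return {
        "running": last.get("status") == "inProgress",
        "last_status": last.get("status"),
        "errors": [e.get("errorMessage") for e in last.get("errors", [])],
        "warnings": [w.get("message") for w in last.get("warnings", [])],
    }

sample = {  # illustrative sample, not a real response
    "status": "running",
    "lastResult": {
        "status": "success",
        "errors": [],
        "warnings": [{"message": "Truncated extracted text to '32768' characters"}],
    },
}
summary = summarize_indexer_status(sample)
print(summary["last_status"], summary["warnings"])
```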

Note

Warnings are common in some scenarios and do not always indicate a problem. For example, if a blob container includes image files and the pipeline doesn't handle images, you'll get a warning stating that the images were not processed.

5 - Search

Now that you've created new fields and information, let's run some queries to understand the value of cognitive search for a typical search scenario.

Recall that we started with blob content, where the entire document is packaged into a single content field. You can search this field and find matches to your queries.

  1. Use GET and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service, to search for instances of a term or phrase, returning the content field and a count of the matching documents.

    https://[YOUR-SERVICE-NAME].search.azure.cn/indexes/cog-search-demo-idx/docs?search=*&$count=true&$select=content&api-version=2019-05-06
    

    The results of this query return document contents, which is the same result you would get if you used the blob indexer without the cognitive search pipeline. This field is searchable, but unworkable if you want to use facets, filters, or autocomplete.

    Content field output

  2. For the second query, return some of the new fields created by the pipeline (persons, organizations, locations, languageCode). We're omitting keyPhrases for brevity, but you should include it if you want to see those values.

    https://[YOUR-SERVICE-NAME].search.azure.cn/indexes/cog-search-demo-idx/docs?search=*&$count=true&$select=metadata_storage_name,persons,organizations,locations,languageCode&api-version=2019-05-06
    

    The fields in the $select statement contain new information created by the natural language processing capabilities of Cognitive Services. As you might expect, there is some noise in the results and variation across documents, but in many instances, the analytical models produce accurate results.

    The following image shows results for Satya Nadella's open letter upon assuming the CEO role at Microsoft.

    Pipeline output

  3. To see how you might take advantage of these fields, add a facet parameter to return an aggregation of matching documents by location.

    https://[YOUR-SERVICE-NAME].search.azure.cn/indexes/cog-search-demo-idx/docs?search=*&facet=locations&api-version=2019-05-06
    

    In this example, there are 2 or 3 matches for each location.

    Facet output

  4. In this final example, apply a filter on the organizations collection, returning two matches for filter criteria based on NASDAQ.

    https://[YOUR-SERVICE-NAME].search.azure.cn/indexes/cog-search-demo-idx/docs?search=*&$filter=organizations/any(organizations: organizations eq 'NASDAQ')&$select=metadata_storage_name,organizations&$count=true&api-version=2019-05-06
    

These queries illustrate a few of the ways you can work with query syntax and filters on new fields created by cognitive search. For more query examples, see Examples in Search Documents REST API, Simple syntax query examples, and Full Lucene query examples.
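If you issue these queries from code rather than Postman, the filter expression must be URL-encoded (spaces, quotes, parentheses). A sketch with the standard library, reusing the NASDAQ filter from the final example above:

```python
from urllib.parse import urlencode

params = {
    "search": "*",
    "$filter": "organizations/any(organizations: organizations eq 'NASDAQ')",
    "$select": "metadata_storage_name,organizations",
    "$count": "true",
    "api-version": "2019-05-06",
}
query = urlencode(params)  # percent-encodes '$', spaces, quotes, and parentheses
url = ("https://[YOUR-SERVICE-NAME].search.azure.cn"
       f"/indexes/cog-search-demo-idx/docs?{query}")
print(url)
```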

Reset and rerun

In the early experimental stages of development, the most practical approach for design iteration is to delete the objects from Azure Cognitive Search and allow your code to rebuild them. Resource names are unique, so deleting an object lets you recreate it using the same name.

You can use the portal to delete indexes, indexers, data sources, and skillsets. When you delete an indexer, you can optionally delete the index, skillset, and data source at the same time.

Delete search objects

Or use DELETE and provide the URL of each object. The following command deletes an indexer.

DELETE https://[YOUR-SERVICE-NAME].search.azure.cn/indexers/cog-search-demo-idxr?api-version=2019-05-06

Status code 204 is returned on successful deletion.

Takeaways

This tutorial demonstrates the basic steps for building an enriched indexing pipeline through the creation of component parts: a data source, a skillset, an index, and an indexer.

Built-in skills were introduced, along with skillset definition and the mechanics of chaining skills together through inputs and outputs. You also learned that outputFieldMappings in the indexer definition is required for routing enriched values from the pipeline into a searchable index on an Azure Cognitive Search service.

Finally, you learned how to test results and reset the system for further iterations. Issuing queries against the index returns the output created by the enriched indexing pipeline.

Clean up resources

When you're working in your own subscription, it's a good idea at the end of a project to remove the resources that you no longer need. Resources left running can cost you money. You can delete resources individually or delete the resource group to delete the entire set of resources.

You can find and manage resources in the portal, using the All resources or Resource groups link in the left-navigation pane.

Next steps

Now that you're familiar with all of the objects in an AI enrichment pipeline, let's take a closer look at skillset definitions and individual skills.