Tutorial: AI-generated searchable content from Azure blobs using the .NET SDK

If you have unstructured text or images in Azure Blob storage, an AI enrichment pipeline can extract information and create new content that is useful for full-text search or knowledge mining scenarios. In this C# tutorial, you apply Optical Character Recognition (OCR) to images and perform natural language processing to create new fields that you can use in queries, facets, and filters.

This tutorial uses C# and the .NET SDK to perform the following tasks:

  • Start with application files and images in Azure Blob storage.
  • Define a pipeline that adds OCR, text extraction, language detection, and entity and key phrase recognition.
  • Define an index to store the output (raw content, plus pipeline-generated name-value pairs).
  • Execute the pipeline to start transformations and analysis, and to create and load the index.
  • Explore results using full-text search and a rich query syntax.

If you don't have an Azure subscription, open a trial account before you begin.

Prerequisites

Note

You can use the free service for this tutorial. A free search service limits you to three indexes, three indexers, and three data sources. This tutorial creates one of each. Before starting, make sure you have room on your service to accept the new resources.

Download files

  1. Open this OneDrive folder and, in the top-left corner, click Download to copy the files to your computer.

  2. Right-click the zip file and select Extract All. There are 14 files of various types. You'll use 7 for this exercise.

You can also download the source code for this tutorial. Source code is in the tutorial-ai-enrichment folder in the azure-search-dotnet-samples repository.

1 - Create services

This tutorial uses Azure Cognitive Search for indexing and queries, Cognitive Services on the backend for AI enrichment, and Azure Blob storage to provide the data. This tutorial stays under the free allocation of 20 transactions per indexer per day on Cognitive Services, so the only services you need to create are search and storage.

If possible, create both in the same region and resource group for proximity and manageability. In practice, your Azure Storage account can be in any region.

Start with Azure Storage

  1. Sign in to the Azure portal and click + Create Resource.

  2. Search for storage account and select Microsoft's Storage Account offering.


  3. In the Basics tab, the following items are required. Accept the defaults for everything else.

    • Resource group. Select an existing one or create a new one, but use the same group for all services so that you can manage them collectively.

    • Storage account name. If you think you might have multiple resources of the same type, use the name to disambiguate by type and region, for example blobstoragewestus.

    • Location. If possible, choose the same location used for Azure Cognitive Search and Cognitive Services. A single location avoids bandwidth charges.

    • Account Kind. Choose the default, StorageV2 (general purpose v2).

  4. Click Review + Create to create the service.

  5. Once it's created, click Go to the resource to open the Overview page.

  6. Click the Blobs service.

  7. Click + Container to create a container and name it cog-search-demo.

  8. Select cog-search-demo and then click Upload to open the folder where you saved the download files. Select all fourteen files and click OK to upload.


  9. Before you leave Azure Storage, get a connection string so that you can formulate a connection in Azure Cognitive Search.

    1. Browse back to the Overview page of your storage account (we used blobstoragewestus as an example).

    2. In the left navigation pane, select Access keys and copy one of the connection strings.

    The connection string is a semicolon-delimited list of settings similar to the following example:

    DefaultEndpointsProtocol=https;AccountName=blobstoragechinaeast;AccountKey=<your account key>;EndpointSuffix=core.chinacloudapi.cn
    
  10. Save the connection string to Notepad. You'll need it later when setting up the data source connection.
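
Before pasting the connection string into a configuration file, it can help to confirm that it contains the parts you expect. The following is a minimal sketch, not part of any Azure SDK: the class and method names (ConnectionStringHelper, ParseConnectionString) are hypothetical and exist only for this illustration. It splits the string on semicolons and then on the first equals sign of each pair, because an account key can itself contain '=' characters.

```csharp
using System;
using System.Collections.Generic;

public static class ConnectionStringHelper
{
    // Hypothetical helper for illustration only; not part of the Azure SDK.
    // Splits "Key1=Value1;Key2=Value2" into a case-insensitive dictionary.
    public static Dictionary<string, string> ParseConnectionString(string connectionString)
    {
        var parts = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        foreach (string pair in connectionString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the FIRST '=' only, because the account key may end in "==".
            int separator = pair.IndexOf('=');
            if (separator <= 0)
            {
                continue; // skip segments that have no key
            }

            string key = pair.Substring(0, separator).Trim();
            string value = pair.Substring(separator + 1).Trim();
            parts[key] = value;
        }

        return parts;
    }
}
```

For example, calling ConnectionStringHelper.ParseConnectionString on the sample string above and reading the "AccountName" entry should give you the storage account name you created.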

Cognitive Services

AI enrichment is backed by Cognitive Services, including Text Analytics and Computer Vision for natural language and image processing. If your objective were to complete an actual prototype or project, you would at this point provision Cognitive Services (in the same region as Azure Cognitive Search) so that you can attach it to indexing operations.

For this exercise, however, you can skip resource provisioning because Azure Cognitive Search can connect to Cognitive Services behind the scenes and give you 20 free transactions per indexer run. Since this tutorial uses 14 transactions, the free allocation is sufficient. For larger projects, plan on provisioning Cognitive Services at the pay-as-you-go S0 tier. For more information, see Attach Cognitive Services.

The third component is Azure Cognitive Search, which you can create in the portal. You can use the Free tier to complete this walkthrough.

To interact with your Azure Cognitive Search service, you need the service URL and an access key. A search service is created with both, so if you added Azure Cognitive Search to your subscription, follow these steps to get the necessary information:

  1. Sign in to the Azure portal, and in your search service Overview page, get the URL. An example endpoint might look like https://mydemo.search.azure.cn.

  2. In Settings > Keys, get an admin key for full rights on the service. There are two interchangeable admin keys, provided for business continuity in case you need to roll one over. You can use either the primary or secondary key on requests for adding, modifying, and deleting objects.

    Get the query key as well. It's a best practice to issue query requests with read-only access.


Having a valid key establishes trust, on a per-request basis, between the application sending the request and the service that handles it.

2 - Set up your environment

Begin by opening Visual Studio and creating a new Console App project that can run on .NET Core.

Install NuGet packages

The Azure Cognitive Search .NET SDK consists of a few client libraries that enable you to manage your indexes, data sources, indexers, and skillsets, as well as upload and manage documents and execute queries, all without having to deal with the details of HTTP and JSON. These client libraries are all distributed as NuGet packages.

For this project, install version 9 or later of the Microsoft.Azure.Search NuGet package.

  1. In a browser, go to the Microsoft.Azure.Search NuGet package page.

  2. Select the latest version (9 or later).

  3. Copy the Package Manager command.

  4. Open the Package Manager Console. Select Tools > NuGet Package Manager > Package Manager Console.

  5. Paste and run the command that you copied in the previous step.

Next, install the latest Microsoft.Extensions.Configuration.Json NuGet package.

  1. Select Tools > NuGet Package Manager > Manage NuGet Packages for Solution....

  2. Click Browse and search for the Microsoft.Extensions.Configuration.Json NuGet package.

  3. Select the package, select your project, confirm the version is the latest stable version, then click Install.

Add service connection information

  1. Right-click on your project in the Solution Explorer and select Add > New Item....

  2. Name the file appsettings.json and select Add.

  3. Include this file in your output directory.

    1. Right-click on appsettings.json and select Properties.
    2. Change the value of Copy to Output Directory to Copy if newer.
  4. Copy the following JSON into your new JSON file.

    {
      "SearchServiceName": "Put your search service name here",
      "SearchServiceAdminApiKey": "Put your primary or secondary API key here",
      "SearchServiceQueryApiKey": "Put your query API key here",
      "AzureBlobConnectionString": "Put your Azure Blob connection string here"
    }
    

Add your search service and blob storage account information. Recall that you can get this information from the service provisioning steps indicated in the previous section.

For SearchServiceName, enter the short service name and not the full URL.
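
A common slip is pasting the full endpoint URL into SearchServiceName. As a minimal sketch, the hypothetical helper below (ServiceNameHelper.ToShortServiceName is an illustrative name, not an SDK call) reduces a pasted https URL to the short name and leaves a short name unchanged. It assumes the short name is the first label of the host, as in the example endpoint https://mydemo.search.azure.cn.

```csharp
using System;

public static class ServiceNameHelper
{
    // Hypothetical helper for illustration only; not part of the Azure SDK.
    // Returns the short service name even if a full http(s) URL was supplied.
    public static string ToShortServiceName(string value)
    {
        string trimmed = value.Trim();

        bool isUrl = Uri.TryCreate(trimmed, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttps || uri.Scheme == Uri.UriSchemeHttp);

        if (isUrl)
        {
            // "mydemo.search.azure.cn" -> "mydemo"
            return uri.Host.Split('.')[0];
        }

        // Already a short name (or something this sketch does not recognize).
        return trimmed;
    }
}
```

You could call this on the configuration value before constructing the client, although simply entering the short name in appsettings.json is the approach this tutorial takes.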

Add namespaces

In Program.cs, add the following namespaces.

using System;
using System.Collections.Generic;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;
using Microsoft.Extensions.Configuration;

namespace EnrichwithAI

Create a client

Create an instance of the SearchServiceClient class under Main.

public static void Main(string[] args)
{
    // Create service client
    IConfigurationBuilder builder = new ConfigurationBuilder().AddJsonFile("appsettings.json");
    IConfigurationRoot configuration = builder.Build();
    SearchServiceClient serviceClient = CreateSearchServiceClient(configuration);

CreateSearchServiceClient creates a new SearchServiceClient using values that are stored in the application's config file (appsettings.json).

private static SearchServiceClient CreateSearchServiceClient(IConfigurationRoot configuration)
{
   string searchServiceName = configuration["SearchServiceName"];
   string adminApiKey = configuration["SearchServiceAdminApiKey"];

   SearchServiceClient serviceClient = new SearchServiceClient(searchServiceName, new SearchCredentials(adminApiKey));
   serviceClient.SearchDnsSuffix = "search.azure.cn";
   return serviceClient;
}

Note

The SearchServiceClient class manages connections to your search service. To avoid opening too many connections, try to share a single instance of SearchServiceClient in your application if possible. Its methods are thread-safe to enable such sharing.
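
One way to share a single instance is to create it lazily and expose it through a static property. The sketch below shows the pattern with Lazy<T>, which is thread-safe by default. To keep the example compilable without the NuGet package, it uses a stand-in class, FakeSearchClient, which is purely hypothetical; in a real application you would substitute SearchServiceClient and its constructor arguments.

```csharp
using System;
using System.Threading;

// Stand-in for SearchServiceClient so this sketch compiles on its own.
// Hypothetical type for illustration only.
public sealed class FakeSearchClient
{
    private static int created;

    // Number of instances constructed so far.
    public static int Created => created;

    public FakeSearchClient()
    {
        Interlocked.Increment(ref created);
    }
}

public static class SharedClient
{
    // Lazy<T> defaults to thread-safe initialization, so the factory
    // delegate runs at most once even under concurrent first access.
    private static readonly Lazy<FakeSearchClient> instance =
        new Lazy<FakeSearchClient>(() => new FakeSearchClient());

    public static FakeSearchClient Instance => instance.Value;
}
```

Every caller that reads SharedClient.Instance receives the same object, so only one client is ever constructed.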

Add function to exit the program during failure

This tutorial is meant to help you understand each step of the indexing pipeline. If a critical issue prevents the program from creating the data source, skillset, index, or indexer, the program outputs the error message and exits so that you can understand and address the issue.

Add ExitProgram to the same class as Main to handle scenarios that require the program to exit.

private static void ExitProgram(string message)
{
    Console.WriteLine("{0}", message);
    Console.WriteLine("Press any key to exit the program...");
    Console.ReadKey();
    Environment.Exit(1); // non-zero exit code signals that the program failed
}

3 - Create the pipeline

In Azure Cognitive Search, AI processing occurs during indexing (or data ingestion). This part of the walkthrough creates four objects: data source, index definition, skillset, indexer.

Step 1: Create a data source

SearchServiceClient has a DataSources property. This property provides all the methods you need to create, list, update, or delete Azure Cognitive Search data sources.

Build a new DataSource instance with DataSource.AzureBlobStorage, which requires that you specify the data source name, connection string, and blob container name. Then create it on the search service by calling serviceClient.DataSources.CreateOrUpdate(dataSource).

private static DataSource CreateOrUpdateDataSource(SearchServiceClient serviceClient, IConfigurationRoot configuration)
{
    DataSource dataSource = DataSource.AzureBlobStorage(
        name: "demodata",
        storageConnectionString: configuration["AzureBlobConnectionString"],
        containerName: "cog-search-demo",
        description: "Demo files to demonstrate cognitive search capabilities.");

    // The data source does not need to be deleted if it was already created
    // since we are using the CreateOrUpdate method
    try
    {
        serviceClient.DataSources.CreateOrUpdate(dataSource);
    }
    catch (Exception e)
    {
        Console.WriteLine("Failed to create or update the data source\n Exception message: {0}\n", e.Message);
        ExitProgram("Cannot continue without a data source");
    }

    return dataSource;
}

For a successful request, the method returns the data source that was created. If there is a problem with the request, such as an invalid parameter, the method throws an exception.

Now add a line in Main to call the CreateOrUpdateDataSource function that you've just added.

public static void Main(string[] args)
{
    // Create service client
    IConfigurationBuilder builder = new ConfigurationBuilder().AddJsonFile("appsettings.json");
    IConfigurationRoot configuration = builder.Build();
    SearchServiceClient serviceClient = CreateSearchServiceClient(configuration);

    // Create or Update the data source
    Console.WriteLine("Creating or updating the data source...");
    DataSource dataSource = CreateOrUpdateDataSource(serviceClient, configuration);

Build and run the solution. Since this is your first request, check the Azure portal to confirm the data source was created in Azure Cognitive Search. On the search service dashboard page, verify the Data Sources tile has a new item. You might need to wait a few minutes for the portal page to refresh.


Step 2: Create a skillset

In this section, you define a set of enrichment steps that you want to apply to your data. Each enrichment step is called a skill, and the set of enrichment steps is called a skillset. This tutorial uses the following built-in cognitive skills for the skillset:

  • Optical Character Recognition to recognize printed and handwritten text in image files.

  • Text Merger to consolidate text from a collection of fields into a single field.

  • Language Detection to identify the content's language.

  • Text Split to break large content into smaller chunks before calling the key phrase extraction skill and the entity recognition skill. Key phrase extraction and entity recognition accept inputs of 50,000 characters or less. A few of the sample files need splitting up to fit within this limit.

  • Entity Recognition for extracting the names of organizations from content in the blob container.

  • Key Phrase Extraction to pull out the top key phrases.

During initial processing, Azure Cognitive Search cracks each document to read content from different file formats. Text found in the source file is placed into a generated content field, one for each document. As such, set the input as "/document/content" to use this text.

Outputs can be mapped to an index, used as input to a downstream skill, or both, as is the case with language code. In the index, a language code is useful for filtering. As an input, language code is used by text analysis skills to inform the linguistic rules around word breaking.

For more information about skillset fundamentals, see How to define a skillset.

OCR skill

The OCR skill extracts text from images. This skill assumes that a normalized_images field exists. To generate this field, later in the tutorial we'll set the "imageAction" configuration in the indexer definition to "generateNormalizedImages".

private static OcrSkill CreateOcrSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
    inputMappings.Add(new InputFieldMappingEntry(
        name: "image",
        source: "/document/normalized_images/*"));

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
    outputMappings.Add(new OutputFieldMappingEntry(
        name: "text",
        targetName: "text"));

    OcrSkill ocrSkill = new OcrSkill(
        description: "Extract text (plain and structured) from image",
        context: "/document/normalized_images/*",
        inputs: inputMappings,
        outputs: outputMappings,
        defaultLanguageCode: OcrSkillLanguage.En,
        shouldDetectOrientation: true);

    return ocrSkill;
}

Merge skill

In this section, you'll create a Merge skill that merges the document content field with the text that was produced by the OCR skill.

private static MergeSkill CreateMergeSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
    inputMappings.Add(new InputFieldMappingEntry(
        name: "text",
        source: "/document/content"));
    inputMappings.Add(new InputFieldMappingEntry(
        name: "itemsToInsert",
        source: "/document/normalized_images/*/text"));
    inputMappings.Add(new InputFieldMappingEntry(
        name: "offsets",
        source: "/document/normalized_images/*/contentOffset"));

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
    outputMappings.Add(new OutputFieldMappingEntry(
        name: "mergedText",
        targetName: "merged_text"));

    MergeSkill mergeSkill = new MergeSkill(
        description: "Create merged_text which includes all the textual representation of each image inserted at the right location in the content field.",
        context: "/document",
        inputs: inputMappings,
        outputs: outputMappings,
        insertPreTag: " ",
        insertPostTag: " ");

    return mergeSkill;
}

Language detection skill

The Language Detection skill detects the language of the input text and reports a single language code for every document submitted on the request. We'll use the output of the Language Detection skill as part of the input to the Text Split skill.

private static LanguageDetectionSkill CreateLanguageDetectionSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
    inputMappings.Add(new InputFieldMappingEntry(
        name: "text",
        source: "/document/merged_text"));

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
    outputMappings.Add(new OutputFieldMappingEntry(
        name: "languageCode",
        targetName: "languageCode"));

    LanguageDetectionSkill languageDetectionSkill = new LanguageDetectionSkill(
        description: "Detect the language used in the document",
        context: "/document",
        inputs: inputMappings,
        outputs: outputMappings);

    return languageDetectionSkill;
}

Text split skill

The Split skill below splits text by pages and limits the page length to 4,000 characters as measured by String.Length. The algorithm tries to split the text into chunks that are at most maximumPageLength in size. In this case, the algorithm does its best to break on a sentence boundary, so the size of a chunk may be slightly less than maximumPageLength.

private static SplitSkill CreateSplitSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();

    inputMappings.Add(new InputFieldMappingEntry(
        name: "text",
        source: "/document/merged_text"));
    inputMappings.Add(new InputFieldMappingEntry(
        name: "languageCode",
        source: "/document/languageCode"));

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
    outputMappings.Add(new OutputFieldMappingEntry(
        name: "textItems",
        targetName: "pages"));

    SplitSkill splitSkill = new SplitSkill(
        description: "Split content into pages",
        context: "/document",
        inputs: inputMappings,
        outputs: outputMappings,
        textSplitMode: TextSplitMode.Pages,
        maximumPageLength: 4000);

    return splitSkill;
}
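
To make the effect of maximumPageLength more concrete, here is a simplified local sketch of page splitting. It is not the algorithm the Text Split skill actually runs, which is not described in this tutorial; PageSplitSketch and SplitIntoPages are hypothetical names, and treating '.', '!' and '?' as the only sentence boundaries is an assumption made for brevity. The sketch takes up to maximumPageLength characters and, if more text remains, backs up to the last sentence terminator in that window so that pages end on a sentence boundary where possible.

```csharp
using System;
using System.Collections.Generic;

public static class PageSplitSketch
{
    // Illustrative only: NOT the service's implementation.
    // Splits text into pages of at most maximumPageLength characters,
    // preferring to end each page right after '.', '!' or '?'.
    public static List<string> SplitIntoPages(string text, int maximumPageLength)
    {
        if (maximumPageLength <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maximumPageLength));
        }

        var pages = new List<string>();
        int start = 0;

        while (start < text.Length)
        {
            int length = Math.Min(maximumPageLength, text.Length - start);

            // Only look for a sentence boundary if more text follows this page.
            if (start + length < text.Length)
            {
                // Search backwards through the current window.
                int lastBreak = text.LastIndexOfAny(
                    new[] { '.', '!', '?' }, start + length - 1, length);

                if (lastBreak >= start)
                {
                    length = lastBreak - start + 1; // keep the terminator on this page
                }
            }

            pages.Add(text.Substring(start, length));
            start += length;
        }

        return pages;
    }
}
```

With a limit of 10 characters, the text "One. Two. Three." is split after "One. Two." (9 characters) rather than in the middle of "Three", which is why a page can be slightly shorter than the limit.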

Entity recognition skill

This EntityRecognitionSkill instance is set to recognize category type organization. The Entity Recognition skill can also recognize category types person and location.

Notice that the "context" field is set to "/document/pages/*" with an asterisk, meaning the enrichment step is called for each page under "/document/pages".

private static EntityRecognitionSkill CreateEntityRecognitionSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
    inputMappings.Add(new InputFieldMappingEntry(
        name: "text",
        source: "/document/pages/*"));

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
    outputMappings.Add(new OutputFieldMappingEntry(
        name: "organizations",
        targetName: "organizations"));

    List<EntityCategory> entityCategory = new List<EntityCategory>();
    entityCategory.Add(EntityCategory.Organization);

    EntityRecognitionSkill entityRecognitionSkill = new EntityRecognitionSkill(
        description: "Recognize organizations",
        context: "/document/pages/*",
        inputs: inputMappings,
        outputs: outputMappings,
        categories: entityCategory,
        defaultLanguageCode: EntityRecognitionSkillLanguage.En);

    return entityRecognitionSkill;
}

Key phrase extraction skill

Like the EntityRecognitionSkill instance that was just created, the Key Phrase Extraction skill is called for each page of the document.

private static KeyPhraseExtractionSkill CreateKeyPhraseExtractionSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
    inputMappings.Add(new InputFieldMappingEntry(
        name: "text",
        source: "/document/pages/*"));
    inputMappings.Add(new InputFieldMappingEntry(
        name: "languageCode",
        source: "/document/languageCode"));

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
    outputMappings.Add(new OutputFieldMappingEntry(
        name: "keyPhrases",
        targetName: "keyPhrases"));

    KeyPhraseExtractionSkill keyPhraseExtractionSkill = new KeyPhraseExtractionSkill(
        description: "Extract the key phrases",
        context: "/document/pages/*",
        inputs: inputMappings,
        outputs: outputMappings);

    return keyPhraseExtractionSkill;
}

Build and create the skillset

Build the Skillset using the skills you created.

private static Skillset CreateOrUpdateDemoSkillSet(SearchServiceClient serviceClient, IList<Skill> skills)
{
    Skillset skillset = new Skillset(
        name: "demoskillset",
        description: "Demo skillset",
        skills: skills);

    // Create the skillset in your search service.
    // The skillset does not need to be deleted if it was already created
    // since we are using the CreateOrUpdate method
    try
    {
        serviceClient.Skillsets.CreateOrUpdate(skillset);
    }
    catch (Exception e)
    {
        Console.WriteLine("Failed to create the skillset\n Exception message: {0}\n", e.Message);
        ExitProgram("Cannot continue without a skillset");
    }

    return skillset;
}

Add the following lines to Main.

    // Create the skills
    Console.WriteLine("Creating the skills...");
    OcrSkill ocrSkill = CreateOcrSkill();
    MergeSkill mergeSkill = CreateMergeSkill();
    EntityRecognitionSkill entityRecognitionSkill = CreateEntityRecognitionSkill();
    LanguageDetectionSkill languageDetectionSkill = CreateLanguageDetectionSkill();
    SplitSkill splitSkill = CreateSplitSkill();
    KeyPhraseExtractionSkill keyPhraseExtractionSkill = CreateKeyPhraseExtractionSkill();

    // Create the skillset
    Console.WriteLine("Creating or updating the skillset...");
    List<Skill> skills = new List<Skill>();
    skills.Add(ocrSkill);
    skills.Add(mergeSkill);
    skills.Add(languageDetectionSkill);
    skills.Add(splitSkill);
    skills.Add(entityRecognitionSkill);
    skills.Add(keyPhraseExtractionSkill);

    Skillset skillset = CreateOrUpdateDemoSkillSet(serviceClient, skills);

Step 3: Create an index

In this section, you define the index schema by specifying which fields to include in the searchable index, and the search attributes for each field. Fields have a type and can take attributes that determine how the field is used (searchable, sortable, and so forth). Field names in an index are not required to identically match the field names in the source. In a later step, you add field mappings in an indexer to connect source-destination fields. For this step, define the index using field naming conventions pertinent to your search application.

This exercise uses the following fields and field types:

field-names      field-types
id               Edm.String
content          Edm.String
languageCode     Edm.String
keyPhrases       List<Edm.String>
organizations    List<Edm.String>

Create DemoIndex Class

The fields for this index are defined using a model class. Each property of the model class has attributes which determine the search-related behaviors of the corresponding index field.

We'll add the model class to a new C# file. Right-click your project and select Add > New Item..., select "Class", name the file DemoIndex.cs, then select Add.

Make sure to indicate that you want to use types from the Microsoft.Azure.Search and Microsoft.Azure.Search.Models namespaces.

Add the following model class definition to DemoIndex.cs and include it in the same namespace where you'll create the index.

using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

namespace EnrichwithAI
{
    // The SerializePropertyNamesAsCamelCase attribute is defined in the Azure Search .NET SDK.
    // It ensures that Pascal-case property names in the model class are mapped to camel-case
    // field names in the index.
    [SerializePropertyNamesAsCamelCase]
    public class DemoIndex
    {
        [System.ComponentModel.DataAnnotations.Key]
        [IsSearchable, IsSortable]
        public string Id { get; set; }

        [IsSearchable]
        public string Content { get; set; }

        [IsSearchable]
        public string LanguageCode { get; set; }

        [IsSearchable]
        public string[] KeyPhrases { get; set; }

        [IsSearchable]
        public string[] Organizations { get; set; }
    }
}
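
The comment on SerializePropertyNamesAsCamelCase says that Pascal-case property names map to camel-case field names. As a rough illustration of that mapping for the properties in DemoIndex, the sketch below lowercases the first character of a name. This is a simplification under stated assumptions: CamelCaseSketch.ToCamelCase is a hypothetical helper, not the SDK's serializer, and real serializers may treat names that begin with several capitals differently.

```csharp
using System;

public static class CamelCaseSketch
{
    // Illustrative only: lowercases the first character of a property name.
    // Not the SDK's implementation; acronym-style names are not handled.
    public static string ToCamelCase(string propertyName)
    {
        if (string.IsNullOrEmpty(propertyName) || char.IsLower(propertyName[0]))
        {
            return propertyName;
        }

        return char.ToLowerInvariant(propertyName[0]) + propertyName.Substring(1);
    }
}
```

Applied to the model above, Id, Content, LanguageCode, KeyPhrases, and Organizations become id, content, languageCode, keyPhrases, and organizations, which are the field names listed for this index.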

Now that you've defined a model class, back in Program.cs you can create an index definition fairly easily. The name for this index will be demoindex. If an index already exists with that name, it will be deleted.

private static Index CreateDemoIndex(SearchServiceClient serviceClient)
{
    var index = new Index()
    {
        Name = "demoindex",
        Fields = FieldBuilder.BuildForType<DemoIndex>()
    };

    try
    {
        bool exists = serviceClient.Indexes.Exists(index.Name);

        if (exists)
        {
            serviceClient.Indexes.Delete(index.Name);
        }

        serviceClient.Indexes.Create(index);
    }
    catch (Exception e)
    {
        Console.WriteLine("Failed to create the index\n Exception message: {0}\n", e.Message);
        ExitProgram("Cannot continue without an index");
    }

    return index;
}

During testing, you may find that you're attempting to create the index more than once. For this reason, the code checks whether the index already exists before attempting to create it.

Add the following lines to Main.

    // Create the index
    Console.WriteLine("Creating the index...");
    Microsoft.Azure.Search.Models.Index demoIndex = CreateDemoIndex(serviceClient);

Add the following using statement to resolve the ambiguous reference to Index.

using Index = Microsoft.Azure.Search.Models.Index;

To learn more about defining an index, see Create Index (Azure Cognitive Search REST API).

Step 4: Create and run an indexer

So far you have created a data source, a skillset, and an index. These three components become part of an indexer that pulls each piece together into a single multi-phased operation. To tie them together in an indexer, you must define field mappings.

  • The fieldMappings are processed before the skillset, mapping source fields from the data source to target fields in an index. If field names and types are the same at both ends, no mapping is required.

  • The outputFieldMappings are processed after the skillset, referencing sourceFieldNames that don't exist until document cracking or enrichment creates them. The targetFieldName is a field in an index.

In addition to hooking up inputs to outputs, you can also use field mappings to flatten data structures. For more information, see How to map enriched fields to a searchable index.

private static Indexer CreateDemoIndexer(SearchServiceClient serviceClient, DataSource dataSource, Skillset skillSet, Index index)
{
    IDictionary<string, object> config = new Dictionary<string, object>();
    config.Add(
        key: "dataToExtract",
        value: "contentAndMetadata");
    config.Add(
        key: "imageAction",
        value: "generateNormalizedImages");

    List<FieldMapping> fieldMappings = new List<FieldMapping>();
    fieldMappings.Add(new FieldMapping(
        sourceFieldName: "metadata_storage_path",
        targetFieldName: "id",
        mappingFunction: new FieldMappingFunction(
            name: "base64Encode")));
    fieldMappings.Add(new FieldMapping(
        sourceFieldName: "content",
        targetFieldName: "content"));

    List<FieldMapping> outputMappings = new List<FieldMapping>();
    outputMappings.Add(new FieldMapping(
        sourceFieldName: "/document/pages/*/organizations/*",
        targetFieldName: "organizations"));
    outputMappings.Add(new FieldMapping(
        sourceFieldName: "/document/pages/*/keyPhrases/*",
        targetFieldName: "keyPhrases"));
    outputMappings.Add(new FieldMapping(
        sourceFieldName: "/document/languageCode",
        targetFieldName: "languageCode"));

    Indexer indexer = new Indexer(
        name: "demoindexer",
        dataSourceName: dataSource.Name,
        targetIndexName: index.Name,
        description: "Demo Indexer",
        skillsetName: skillSet.Name,
        parameters: new IndexingParameters(
            maxFailedItems: -1,
            maxFailedItemsPerBatch: -1,
            configuration: config),
        fieldMappings: fieldMappings,
        outputFieldMappings: outputMappings);

    try
    {
        bool exists = serviceClient.Indexers.Exists(indexer.Name);

        if (exists)
        {
            serviceClient.Indexers.Delete(indexer.Name);
        }

        serviceClient.Indexers.Create(indexer);
    }
    catch (Exception e)
    {
        Console.WriteLine("Failed to create the indexer\n Exception message: {0}\n", e.Message);
        ExitProgram("Cannot continue without creating an indexer");
    }

    return indexer;
}
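
The first field mapping above applies base64Encode to metadata_storage_path before using it as the document key. One likely reason is that document keys can contain only letters, digits, underscores, dashes, and equal signs, while a blob path contains characters such as ':' and '/'. The sketch below illustrates the idea with a plain URL-safe Base64 transform. KeyEncodingSketch, its methods, and the sample path are hypothetical, and the service's built-in function may use a different Base64 variant, so this is not meant to reproduce its exact output.

```csharp
using System;
using System.Linq;
using System.Text;

public static class KeyEncodingSketch
{
    // Returns true when every character is allowed in a document key:
    // ASCII letters, digits, '_', '-', or '='.
    public static bool IsValidKey(string key)
    {
        return !string.IsNullOrEmpty(key) && key.All(c =>
            (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            c == '_' || c == '-' || c == '=');
    }

    // URL-safe Base64: standard Base64 with '+' and '/' swapped for '-' and '_'.
    // Assumption: illustrative variant only, not the service's exact encoding.
    public static string ToUrlSafeBase64(string value)
    {
        string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(value));
        return base64.Replace('+', '-').Replace('/', '_');
    }

    // Reverses ToUrlSafeBase64 to recover the original string.
    public static string FromUrlSafeBase64(string encoded)
    {
        string base64 = encoded.Replace('-', '+').Replace('_', '/');
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

A raw path such as "https://myaccount.blob.example/basic-demo-data/sample.pdf" (a made-up value) fails IsValidKey, while its encoded form passes and can still be decoded back to the original path.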

Add the following lines to Main.

    // Create the indexer, map fields, and execute transformations
    Console.WriteLine("Creating the indexer and executing the pipeline...");
    Indexer demoIndexer = CreateDemoIndexer(serviceClient, dataSource, skillset, demoIndex);

Expect that creating the indexer will take a little time to complete. Even though the data set is small, analytical skills are computation-intensive. Some skills, such as image analysis, are long-running.

Tip

Creating an indexer invokes the pipeline. If there are problems reaching the data, mapping inputs and outputs, or with the order of operations, they appear at this stage.

Explore creating the indexer

The code sets "maxFailedItems" to -1, which instructs the indexing engine to ignore errors during data import. This is useful because there are so few documents in the demo data source. For a larger data source, you would set the value to greater than 0.

Also notice that "dataToExtract" is set to "contentAndMetadata". This setting tells the indexer to automatically extract the content from different file formats, as well as the metadata related to each file.

When content is extracted, you can set imageAction to extract text from images found in the data source. Setting "imageAction" to "generateNormalizedImages", combined with the OCR Skill and Text Merge Skill, tells the indexer to extract text from the images (for example, the word "stop" from a traffic Stop sign) and embed it as part of the content field. This behavior applies both to images embedded in documents (think of an image inside a PDF) and to images found in the data source, for instance a JPG file.

4 - Monitor indexing

Once the indexer is defined, it runs automatically when you submit the request. Depending on which cognitive skills you defined, indexing can take longer than you expect. To find out whether the indexer is still running, use the GetStatus method.

private static void CheckIndexerOverallStatus(SearchServiceClient serviceClient, Indexer indexer)
{
    try
    {
        IndexerExecutionInfo demoIndexerExecutionInfo = serviceClient.Indexers.GetStatus(indexer.Name);

        switch (demoIndexerExecutionInfo.Status)
        {
            case IndexerStatus.Error:
                ExitProgram("Indexer has error status. Check the Azure Portal to further understand the error.");
                break;
            case IndexerStatus.Running:
                Console.WriteLine("Indexer is running");
                break;
            case IndexerStatus.Unknown:
                Console.WriteLine("Indexer status is unknown");
                break;
            default:
                Console.WriteLine("No indexer information");
                break;
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("Failed to get indexer overall status\n Exception message: {0}\n", e.Message);
    }
}

IndexerExecutionInfo represents the current status and execution history of an indexer.

Warnings are common with some source file and skill combinations and do not always indicate a problem. In this tutorial, the warnings are benign (for example, no text inputs from the JPEG files).

Add the following lines to Main.

    // Check indexer overall status
    Console.WriteLine("Check the indexer overall status...");
    CheckIndexerOverallStatus(serviceClient, demoIndexer);
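
The status check above runs once. If you want the program to wait until indexing has finished before querying, one option is a small polling loop. The following is a minimal sketch under stated assumptions: PollingSketch and WaitUntil are hypothetical names, and the commented usage assumes that IndexerExecutionInfo exposes a LastResult whose Status can be compared with IndexerExecutionStatus.InProgress.

```csharp
using System;
using System.Threading;

public static class PollingSketch
{
    // Calls isFinished repeatedly, sleeping pollInterval between calls.
    // Returns true as soon as isFinished returns true, or false once
    // the timeout has elapsed without the condition being met.
    public static bool WaitUntil(Func<bool> isFinished, TimeSpan pollInterval, TimeSpan timeout)
    {
        DateTime deadline = DateTime.UtcNow + timeout;

        while (true)
        {
            if (isFinished())
            {
                return true;
            }

            if (DateTime.UtcNow >= deadline)
            {
                return false;
            }

            Thread.Sleep(pollInterval);
        }
    }
}

// Hypothetical usage with the tutorial's client (property and enum names are assumptions):
//
// bool finished = PollingSketch.WaitUntil(
//     () =>
//     {
//         var last = serviceClient.Indexers.GetStatus(demoIndexer.Name).LastResult;
//         return last != null && last.Status != IndexerExecutionStatus.InProgress;
//     },
//     TimeSpan.FromSeconds(5),
//     TimeSpan.FromMinutes(10));
```

Passing the condition as a delegate keeps the loop independent of the SDK, so the interval and timeout can be tuned without touching the status check itself.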

After indexing is finished, you can run queries that return the contents of individual fields. By default, Azure Cognitive Search returns the top 50 results. The sample data is small, so the default works fine. However, when working with larger data sets, you might need to include parameters in the query string to return more results. For instructions, see How to page results in Azure Cognitive Search.

As a verification step, query the index for all of the fields.

Add the following lines to Main.

DocumentSearchResult<DemoIndex> results;

ISearchIndexClient indexClientForQueries = CreateSearchIndexClient(configuration);

SearchParameters parameters =
    new SearchParameters
    {
        Select = new[] { "organizations" }
    };

try
{
    results = indexClientForQueries.Documents.Search<DemoIndex>("*", parameters);
}
catch (Exception e)
{
    // Handle exception
}

CreateSearchIndexClient creates a new SearchIndexClient using values that are stored in the application's config file (appsettings.json). Notice that the search service query API key is used, not the admin key.

private static SearchIndexClient CreateSearchIndexClient(IConfigurationRoot configuration)
{
   string searchServiceName = configuration["SearchServiceName"];
   string queryApiKey = configuration["SearchServiceQueryApiKey"];

   SearchIndexClient indexClient = new SearchIndexClient(searchServiceName, "demoindex", new SearchCredentials(queryApiKey));
   indexClient.SearchDnsSuffix = "search.azure.cn";
   return indexClient;
}

Add the following code to Main. The first try-catch runs an unfiltered query, which returns every retrievable field of the matching documents. The second is a parameterized query, where Select specifies which fields to include in the results, for example organizations. A search string of "*" matches all documents, so the second query returns the organizations field of each document.

// Verify content is returned after indexing is finished.
// Reuses results, indexClientForQueries, and parameters, which were
// declared earlier in Main; redeclaring them here would not compile.

try
{
    // Unfiltered query: all retrievable fields of every matching document.
    results = indexClientForQueries.Documents.Search<DemoIndex>("*");
    Console.WriteLine("First query succeeded with a result count of {0}", results.Results.Count);
}
catch (Exception e)
{
    Console.WriteLine("First query failed\n Exception message: {0}\n", e.Message);
}

try
{
    // Parameterized query: only the fields named in parameters.Select.
    results = indexClientForQueries.Documents.Search<DemoIndex>("*", parameters);
    Console.WriteLine("Second query succeeded with a result count of {0}", results.Results.Count);
}
catch (Exception e)
{
    Console.WriteLine("Second query failed\n Exception message: {0}\n", e.Message);
}

Repeat for the other fields in this exercise: content, languageCode, and keyPhrases. To return multiple fields in a single query, list each field name in the collection assigned to the Select property.
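
As a sketch of a multi-field query, the method below selects all four enriched fields in one request and prints part of each result. QuerySketch and PrintEnrichedFields are hypothetical names; the method assumes the DemoIndex class and an ISearchIndexClient such as the one returned by CreateSearchIndexClient, and it needs a live search service to run.

```csharp
using System;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

namespace EnrichwithAI
{
    public static class QuerySketch
    {
        // Queries all documents and prints the enriched fields of each one.
        public static void PrintEnrichedFields(ISearchIndexClient indexClient)
        {
            var parameters = new SearchParameters
            {
                // Each field to return is a separate entry in the collection.
                Select = new[] { "content", "languageCode", "keyPhrases", "organizations" }
            };

            DocumentSearchResult<DemoIndex> results =
                indexClient.Documents.Search<DemoIndex>("*", parameters);

            foreach (SearchResult<DemoIndex> result in results.Results)
            {
                DemoIndex doc = result.Document;

                // Collection fields can be null when a skill produced no output.
                Console.WriteLine("languageCode: {0}", doc.LanguageCode);
                Console.WriteLine("keyPhrases: {0}", string.Join(", ", doc.KeyPhrases ?? new string[0]));
                Console.WriteLine("organizations: {0}", string.Join(", ", doc.Organizations ?? new string[0]));
            }
        }
    }
}
```

Fields that are not named in Select, such as id here, come back as null on the deserialized DemoIndex object.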

Reset and rerun

In the early experimental stages of development, the most practical approach for design iteration is to delete the objects from Azure Cognitive Search and allow your code to rebuild them. Resource names are unique. Deleting an object lets you recreate it using the same name.

The sample code for this tutorial checks for existing objects and deletes them so that you can rerun your code.

You can also use the portal to delete indexes, indexers, data sources, and skillsets.
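
If you prefer to reset everything from code in one place, the check-and-delete pattern shown earlier for the index and the indexer can be gathered into a single helper. This is a sketch, not part of the tutorial's sample: CleanupSketch and DeleteDemoObjects are hypothetical names, the object names are passed in rather than assumed, and the Skillsets and DataSources calls are assumed to mirror the Exists and Delete calls used for indexes and indexers.

```csharp
using Microsoft.Azure.Search;

public static class CleanupSketch
{
    // Deletes each object only if it exists, so the method is safe to call
    // before the pipeline has ever been created.
    public static void DeleteDemoObjects(
        SearchServiceClient serviceClient,
        string indexerName,
        string indexName,
        string skillsetName,
        string dataSourceName)
    {
        // Remove the indexer first so nothing is running against the other objects.
        if (serviceClient.Indexers.Exists(indexerName))
        {
            serviceClient.Indexers.Delete(indexerName);
        }

        if (serviceClient.Indexes.Exists(indexName))
        {
            serviceClient.Indexes.Delete(indexName);
        }

        // Assumption: Skillsets and DataSources expose the same Exists/Delete pattern.
        if (serviceClient.Skillsets.Exists(skillsetName))
        {
            serviceClient.Skillsets.Delete(skillsetName);
        }

        if (serviceClient.DataSources.Exists(dataSourceName))
        {
            serviceClient.DataSources.Delete(dataSourceName);
        }
    }
}
```

For this tutorial, the indexer and index names would be "demoindexer" and "demoindex"; the skillset and data source names are whatever you chose in the earlier steps.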

Takeaways

This tutorial demonstrated the basic steps for building an enriched indexing pipeline through the creation of component parts: a data source, skillset, index, and indexer.

Built-in skills were introduced, along with skillset definition and the mechanics of chaining skills together through inputs and outputs. You also learned that outputFieldMappings in the indexer definition is required for routing enriched values from the pipeline into a searchable index on an Azure Cognitive Search service.

Finally, you learned how to test results and reset the system for further iterations. You learned that issuing queries against the index returns the output created by the enriched indexing pipeline. You also learned how to check indexer status, and which objects to delete before rerunning a pipeline.

Clean up resources

When you're working in your own subscription, it's a good idea to remove the resources that you no longer need at the end of a project. Resources left running can cost you money. You can delete resources individually or delete the resource group to delete the entire set of resources.

You can find and manage resources in the portal, using the All resources or Resource groups link in the left-navigation pane.

Next steps

Now that you're familiar with all of the objects in an AI enrichment pipeline, let's take a closer look at skillset definitions and individual skills.