Quickstart: Create an Azure Cognitive Search cognitive skillset in the Azure portal

A skillset is an AI-based feature that extracts information and structure from large undifferentiated text or image files, and makes the content both indexable and searchable in Azure Cognitive Search.

In this quickstart, you'll combine services and data in the Azure cloud to create the skillset. Once everything is in place, you'll run the Import data wizard in the Azure portal to pull it all together. The end result is a searchable index, populated with data created by AI processing, that you can query in the portal (Search explorer).


Before you begin, you must have the following:


This quickstart also uses Azure Cognitive Services for the AI. Because the workload is so small, Cognitive Services is tapped behind the scenes for free processing of up to 20 transactions. This means that you can complete this exercise without having to create an additional Cognitive Services resource.

Set up your data

In the following steps, set up a blob container in Azure Storage to store heterogeneous content files.

  1. Download sample data consisting of a small file set of different types. Unzip the files.

  2. Create an Azure storage account or find an existing account.

    • Choose the same region as Azure Cognitive Search to avoid bandwidth charges.

    • Choose the StorageV2 (general purpose V2) account type if you want to try out the knowledge store feature later, in another walkthrough. Otherwise, choose any type.

  3. Open the Blob services page and create a container. You can use the default public access level.

  4. In the container, click Upload to upload the sample files you downloaded in the first step. Notice that you have a wide range of content types, including images and application files that are not full-text searchable in their native formats.


You are now ready to move on to the Import data wizard.

Run the Import data wizard

  1. Sign in to the Azure portal with your Azure account.

  2. Find your search service and on the Overview page, click Import data on the command bar to set up cognitive enrichment in four steps.


Step 1 - Create a data source

  1. In Connect to your data, choose Azure Blob storage, then select the storage account and container you created. Give the data source a name, and use default values for the rest.


    Continue to the next page.
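If you later script this step instead of using the wizard, the choices made in Step 1 correspond roughly to a data source definition in the REST API. The following Python sketch only builds that JSON body and sends nothing to the service. The data source name, container name, and connection string placeholder are hypothetical values chosen for illustration.

```python
import json


def build_blob_data_source(name, container, connection_string):
    """Return a data source definition for Azure Blob storage as a dict."""
    return {
        "name": name,
        "type": "azureblob",  # data source type for Azure Blob storage
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


# Hypothetical names; substitute the storage account and container you created.
data_source = build_blob_data_source(
    name="demo-blob-datasource",
    container="demo-container",
    connection_string="<your-storage-connection-string>",
)
print(json.dumps(data_source, indent=2))
```

The wizard fills in the remaining settings with defaults, which is why only the name, account, and container need to be supplied here.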

Step 2 - Add cognitive skills

Next, configure AI enrichment to invoke OCR, image analysis, and natural language processing.

  1. For this quickstart, we are using the Free Cognitive Services resource. The sample data consists of 14 files, so the free allotment of 20 transactions on Cognitive Services is sufficient for this quickstart.


  2. Expand Add enrichments and make four selections.

    Enable OCR to add image analysis skills to the wizard page.

    Set granularity to Pages to break up text into smaller chunks. Several text skills are limited to 5-KB inputs.

    Choose entity recognition (people, organizations, locations) and image analysis skills.


    Continue to the next page.
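To see why the Pages granularity matters, it can help to picture what chunking does to a long document. The Python sketch below is a local illustration only and is not the algorithm the service uses. It splits text into chunks of at most 5,000 characters, breaking at a space when one is available. The function name and the use of a character count as a stand-in for the 5-KB input limit are assumptions.

```python
def split_into_pages(text, max_chars=5000):
    """Split text into chunks of at most max_chars, preferring to break at a space."""
    pages = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last space inside the window so words stay intact.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        pages.append(text[start:end].strip())
        start = end
    # Drop chunks that were only whitespace.
    return [page for page in pages if page]


long_text = "word " * 3000  # 15,000 characters of sample text
pages = split_into_pages(long_text)
print(len(pages), max(len(page) for page in pages))
```

Each chunk can then be handed to a text skill separately, which keeps every input under the size limit.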

Step 3 - Configure the index

An index contains your searchable content, and the Import data wizard can usually create the schema for you by sampling the data source. In this step, review the generated schema and revise any settings as needed. Below is the default schema created for the demo Blob data set.

For this quickstart, the wizard does a good job setting reasonable defaults:

  • Default fields are based on properties for existing blobs plus new fields to contain enrichment output (for example, people, organizations, locations). Data types are inferred from metadata and by data sampling.

  • Default document key is metadata_storage_path (selected because the field contains unique values).

  • Default attributes are Retrievable and Searchable. Searchable allows full-text search on a field. Retrievable means field values can be returned in results. The wizard assumes you want these fields to be retrievable and searchable because you created them via a skillset.
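The defaults listed above can be pictured as field definitions in an index schema. The following Python sketch builds a small, partial schema as a dict, assuming the REST API's field properties (name, type, key, searchable, retrievable). The index name and the helper function are hypothetical, and the schema the wizard generates contains more fields than shown.

```python
import json


def field(name, edm_type, key=False, searchable=True, retrievable=True):
    """Build one field definition with the attributes discussed in this step."""
    return {
        "name": name,
        "type": edm_type,
        "key": key,
        "searchable": searchable,
        "retrievable": retrievable,
    }


index_definition = {
    "name": "demo-index",  # hypothetical index name
    "fields": [
        # Document key: chosen because its values are unique per blob.
        field("metadata_storage_path", "Edm.String", key=True),
        # Text-heavy field; Retrievable can be cleared to keep results small.
        field("content", "Edm.String", retrievable=False),
        # Enrichment output fields created by the skillset.
        field("people", "Collection(Edm.String)"),
        field("organizations", "Collection(Edm.String)"),
        field("locations", "Collection(Edm.String)"),
    ],
}
print(json.dumps(index_definition, indent=2))
```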


Notice the strikethrough and question mark on the Retrievable attribute by the content field. For text-heavy blob documents, the content field contains the bulk of the file, potentially running into thousands of lines. A field like this is unwieldy in search results and you should exclude it for this demo.

However, if you need to pass file contents to client code, make sure that Retrievable stays selected. Otherwise, consider clearing this attribute on content if the extracted elements (such as people, organizations, locations, and so forth) are sufficient.

Marking a field as Retrievable does not mean that the field must be present in the search results. You can precisely control the composition of search results by using the $select query parameter to specify which fields to include. For text-heavy fields like content, the $select parameter lets you provide manageable search results to the human users of your application, while the Retrievable attribute ensures that client code can still access all the information it needs.
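As a minimal sketch of how $select narrows the result set, the following Python code composes a query URL for the REST API's Search Documents request. The service name, index name, and api-version value are assumptions for illustration. The code builds the URL only; an actual request would also need an api-key header.

```python
from urllib.parse import urlencode


def build_query_url(service, index, search, select, api_version="2020-06-30"):
    """Compose a query URL that returns only the fields named in select."""
    params = {
        "api-version": api_version,
        "search": search,
        "$select": ",".join(select),
    }
    # safe="$," keeps "$select" and the comma-separated field list readable.
    query = urlencode(params, safe="$,")
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"


# Hypothetical service and index names.
url = build_query_url(
    service="demo-service",
    index="demo-index",
    search="Microsoft",
    select=["people", "organizations", "locations"],
)
print(url)
```

Because content is left out of the select list, it is not returned, even if it is marked Retrievable in the index.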

Continue to the next page.

Step 4 - Configure the indexer

The indexer is a high-level resource that drives the indexing process. It specifies the data source name, a target index, and frequency of execution. The Import data wizard creates several objects, and one of them is always an indexer that you can run repeatedly.

  1. In the Indexer page, you can accept the default name and click the Once schedule option to run it immediately.


  2. Click Submit to create and simultaneously run the indexer.
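The pieces an indexer ties together can be sketched as a definition like the one below, modeled on the REST API's indexer properties. This is an illustration rather than the exact object the wizard creates. All resource names are hypothetical, and omitting the schedule is used here to represent the Once option.

```python
import json


def build_indexer(name, data_source, index, skillset, interval=None):
    """Build an indexer definition that links a data source, skillset, and index."""
    indexer = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
    }
    if interval is not None:
        # ISO 8601 duration, for example "PT2H" for every two hours.
        indexer["schedule"] = {"interval": interval}
    return indexer


# Hypothetical names; no schedule means the indexer runs once when created.
run_once = build_indexer(
    "demo-indexer", "demo-blob-datasource", "demo-index", "demo-skillset"
)
recurring = build_indexer(
    "demo-indexer", "demo-blob-datasource", "demo-index", "demo-skillset",
    interval="PT2H",
)
print(json.dumps(run_once, indent=2))
```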

Monitor status

Cognitive skills indexing takes longer to complete than typical text-based indexing, especially OCR and image analysis. To monitor progress, go to the Overview page and click Indexers in the middle of the page.


Warnings are normal given the wide range of content types. Some content types aren't valid for certain skills, and on lower tiers it's common to encounter indexer limits. For example, truncation notifications of 32,000 characters are an indexer limit on the Free tier. If you ran this demo on a higher tier, many truncation warnings would go away.

To check warnings or errors, click the Warning status on the Indexers list to open the Execution History page.

On that page, click the Warning status again to view the list of warnings.


Details appear when you click a specific status line. For example, a warning might say that merging stopped after reaching a maximum threshold because a particular PDF is large.
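Outside the portal, the same warnings can be inspected programmatically. The Python sketch below groups warnings by message from a status payload shaped like an indexer status response, with a lastResult object that holds a warnings list. The sample payload, including its keys and messages, is made up for illustration and is not real service output.

```python
from collections import Counter


def summarize_warnings(status):
    """Count warnings in the last indexer run, grouped by message."""
    last_result = status.get("lastResult") or {}
    warnings = last_result.get("warnings") or []
    return Counter(w.get("message", "unknown") for w in warnings)


# Made-up sample payload; real messages and keys will differ.
sample_status = {
    "status": "running",
    "lastResult": {
        "status": "success",
        "itemsProcessed": 14,
        "itemsFailed": 0,
        "errors": [],
        "warnings": [
            {"key": "doc-1", "message": "Truncated extracted text"},
            {"key": "doc-2", "message": "Truncated extracted text"},
            {"key": "doc-3", "message": "Skill input was invalid"},
        ],
    },
}

summary = summarize_warnings(sample_status)
for message, count in summary.most_common():
    print(f"{count} x {message}")
```

Grouping by message makes it easier to tell a tier limit that affects many documents from a problem with a single file.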


Query in Search explorer

After an index is created, you can run queries to return results. In the portal, use Search explorer for this task.

  1. On the search service dashboard page, click Search explorer on the command bar.

  2. Select Change Index at the top to select the index you created.

  3. Enter a search string to query the index, such as search=Microsoft&$select=people,organizations,locations,imageTags.

Results are returned as JSON, which can be verbose and hard to read, especially in large documents originating from Azure blobs. Some tips for searching in this tool include the following techniques:

  • Append $select to specify which fields to include in results.
  • Use CTRL-F to search within the JSON for specific properties or terms.

Query strings are case-sensitive, so if you get an "unknown field" message, check Fields or Index Definition (JSON) to verify name and case.
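Because field names must match exactly, a small local check can catch casing mistakes before a query is sent. The following Python sketch compares the names in a $select list against the field names of an index and suggests the correctly cased name when one exists. The function name is hypothetical, and the field list is taken from the example query above.

```python
def check_select_fields(select, index_fields):
    """Return unknown field names mapped to a correctly cased suggestion, if any."""
    known = set(index_fields)
    by_lowercase = {name.lower(): name for name in index_fields}
    problems = {}
    for name in select:
        if name not in known:
            # None means no field matches even when case is ignored.
            problems[name] = by_lowercase.get(name.lower())
    return problems


index_fields = ["people", "organizations", "locations", "imageTags"]
problems = check_select_fields(["People", "imagetags", "keyphrases"], index_fields)
print(problems)
```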



You've now created your first skillset and learned important concepts useful for prototyping an enriched search solution using your own data.

Some key concepts that we hope you picked up include the dependency on Azure data sources. A skillset is bound to an indexer, and indexers are both Azure-specific and source-specific. Although this quickstart uses Azure Blob storage, other Azure data sources are possible. For more information, see Indexers in Azure Cognitive Search.

Another important concept is that skills operate over content types, and when working with heterogeneous content, some inputs will be skipped. Also, large files or fields might exceed the indexer limits of your service tier. It's normal to see warnings when these events occur.

Output is directed to a search index, and there is a mapping between name-value pairs created during indexing and individual fields in your index. Internally, the portal sets up annotations and defines a skillset, establishing the order of operations and general flow. These steps are hidden in the portal, but when you start writing code, these concepts become important.
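The mapping between enrichment output and index fields can be sketched locally as follows. The sourceFieldName and targetFieldName property names follow the REST API's output field mappings, but the source paths and the sample enriched values below are hypothetical, and the function is an illustration of the idea rather than how the service applies mappings.

```python
def apply_output_field_mappings(enriched, mappings):
    """Copy name-value pairs from enrichment output into index field names."""
    document = {}
    for mapping in mappings:
        source = mapping["sourceFieldName"]
        if source in enriched:
            document[mapping["targetFieldName"]] = enriched[source]
    return document


# Hypothetical enrichment paths and values for one source file.
enriched = {
    "/document/people": ["Satya Nadella"],
    "/document/organizations": ["Microsoft"],
    "/document/internal_scratch": "not mapped, so not indexed",
}
mappings = [
    {"sourceFieldName": "/document/people", "targetFieldName": "people"},
    {"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"},
    {"sourceFieldName": "/document/locations", "targetFieldName": "locations"},
]

index_document = apply_output_field_mappings(enriched, mappings)
print(index_document)
```

Only values with a mapping reach the index, which is why the portal sets these mappings up for every enrichment field it creates.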

Finally, you learned that you can verify content by querying the index. In the end, what Azure Cognitive Search provides is a searchable index, which you can query using either the simple or fully extended query syntax. An index containing enriched fields is like any other. If you want to incorporate standard or custom analyzers, scoring profiles, synonyms, faceted filters, geo-search, or any other Azure Cognitive Search feature, you can certainly do so.

Clean up resources

When you're working in your own subscription, it's a good idea at the end of a project to identify whether you still need the resources you created. Resources left running can cost you money. You can delete resources individually or delete the resource group to delete the entire set of resources.

You can find and manage resources in the portal, using the All resources or Resource groups link in the left-navigation pane.

If you are using a free service, remember that you are limited to three indexes, indexers, and data sources. You can delete individual items in the portal to stay under the limit.

Next steps

You can create skillsets using the portal, .NET SDK, or REST API. To further your knowledge, try the REST API using Postman and more sample data.


If you want to repeat this exercise or try a different AI enrichment walkthrough, delete the indexer in the portal. Deleting the indexer resets the free daily transaction counter back to zero for Cognitive Services processing.