AI enrichment in Azure Cognitive Search

AI enrichment is an extension of indexers that can be used to extract text from images, blobs, and other unstructured data sources. Enrichment and extraction make your content more searchable in the indexer's output objects: a search index or a knowledge store.

Extraction and enrichment are implemented using cognitive skills attached to the indexer-driven pipeline. You can use built-in skills from Microsoft or embed external processing into a custom skill that you create. Examples of a custom skill include a custom entity module or a document classifier targeting a specific domain such as finance, scientific publications, or medicine.

Built-in skills fall into two categories: natural language processing and image processing.

Figure: Enrichment pipeline diagram

Built-in skills in Azure Cognitive Search are based on pre-trained machine learning models in Cognitive Services APIs: Computer Vision and Text Analytics. You can attach a Cognitive Services resource if you want to use these models during content processing.

Natural language and image processing are applied during the data ingestion phase, and the results become part of a document's composition in a searchable index in Azure Cognitive Search. Data is sourced as an Azure data set and then pushed through an indexing pipeline using whichever built-in skills you need.

When to use AI enrichment

You should consider using built-in cognitive skills if your raw content is unstructured text, image content, or content that needs language detection and translation. Applying AI through the built-in cognitive skills can unlock this content, increasing its value and utility in your search and data science apps.

Additionally, you might consider adding a custom skill if you have open-source, third-party, or first-party code that you'd like to integrate into the pipeline. Classification models that identify salient characteristics of various document types fall into this category, but any package that adds value to your content could be used.

More about built-in skills

A skillset that's assembled using built-in skills is well suited for the following application scenarios:

  • Scanned documents (JPEG) that you want to make full-text searchable. You can attach an optical character recognition (OCR) skill to identify, extract, and ingest text from JPEG files.

  • PDFs with combined image and text. Text in PDFs can be extracted during indexing without the use of enrichment steps, but adding image and natural language processing can often produce a better outcome than standard indexing provides.

  • Multilingual content to which you want to apply language detection and possibly text translation.

  • Unstructured or semi-structured documents containing content that has inherent meaning or context that is hidden in the larger document.

    Blobs in particular often contain a large body of content that is packed into a single "field". By attaching image and natural language processing skills to an indexer, you can create new information that is extant in the raw content, but not otherwise surfaced as distinct fields. Some ready-to-use built-in cognitive skills that can help: key phrase extraction, sentiment analysis, and entity recognition (people, organizations, and locations).

    Additionally, built-in skills can be used to restructure content through text split, merge, and shape operations.

More about custom skills

Custom skills can support more complex scenarios, such as recognizing forms, or custom entity detection using a model that you provide and wrap in the custom skill web interface. One example of a custom skill is custom entity recognition.
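To make the custom skill web interface more concrete, here is a minimal sketch of a request handler. It assumes the custom Web API skill contract, in which the request and response each carry a `values` array whose items have a `recordId` and a `data` object. The `classify_document` function, its keyword rule, and the `category` output name are hypothetical stand-ins for a model you would provide.

```python
from typing import Any, Dict, List


def classify_document(text: str) -> str:
    """Hypothetical stand-in for a domain model that you would supply."""
    finance_terms = ("invoice", "dividend", "ledger")
    lowered = text.lower()
    return "finance" if any(term in lowered for term in finance_terms) else "other"


def handle_skill_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Process one custom skill request and return one result per input record."""
    results: List[Dict[str, Any]] = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text")
        if isinstance(text, str):
            results.append({
                "recordId": record["recordId"],
                "data": {"category": classify_document(text)},
                "errors": [],
                "warnings": [],
            })
        else:
            # Report a per-record error instead of failing the whole batch.
            results.append({
                "recordId": record["recordId"],
                "data": {},
                "errors": [{"message": "Missing 'text' input."}],
                "warnings": [],
            })
    return {"values": results}
```

Each response record echoes the `recordId` of the request record it answers, which is how results are matched back to the documents being enriched.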

Steps in an enrichment pipeline

An enrichment pipeline is based on indexers. Indexers crack source documents and populate an index based on field-to-field mappings between the index and your data source. Skills, now attached to indexers, intercept and enrich documents according to the skillset(s) you define. Once content is indexed, you can access it via search requests through all query types supported by Azure Cognitive Search. If you are new to indexers, this section walks you through the steps.

Step 1: Connection and document cracking phase

At the start of the pipeline, you have unstructured text or non-text content (such as images, scanned documents, or JPEG files). Data must exist in an Azure data storage service that can be accessed by an indexer. Indexers can "crack" source documents to extract text from source data. Document cracking is the process of extracting or creating text content from non-text sources during indexing.

Figure: Document cracking phase

Supported sources include Azure Blob storage, Azure Table storage, Azure SQL Database, and Azure Cosmos DB. Text-based content can be extracted from the following file types: PDF, Word, PowerPoint, and CSV files. For the full list, see Supported formats. Indexing takes time, so start with a small, representative data set and then build it up incrementally as your solution matures.
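As a rough illustration of the connection step, the sketch below builds a data source definition for a blob container. It assumes the body shape of the Create Data Source REST operation; the names `sample-docs-ds` and `sample-docs` and the connection string placeholder are hypothetical.

```python
import json
from typing import Any, Dict


def build_blob_data_source(name: str, connection_string: str, container: str) -> Dict[str, Any]:
    """Return a data source definition that points an indexer at a blob container."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


data_source = build_blob_data_source(
    name="sample-docs-ds",                            # hypothetical name
    connection_string="<storage-connection-string>",  # placeholder, not a real secret
    container="sample-docs",                          # hypothetical container with a small sample
)
print(json.dumps(data_source, indent=2))
```

Pointing the data source at a small sample container first keeps indexing runs short while the rest of the pipeline is still changing.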

Step 2: Cognitive skills and enrichment phase

Enrichment is performed with cognitive skills that carry out atomic operations. For example, once you have cracked a PDF, you can apply entity recognition, language detection, or key phrase extraction to produce new fields in your index that are not available natively in the source. Altogether, the collection of skills used in your pipeline is called a skillset.

Figure: Enrichment phase

A skillset is based on built-in cognitive skills or custom skills that you provide and connect to the skillset. A skillset can be minimal or highly complex, and determines not only the type of processing, but also the order of operations. A skillset plus the field mappings defined as part of an indexer fully specifies the enrichment pipeline. For more information about pulling all of these pieces together, see Define a skillset.
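The following is a minimal sketch of a skillset definition, assuming the built-in skill type names and input/output shapes used by the skillset REST API. The skillset name is hypothetical, and `/document/content` assumes the indexer cracked each document into a `content` field. The `output_paths` helper is illustrative only; it shows how one skill's output becomes another skill's input, which is one way the order of operations is expressed.

```python
from typing import Any, Dict, List


def build_skillset(name: str) -> Dict[str, Any]:
    """Return a skillset with language detection, key phrases, and entity recognition."""
    skills: List[Dict[str, Any]] = [
        {
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "context": "/document",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "languageCode", "targetName": "languageCode"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document",
            "inputs": [
                {"name": "text", "source": "/document/content"},
                # Consumes the output of the language detection skill above.
                {"name": "languageCode", "source": "/document/languageCode"},
            ],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
            "context": "/document",
            "categories": ["Person", "Organization", "Location"],
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [
                {"name": "persons", "targetName": "persons"},
                {"name": "organizations", "targetName": "organizations"},
                {"name": "locations", "targetName": "locations"},
            ],
        },
    ]
    return {"name": name, "description": "Sample enrichment steps", "skills": skills}


def output_paths(skill: Dict[str, Any]) -> List[str]:
    """List the enriched-document paths that a skill writes to."""
    return [f"{skill['context']}/{out['targetName']}" for out in skill["outputs"]]


skillset = build_skillset("sample-skillset")  # hypothetical name
for skill in skillset["skills"]:
    print(skill["@odata.type"], "->", output_paths(skill))
```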

Internally, the pipeline generates a collection of enriched documents. You can decide which parts of the enriched documents should be mapped to indexable fields in your search index. For example, if you applied the key phrase extraction and the entity recognition skills, those new fields would become part of the enriched document, and can be mapped to fields in your index. See Annotations to learn more about how inputs and outputs are formed.
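The mapping from an enriched document to index fields can be pictured with a small local simulation. This is not how the service is implemented; it is a sketch that assumes simple `/document/...` paths without wildcards, and the sample document is invented for illustration. The mapping entries use the `sourceFieldName` and `targetFieldName` properties of an indexer's output field mappings.

```python
from typing import Any, Dict, List


def resolve_path(enriched: Dict[str, Any], path: str) -> Any:
    """Walk a simple '/document/...' path through nested dicts (no wildcards)."""
    node: Any = enriched
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


def apply_output_field_mappings(
    enriched: Dict[str, Any], mappings: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Copy only the mapped parts of the enriched document into an index document."""
    return {
        m["targetFieldName"]: resolve_path(enriched, m["sourceFieldName"])
        for m in mappings
    }


# Invented sample of what an enriched document might contain after enrichment.
enriched_document = {
    "document": {
        "content": "Contoso opened an office in Seattle.",
        "keyPhrases": ["Contoso", "office"],
        "organizations": ["Contoso"],
        "locations": ["Seattle"],
    }
}
output_field_mappings = [
    {"sourceFieldName": "/document/keyPhrases", "targetFieldName": "keyPhrases"},
    {"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"},
]
index_document = apply_output_field_mappings(enriched_document, output_field_mappings)
print(index_document)
```

In this sketch, `locations` exists in the enriched document but is not mapped, so it does not reach the index document.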

Add a knowledgeStore element to save enrichments

Search REST api-version=2020-06-30 extends skillsets with a knowledgeStore definition that provides an Azure storage connection and projections that describe how the enrichments are stored. This is in addition to your index. In a standard AI pipeline, enriched documents are transitory, used only during indexing and then discarded. With a knowledge store, enriched documents are preserved. For more information, see Knowledge store.
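As a hedged sketch, a knowledgeStore element could be added to an existing skillset definition as shown here. The property names follow the knowledgeStore definition in the skillset REST API; the table name, the generated key name, and the `/document/tableprojection` source path are hypothetical, and the source path assumes that a shape operation earlier in the skillset produced a node with that name.

```python
from typing import Any, Dict


def add_knowledge_store(
    skillset: Dict[str, Any], storage_connection_string: str, table_name: str
) -> Dict[str, Any]:
    """Return a copy of the skillset with a knowledgeStore element attached."""
    updated = dict(skillset)  # shallow copy so the original definition is untouched
    updated["knowledgeStore"] = {
        "storageConnectionString": storage_connection_string,
        "projections": [
            {
                "tables": [
                    {
                        "tableName": table_name,
                        "generatedKeyName": "DocumentId",       # hypothetical key name
                        "source": "/document/tableprojection",  # hypothetical shaped node
                    }
                ],
                "objects": [],
                "files": [],
            }
        ],
    }
    return updated


base_skillset = {"name": "sample-skillset", "skills": []}  # hypothetical, skills omitted
with_store = add_knowledge_store(base_skillset, "<storage-connection-string>", "sampleDocuments")
print(sorted(with_store.keys()))
```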

Step 3: Search index and query-based access

When processing is finished, you have a search index consisting of enriched documents, fully text-searchable in Azure Cognitive Search. Querying the index is how developers and users access the enriched content generated by the pipeline.

Figure: Index with search icon

The index is like any other you might create for Azure Cognitive Search: you can supplement it with custom analyzers, invoke fuzzy search queries, add filtered search, or experiment with scoring profiles to reshape the search results.
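For instance, a fuzzy query combined with a filter could be expressed as a GET request URL, as sketched below. The sketch assumes the Search Documents REST operation with the full Lucene syntax (`queryType=full`), where a trailing `~` requests a fuzzy match. The service name, index name, and the filterable `languageCode` field are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_url(
    service: str,
    index: str,
    search: str,
    filter_expr: Optional[str] = None,
    api_version: str = "2020-06-30",
) -> str:
    """Build a query URL against an index; nothing is sent over the network."""
    params = {"api-version": api_version, "search": search, "queryType": "full"}
    if filter_expr:
        params["$filter"] = filter_expr
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{urlencode(params)}"


# "seatle~" is a deliberately misspelled term that a fuzzy match could still find.
url = build_search_url("my-service", "enriched-index", "seatle~", "languageCode eq 'en'")
print(parse_qs(urlparse(url).query))
```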

Indexes are generated from an index schema that defines the fields, attributes, and other constructs attached to a specific index, such as scoring profiles and synonym maps. Once an index is defined and populated, you can index incrementally to pick up new and updated source documents. Certain modifications require a full rebuild. You should use a small data set until the schema design is stable. For more information, see How to rebuild an index.
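A minimal sketch of an index schema that holds both source fields and enrichment output might look like this. The field types and attributes follow the index definition used by the Create Index REST operation; the index name and the choice of fields are hypothetical.

```python
from typing import Any, Dict


def build_index_schema(name: str) -> Dict[str, Any]:
    """Return an index schema with source fields plus fields for generated values."""
    return {
        "name": name,
        "fields": [
            # Fields that come from the source data.
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Fields stubbed out to hold values created during enrichment.
            {"name": "languageCode", "type": "Edm.String", "searchable": False, "filterable": True},
            {"name": "keyPhrases", "type": "Collection(Edm.String)", "searchable": True, "filterable": True},
            {"name": "organizations", "type": "Collection(Edm.String)", "searchable": True, "filterable": True},
        ],
    }


schema = build_index_schema("enriched-index")  # hypothetical name
print([field["name"] for field in schema["fields"]])
```

Attribute choices such as `filterable` are worth settling on a small data set first, because some schema modifications require a full rebuild.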

Checklist: A typical workflow

  1. Subset your Azure source data into a representative sample. Indexing takes time, so start with a small, representative data set and then build it up incrementally as your solution matures.

  2. Create a data source object in Azure Cognitive Search to provide a connection string for data retrieval.

  3. Create a skillset with enrichment steps.

  4. Define the index schema. The Fields collection includes fields from source data. You should also stub out additional fields to hold generated values for content created during enrichment.

  5. Define the indexer referencing the data source, skillset, and index.

  6. Within the indexer, add outputFieldMappings. This section maps output from the skillset (in step 3) to the input fields in the index schema (in step 4).

  7. Send the Create Indexer request you just defined (a POST request with an indexer definition in the request body) to create the indexer in Azure Cognitive Search. This step is how you run the indexer, invoking the pipeline.

  8. Run queries to evaluate results and modify code to update skillsets, schema, or indexer configuration.

  9. Reset the indexer before rebuilding the pipeline.
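The checklist above can be sketched as an ordered plan of REST requests. This is an outline under stated assumptions rather than a complete client: it only builds method and URL pairs and sends nothing. The endpoint paths assume the Search REST API with api-version=2020-06-30, and all object names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

API_VERSION = "2020-06-30"


def build_indexer(name: str, data_source: str, skillset: str, index: str) -> Dict[str, Any]:
    """Steps 5 and 6: an indexer that references the other objects and maps skill output."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "skillsetName": skillset,
        "targetIndexName": index,
        "outputFieldMappings": [
            {"sourceFieldName": "/document/keyPhrases", "targetFieldName": "keyPhrases"},
        ],
    }


def plan_requests(service: str, indexer: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return the create requests in checklist order, followed by the reset request."""
    base = f"https://{service}.search.windows.net"
    suffix = f"?api-version={API_VERSION}"
    return [
        ("POST", f"{base}/datasources{suffix}"),                       # step 2
        ("POST", f"{base}/skillsets{suffix}"),                         # step 3
        ("POST", f"{base}/indexes{suffix}"),                           # step 4
        ("POST", f"{base}/indexers{suffix}"),                          # step 7, runs the pipeline
        ("POST", f"{base}/indexers/{indexer['name']}/reset{suffix}"),  # step 9
    ]


indexer = build_indexer("sample-indexer", "sample-docs-ds", "sample-skillset", "enriched-index")
plan = plan_requests("my-service", indexer)
for method, request_url in plan:
    print(method, request_url)
```

Step 8 (running queries and adjusting definitions) happens between the Create Indexer request and the reset, so it does not appear in the plan.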

Next steps