Knowledge store in Azure Cognitive Search

Knowledge store is a feature of Azure Cognitive Search that persists output from an AI enrichment pipeline for independent analysis or downstream processing. An enriched document is a pipeline's output, created from content that has been extracted, structured, and analyzed using AI processes. In a standard AI pipeline, enriched documents are transitory: they are used only during indexing and then discarded. Creating a knowledge store lets you preserve the enriched documents.

If you have used cognitive skills in the past, you already know that skillsets move a document through a sequence of enrichments. The outcome can be a search index, or projections in a knowledge store. The two outputs, search index and knowledge store, are products of the same pipeline. They are derived from the same inputs, but the resulting output is structured, stored, and used in very different ways.

Physically, a knowledge store is Azure Storage: either Azure Table storage, Azure Blob storage, or both. Any tool or process that can connect to Azure Storage can consume the contents of a knowledge store.

Figure: Knowledge store in the pipeline

Benefits of knowledge store

A knowledge store gives you structure, context, and actual content, gleaned from unstructured and semi-structured data files like blobs, image files that have undergone analysis, or even structured data, and reshaped into new forms. In a step-by-step walkthrough, you can see first-hand how a dense JSON document is partitioned into substructures, reconstituted into new structures, and otherwise made available for downstream processes like machine learning and data science workloads.

Although it's useful to see what an AI enrichment pipeline can produce, the real potential of a knowledge store is the ability to reshape data. You might start with a basic skillset, and then iterate over it to add increasing levels of structure, which you can then combine into new structures that are consumable in apps other than Azure Cognitive Search.

The benefits of knowledge store include the following:

  • Consume enriched documents in analytics and reporting tools other than search. Power BI with Power Query is a compelling choice, but any tool or app that can connect to Azure Storage can pull from a knowledge store that you create.

  • Refine an AI-indexing pipeline while debugging steps and skillset definitions. A knowledge store shows you the product of a skillset definition in an AI-indexing pipeline. You can use those results to design a better skillset because you can see exactly what the enrichments look like. You can use Storage Explorer in Azure Storage to view the contents of a knowledge store.

  • Shape the data into new forms. The reshaping is codified in skillsets, but the point is that a skillset can now provide this capability. The Shaper skill in Azure Cognitive Search has been extended to accommodate this task. Reshaping allows you to define a projection that aligns with your intended use of the data while preserving relationships.

Note

New to AI enrichment and cognitive skills? Azure Cognitive Search integrates with Cognitive Services Vision and Language features to extract and enrich source data using Optical Character Recognition (OCR) over image files, entity recognition and key phrase extraction from text files, and more. For more information, see AI enrichment in Azure Cognitive Search.

Physical storage

The physical expression of a knowledge store is articulated through the projections element of a knowledgeStore definition in a skillset. The projection defines the structure of the output so that it matches your intended use.

Projections can be articulated as tables, objects, or files.

"knowledgeStore": {
    "storageConnectionString": "<YOUR-AZURE-STORAGE-ACCOUNT-CONNECTION-STRING>",
    "projections": [
        {
            "tables": [ ],
            "objects": [ ],
            "files": [ ]
        },
        {
            "tables": [ ],
            "objects": [ ],
            "files": [ ]
        }
    ]
}

The type of projection you specify in this structure determines the type of storage used by the knowledge store.

  • Table storage is used when you define tables. Define a table projection when you need tabular reporting structures as inputs to analytical tools, or when you want to export data frames to other data stores. You can specify multiple tables to get a subset or cross section of enriched documents. Within the same projection group, table relationships are preserved so that you can work with all of them.

  • Blob storage is used when you define objects or files. The physical representation of an object is a hierarchical JSON structure that represents an enriched document. A file is an image extracted from a document, transferred intact to Blob storage.
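The mapping from projection type to storage type can be sketched in a few lines of Python. This is only an illustration of the rule described above, not part of the service: the helper name storage_types is hypothetical, and the table name, container name, and source paths are placeholder values chosen for this example.

```python
def storage_types(projection_group):
    """Return the Azure Storage types a projection group would use.

    Hypothetical helper: it only encodes the rule from the text
    (tables -> Table storage; objects or files -> Blob storage).
    """
    used = set()
    if projection_group.get("tables"):
        used.add("table")
    if projection_group.get("objects") or projection_group.get("files"):
        used.add("blob")
    return used


# Placeholder projection group: one table plus one file projection.
group = {
    "tables": [
        {
            "tableName": "Documents",            # assumed table name
            "generatedKeyName": "DocumentId",    # assumed key column name
            "source": "/document/tableprojection",
        }
    ],
    "objects": [],
    "files": [
        {
            "storageContainer": "images",        # assumed container name
            "source": "/document/normalized_images/*",
        }
    ],
}

print(sorted(storage_types(group)))
```

A group with empty tables, objects, and files lists yields an empty set, which mirrors the fact that nothing is written to storage until at least one projection is defined.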

A single projection object contains one set of tables, objects, and files. For many scenarios, creating one projection might be enough.

However, it is possible to create multiple sets of table-object-file projections, and you might do that if you want different data relationships. Within a set, data is related, assuming those relationships exist and can be detected. If you create additional sets, the data in one group is never related to the data in another group. For example, you might use multiple projection groups when you want the same data projected one way for use with your online system, and projected a different way for use in a data science pipeline.
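As a rough sketch of the multiple-group scenario, the following Python builds a knowledgeStore definition with two projection groups: one tabular group for reporting and one object group for a data science pipeline. The helper name build_knowledge_store, the table and container names, and the source paths are all assumptions made for illustration; the source paths in particular presume that earlier skills (for example, a Shaper skill) have produced nodes with those names.

```python
import json


def build_knowledge_store(connection_string, projection_groups):
    """Assemble a knowledgeStore definition from a list of projection groups."""
    return {
        "storageConnectionString": connection_string,
        "projections": list(projection_groups),
    }


# Group 1: related tables for reporting. Tables in the same group keep
# their relationships to each other.
reporting_group = {
    "tables": [
        {"tableName": "Documents", "generatedKeyName": "DocumentId",
         "source": "/document/tableprojection"},
        {"tableName": "KeyPhrases", "generatedKeyName": "KeyPhraseId",
         "source": "/document/tableprojection/keyPhrases/*"},
    ],
    "objects": [],
    "files": [],
}

# Group 2: whole enriched documents as JSON blobs. Nothing in this group
# is related to the tables in group 1.
data_science_group = {
    "tables": [],
    "objects": [
        {"storageContainer": "enriched-docs",
         "source": "/document/objectprojection"},
    ],
    "files": [],
}

knowledge_store = build_knowledge_store(
    "<YOUR-AZURE-STORAGE-ACCOUNT-CONNECTION-STRING>",
    [reporting_group, data_science_group],
)

print(json.dumps({"knowledgeStore": knowledge_store}, indent=2))
```

The printed JSON has the same outline as the knowledgeStore skeleton shown earlier, with the empty arrays filled in.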

Requirements

Azure Storage is required. It provides physical storage. You can use Blob storage, Table storage, or both. Blob storage is used for intact enriched documents, usually when the output is going to downstream processes. Table storage is for slices of enriched documents, commonly used for analysis and reporting.

A skillset is required. It contains the knowledgeStore definition, and it determines the structure and composition of an enriched document. You cannot create a knowledge store using an empty skillset. You must have at least one skill in a skillset.

An indexer is required. A skillset is invoked by an indexer, which drives the execution. Indexers come with their own set of requirements and attributes. Several of these attributes have a direct bearing on a knowledge store:

  • Indexers require a supported Azure data source (the pipeline that ultimately creates the knowledge store starts by pulling data from a supported source on Azure).

  • Indexers require a search index. An indexer requires that you provide an index schema, even if you never plan to use it. A minimal index has one string field, designated as the key.

  • Indexers provide optional field mappings, used to alias a source field to a destination field. If a default field mapping needs modification (to use a different name or type), you can create a field mapping within an indexer. For knowledge store output, the destination can be a field in a blob object or table.

  • Indexers have schedules and other properties, such as change detection mechanisms provided by various data sources, that can also be applied to a knowledge store. For example, you can schedule enrichment at regular intervals to refresh the contents.
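The indexer requirements above can be sketched as two small request bodies. This is a minimal illustration, assuming the REST property names as I understand them for api-version=2020-06-30; the helper names, the resource names, the key field name, the field mapping, and the two-hour interval are all hypothetical values chosen for the example.

```python
def minimal_index(name, key_field="id"):
    """Smallest usable index: a single string field designated as the key."""
    return {
        "name": name,
        "fields": [
            {"name": key_field, "type": "Edm.String", "key": True},
        ],
    }


def indexer_definition(name, data_source, index, skillset, interval="PT2H"):
    """Indexer that ties a data source, index, and skillset together.

    The interval is an ISO 8601 duration; "PT2H" (every two hours) is an
    arbitrary value picked for this sketch.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "schedule": {"interval": interval},
        # Optional: alias a source field to a differently named target field.
        "fieldMappings": [
            {"sourceFieldName": "metadata_storage_path", "targetFieldName": "id"},
        ],
    }


index = minimal_index("demo-index")
indexer = indexer_definition(
    "demo-indexer", "demo-datasource", index["name"], "demo-skillset"
)
print(indexer["targetIndexName"], indexer["schedule"]["interval"])
```

Even if the index is never queried, the indexer definition still has to point at one, which is why the sketch creates the minimal index first.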

How to create a knowledge store

To create a knowledge store, use the portal or the REST API (api-version=2020-06-30).

Use the Azure portal

The Import data wizard includes options for creating a knowledge store. For initial exploration, create your first knowledge store in four steps:

  1. Select a supported data source.

  2. Specify enrichment: attach a resource, select skills, and specify a knowledge store.

  3. Create an index schema. The wizard requires it and can infer one for you.

  4. Run the wizard. Extraction, enrichment, and storage occur in this last step.

Use Create Skillset (REST API)

A knowledgeStore is defined within a skillset, which in turn is invoked by an indexer. During enrichment, Azure Cognitive Search creates a space in your Azure Storage account and projects the enriched documents as blobs or into tables, depending on your configuration.

The REST API is one mechanism by which you can create a knowledge store programmatically. An easy way to explore is to create your first knowledge store using Postman and the REST API.
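For readers who prefer code to Postman, the following is a hedged sketch of how a Create Skillset call might be prepared with Python's standard library. It only builds the request object and does not send it. The endpoint pattern and the api-key header reflect my understanding of the REST API; the service name, key, skillset name, and the choice of a key phrase extraction skill are placeholder assumptions, and create_skillset_request is a hypothetical helper.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"


def create_skillset_request(service_name, api_key, skillset):
    """Build (but do not send) a PUT request that creates or updates a skillset."""
    # A knowledge store cannot be created from an empty skillset.
    if not skillset.get("skills"):
        raise ValueError("a skillset needs at least one skill")
    url = (
        f"https://{service_name}.search.windows.net"
        f"/skillsets/{skillset['name']}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(skillset).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


skillset = {
    "name": "demo-skillset",  # placeholder name
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        }
    ],
    "knowledgeStore": {
        "storageConnectionString": "<YOUR-AZURE-STORAGE-ACCOUNT-CONNECTION-STRING>",
        "projections": [{"tables": [], "objects": [], "files": []}],
    },
}

request = create_skillset_request("my-search-service", "<YOUR-ADMIN-API-KEY>", skillset)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out here because it needs a real service name, admin key, and storage connection string.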

How to connect with tools and apps

Once the enrichments exist in storage, any tool or technology that connects to Azure Blob or Table storage can be used to explore, analyze, or consume the contents. Storage Explorer and Power BI with Power Query, both mentioned earlier, are reasonable places to start.

API reference

REST API version 2020-06-30 provides knowledge store through additional definitions on skillsets. In addition to the reference, see Create a knowledge store using Postman for details on how to call the APIs.

Next steps

Knowledge store offers persistence of enriched documents, which is useful when designing a skillset or when creating new structures and content for consumption by any client application capable of accessing an Azure Storage account.

The simplest approach for creating enriched documents is through the portal, but you can also use Postman and the REST API, which is more useful if you want insight into how objects are created and referenced.

Learn more about projections, their capabilities, and how you define them in a skillset.

For a tutorial covering advanced projection concepts like slicing, inline shaping, and relationships, start with Define projections in a knowledge store.