Knowledge store "projections" in Azure Cognitive Search

Azure Cognitive Search enables content enrichment through built-in cognitive skills and custom skills as part of indexing. Enrichments create new information where none previously existed: extracting information from images and detecting sentiment, key phrases, and entities in text, to name a few. Enrichments also add structure to undifferentiated text. All of these processes result in documents that make full text search more effective. In many instances, enriched documents are also useful for scenarios other than search, such as knowledge mining.

Projections, a component of knowledge store, are views of enriched documents that can be saved to physical storage for knowledge mining purposes. A projection lets you "project" your data into a shape that aligns with your needs, preserving relationships so that tools like Power BI can read the data with no additional effort.

Projections can be tabular, with data stored in rows and columns in Azure Table storage, or JSON objects stored in Azure Blob storage. You can define multiple projections of your data as it is being enriched. Multiple projections are useful when you want the same data shaped differently for individual use cases.

The knowledge store supports three types of projections:

  • Tables: For data that's best represented as rows and columns, table projections let you define a schematized shape in Table storage. Only valid JSON objects can be projected as tables. Because the enriched document can contain nodes that are not named JSON objects, use a Shaper skill or inline shaping to create a valid JSON object when projecting those nodes.

  • Objects: When you need a JSON representation of your data and enrichments, object projections are saved as blobs. Only valid JSON objects can be projected as objects. Because the enriched document can contain nodes that are not named JSON objects, use a Shaper skill or inline shaping to create a valid JSON object when projecting those nodes.

  • Files: When you need to save the images extracted from the documents, file projections let you save the normalized images to Blob storage.

To see projections defined in context, step through Create a knowledge store in REST.

Projection groups

In some cases, you will need to project your enriched data in different shapes to meet different objectives. The knowledge store allows you to define multiple groups of projections. Projection groups have two key characteristics: mutual exclusivity and relatedness.

Mutual exclusivity

All content projected into a single group is independent of data projected into other projection groups. This independence means that you can have the same data shaped differently, but the data is repeated in each projection group.

Relatedness

Projection groups allow you to project your documents across projection types while preserving the relationships between them. All content projected within a single projection group preserves the relationships within the data. Within tables, relationships are based on a generated key, and each child node retains a reference to the parent node. Across types (tables, objects, and files), relationships are preserved when a single node is projected across different types. For example, consider a scenario where you have a document containing images and text. You could project the text to tables or objects and the images to files, where the tables or objects have a column or property containing the file URL.
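To make the two characteristics concrete, the following Python sketch builds a `projections` array with two groups. It assumes that a single group may hold `tables`, `objects`, and `files` side by side; the names `Reviews`, `ReviewId`, `reviewimages`, `reviewobjects`, and the `/document/Review` path are hypothetical and chosen only for illustration.

```python
import json

# Group 1: a table and a file projection defined together, so content
# projected from the same document can stay related across the two types.
related_group = {
    "tables": [
        {
            "tableName": "Reviews",            # hypothetical table name
            "generatedKeyName": "ReviewId",    # hypothetical key column
            "source": "/document/Review",      # hypothetical shaped node
        }
    ],
    "objects": [],
    "files": [
        {"storageContainer": "reviewimages", "source": "/document/normalized_images/*"}
    ],
}

# Group 2: the same shape projected again as JSON objects. Nothing here is
# linked to group 1; the data is simply repeated in a different form.
independent_group = {
    "tables": [],
    "objects": [{"storageContainer": "reviewobjects", "source": "/document/Review"}],
    "files": [],
}

projections = [related_group, independent_group]
print(json.dumps(projections, indent=2))
```

The printed JSON is the kind of array that would sit under `knowledgeStore.projections` in a skillset definition.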

Input shaping

Getting your data into the right shape or structure, be it tables or objects, is key to using it effectively. The ability to shape or structure your data based on how you plan to access and use it is a key capability exposed as the Shaper skill within the skillset.

Projections are easier to define when you have an object in the enrichment tree that matches the schema of the projection. The Shaper skill allows you to compose an object from different nodes of the enrichment tree and parent them under a new node. It also lets you define complex types with nested objects.

When you have a new shape defined that contains all the elements you need to project out, you can use this shape as the source for your projections or as an input to another skill.
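As an illustration of input shaping, the sketch below assembles a Shaper skill definition as a Python dictionary and prints it as JSON. The skill name and every source path (`/document/metadata_title`, `/document/pages/*/keyPhrases/*`) are assumptions made for this example; only the overall pattern of composing nodes under a new `EnrichedShape` node reflects the text above.

```python
import json

shaper_skill = {
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "name": "build-enriched-shape",  # hypothetical skill name
    "context": "/document",
    "inputs": [
        # A simple input copies one node of the enrichment tree into the shape.
        {"name": "Title", "source": "/document/metadata_title"},  # assumed path
        # A nested input builds a complex type: one object per page key phrase.
        {
            "name": "KeyPhrases",
            "sourceContext": "/document/pages/*/keyPhrases/*",  # assumed path
            "inputs": [
                {"name": "KeyPhrase", "source": "/document/pages/*/keyPhrases/*"}
            ],
        },
    ],
    # The composed object is parented under a new node: /document/EnrichedShape.
    "outputs": [{"name": "output", "targetName": "EnrichedShape"}],
}

print(json.dumps(shaper_skill, indent=2))
```

The resulting `/document/EnrichedShape` node can then serve as the `source` of a projection or as an input to another skill.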

Projection slicing

When defining a projection group, a single node in the enrichment tree can be sliced into multiple related tables or objects. Adding a projection with a source path that is a child of an existing projection results in the child node being sliced out of the parent node and projected into a new yet related table or object. This technique allows you to define a single node in a Shaper skill that can be the source for all of your projections.
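The parent and child relationship between source paths can be pictured with a small Python helper. This is only a conceptual sketch based on path prefixes, not the service's actual algorithm; `find_slices` is a hypothetical function written for this illustration.

```python
def find_slices(sources):
    """Map each source path to the closest other source path that contains it."""

    def normalize(path):
        # Drop wildcard segments so "/a/*/b/*" compares as ("a", "b").
        return tuple(seg for seg in path.strip("/").split("/") if seg != "*")

    parents = {}
    for child in sources:
        child_parts = normalize(child)
        best = None
        for candidate in sources:
            cand_parts = normalize(candidate)
            is_ancestor = (
                len(cand_parts) < len(child_parts)
                and child_parts[: len(cand_parts)] == cand_parts
            )
            # Prefer the deepest ancestor when several sources contain the child.
            if is_ancestor and (best is None or len(cand_parts) > len(normalize(best))):
                best = candidate
        parents[child] = best
    return parents


sources = [
    "/document/EnrichedShape",
    "/document/EnrichedShape/*/KeyPhrases/*",
    "/document/EnrichedShape/*/Entities/*",
]
for source, parent in find_slices(sources).items():
    print(source, "is sliced from", parent)
```

A source with no parent is projected whole, except for the child nodes that other projections in the same group slice out of it.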

Table projections

Because it makes importing easier, we recommend table projections for data exploration with Power BI. Additionally, table projections allow you to change the cardinality between table relationships.

You can project a single document in your index into multiple tables, preserving the relationships. When projecting to multiple tables, the complete shape is projected into each table, unless a child node is the source of another table within the same group.

Defining a table projection

When defining a table projection within the knowledgeStore element of your skillset, start by mapping a node on the enrichment tree to the table source. Typically this node is the output of a Shaper skill that you added to the list of skills to produce a specific shape that you need to project into tables. The node you choose to project can be sliced to project into multiple tables. The tables definition is a list of tables that you want to project.

Each table requires three properties:

  • tableName: The name of the table in Azure Storage.

  • generatedKeyName: The column name for the key that uniquely identifies this row.

  • source: The node from the enrichment tree you are sourcing your enrichments from. This node is usually the output of a Shaper skill, but could be the output of any of the skills.
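The three required properties lend themselves to a quick local check before a skillset is submitted. The following Python sketch is one possible way to do that; `missing_table_properties` is a hypothetical helper and is not part of any Azure SDK.

```python
REQUIRED_TABLE_PROPERTIES = ("tableName", "generatedKeyName", "source")


def missing_table_properties(table):
    """Return the required properties that are absent or empty in a table definition."""
    return [name for name in REQUIRED_TABLE_PROPERTIES if not table.get(name)]


tables = [
    {"tableName": "MainTable", "generatedKeyName": "SomeId", "source": "/document/EnrichedShape"},
    # This definition is deliberately incomplete to show the check failing.
    {"tableName": "KeyPhrases", "source": "/document/EnrichedShape/*/KeyPhrases/*"},
]

for table in tables:
    missing = missing_table_properties(table)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(table.get("tableName", "<unnamed>"), "->", status)
```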

Here is an example of table projections.

{
    "name": "your-skillset",
    "skills": [
      …your skills
    ],
    "cognitiveServices": {
      … your cognitive services key info
    },

    "knowledgeStore": {
      "storageConnectionString": "an Azure storage connection string",
      "projections" : [
        {
          "tables": [
            { "tableName": "MainTable", "generatedKeyName": "SomeId", "source": "/document/EnrichedShape" },
            { "tableName": "KeyPhrases", "generatedKeyName": "KeyPhraseId", "source": "/document/EnrichedShape/*/KeyPhrases/*" },
            { "tableName": "Entities", "generatedKeyName": "EntityId", "source": "/document/EnrichedShape/*/Entities/*" }
          ]
        },
        {
          "objects": [ ]
        },
        {
            "files": [ ]
        }
      ]
    }
}

As demonstrated in this example, the key phrases and entities are modeled into different tables, and each row contains a reference back to its parent (MainTable).
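The reference from child rows back to the parent row can be modeled in a few lines of Python. This is a conceptual sketch only: the key format (a random UUID here) and the exact column layout are assumptions, and `project_rows` is a hypothetical helper rather than anything the service exposes.

```python
import uuid


def project_rows(shape, child_field, parent_key_name, child_key_name):
    """Split one shaped document into a parent row and related child rows."""
    parent_key = str(uuid.uuid4())  # stand-in for the generated key
    parent_row = {parent_key_name: parent_key}
    # The child field is sliced out of the parent row and gets its own table.
    parent_row.update({k: v for k, v in shape.items() if k != child_field})
    child_rows = [
        {
            child_key_name: str(uuid.uuid4()),
            parent_key_name: parent_key,  # reference back to the parent row
            child_field: value,
        }
        for value in shape.get(child_field, [])
    ]
    return parent_row, child_rows


# Hypothetical shaped document used only for illustration.
shape = {"Title": "Sample review", "KeyPhrases": ["clean room", "friendly staff"]}
main_row, key_phrase_rows = project_rows(shape, "KeyPhrases", "SomeId", "KeyPhraseId")

print(main_row)
for row in key_phrase_rows:
    print(row)
```

Joining the two tables on the shared `SomeId` column recovers which key phrases belong to which document.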

Object projections

Object projections are JSON representations of the enrichment tree that can be sourced from any node. In many cases, the same Shaper skill that creates a table projection can be used to generate an object projection.

{
    "name": "your-skillset",
    "skills": [
      …your skills
    ],
    "cognitiveServices": {
      … your cognitive services key info
    },

    "knowledgeStore": {
      "storageConnectionString": "an Azure storage connection string",
      "projections" : [
        {
          "tables": [ ]
        },
        {
          "objects": [
            {
              "storageContainer": "hotelreviews", 
              "source": "/document/hotel"
            }
          ]
        },
        {
            "files": [ ]
        }
      ]
    }
}

Generating an object projection requires a few object-specific attributes:

  • storageContainer: The blob container where the objects will be saved.
  • source: The path to the node of the enrichment tree that is the root of the projection.

File projection

File projections are similar to object projections but act only on the normalized_images collection. Like object projections, file projections are saved in the blob container, with a folder prefix that is the base64-encoded value of the document ID. File projections cannot share a container with object projections and must be projected into a different container.

{
    "name": "your-skillset",
    "skills": [
      …your skills
    ],
    "cognitiveServices": {
      … your cognitive services key info
    },

    "knowledgeStore": {
      "storageConnectionString": "an Azure storage connection string",
      "projections" : [
        {
          "tables": [ ]
        },
        {
          "objects": [ ]
        },
        {
          "files": [
            {
              "storageContainer": "reviewimages",
              "source": "/document/normalized_images/*"
            }
          ]
        }
      ]
    }
}
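The container constraint for file projections can be checked locally before a skillset is submitted. The Python sketch below flags containers shared between object and file projections, and also applies the Azure Storage container naming rules (3 to 63 characters, lowercase letters, digits, and single hyphens). `check_containers` is a hypothetical helper, and the sample container names are illustrative.

```python
import re

# Lowercase letters and digits, with single hyphens allowed between them.
_CONTAINER_NAME = re.compile(r"^[a-z0-9](?:-?[a-z0-9])*$")


def check_containers(projection_groups):
    """Return a list of problems found in the storage containers of all groups."""
    object_containers, file_containers = set(), set()
    for group in projection_groups:
        object_containers.update(p["storageContainer"] for p in group.get("objects", []))
        file_containers.update(p["storageContainer"] for p in group.get("files", []))

    problems = []
    for name in sorted(object_containers & file_containers):
        problems.append(f"'{name}' is used by both object and file projections")
    for name in sorted(object_containers | file_containers):
        if not (3 <= len(name) <= 63 and _CONTAINER_NAME.match(name)):
            problems.append(f"'{name}' is not a valid blob container name")
    return problems


projections = [
    {"objects": [{"storageContainer": "hotelreviews", "source": "/document/hotel"}]},
    {"files": [{"storageContainer": "hotelimages", "source": "/document/normalized_images/*"}]},
]
print(check_containers(projections) or "no problems found")
```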

Projection lifecycle

Your projections have a lifecycle that is tied to the source data in your data source. As your data is updated and reindexed, your projections are updated with the results of the enrichments, ensuring that your projections are eventually consistent with the data in your data source. The projections inherit the delete policy you've configured for your index. Projections are not deleted when the indexer or the search service itself is deleted.

Using projections

After the indexer is run, you can read the projected data in the containers or tables you specified through projections.

For analytics, exploration in Power BI is as simple as setting Azure Table storage as the data source. You can easily create a set of visualizations on your data using the relationships within.

Alternatively, if you need to use the enriched data in a data science pipeline, you could load the data from blobs into a Pandas DataFrame.
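A minimal sketch of the data science path might look like the following. It assumes the projected blobs have already been downloaded as JSON strings (for example with an Azure Storage client library) and that each blob holds one JSON object; the field names in the sample are hypothetical. The helpers use only the standard library, and pandas is imported lazily in `to_dataframe`.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested JSON objects into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def rows_from_blobs(blob_texts):
    """Parse each blob's JSON text and flatten it into one row."""
    return [flatten(json.loads(text)) for text in blob_texts]


def to_dataframe(rows):
    """Build a DataFrame from flattened rows (requires pandas to be installed)."""
    import pandas as pd  # imported here so the helpers above work without pandas

    return pd.DataFrame(rows)


# Hypothetical blob content standing in for a downloaded object projection.
blob_texts = ['{"hotel": {"name": "Sample Hotel", "rating": 4}, "reviewText": "Great stay"}']
rows = rows_from_blobs(blob_texts)
print(rows)
```

Calling `to_dataframe(rows)` then yields one DataFrame row per blob, with nested properties exposed as dotted column names.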

Finally, if you need to export your data from the knowledge store, Azure Data Factory has connectors to export the data and land it in the database of your choice.

Next steps

As a next step, create your first knowledge store using sample data and instructions.

For a tutorial covering advanced projection concepts like slicing, inline shaping, and relationships, start with Define projections in a knowledge store.