How to shape and export enrichments

Projections are the physical expression of enriched documents in a knowledge store. Effective use of enriched documents requires structure. In this article, you'll explore both structure and relationships, learning how to build out projection properties and how to relate data across the projection types you create.

To create a projection, shape the data using either a Shaper skill to create a custom object or the inline shaping syntax within a projection definition.

A data shape contains all the data you intend to project, formed as a hierarchy of nodes. This article shows several techniques for shaping data so that it can be projected into physical structures conducive to reporting, analysis, or downstream processing.

The examples presented in this article can be found in this REST API sample, which you can download and run in an HTTP client.

Introduction to projection examples

There are three types of projections:

  • Tables
  • Objects
  • Files

Table projections are stored in Azure Table storage. Object and file projections are written to blob storage. Object projections are saved as JSON files and can contain content from the source document as well as any skill outputs or enrichments. The enrichment pipeline can also extract binaries such as images; these binaries are projected as file projections. When a binary object is projected as an object projection, only the metadata associated with it is saved as a JSON blob.

To understand the intersection between data shaping and projections, we'll use the following skillset as the basis for exploring various configurations. This skillset processes raw image and text content. Projections are defined from the contents of the document and the outputs of the skills, according to the desired scenario.

Important

When experimenting with projections, it is useful to set the indexer cache property to keep costs under control. If the indexer cache is not set, editing projections causes the entire document to be enriched again. When the cache is set and only the projections are updated, skillset executions for previously enriched documents do not result in any new Cognitive Services charges.
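
As a rough sketch of what setting the cache can look like, the indexer definition below adds a cache section. The property names (cache, storageConnectionString, enableReprocessing) reflect the preview incremental enrichment feature as we understand it, so treat them as an assumption and check them against the REST API version you are using. The indexer, data source, and index names are hypothetical placeholders.

```json
{
    "name": "<hypothetical indexer name>",
    "dataSourceName": "<hypothetical data source name>",
    "targetIndexName": "<hypothetical index name>",
    "skillsetName": "azureblob-skillset",
    "cache": {
        "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=<Acct Name>;AccountKey=<Acct Key>;",
        "enableReprocessing": true
    }
}
```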

{
    "name": "azureblob-skillset",
    "description": "Skillset created from the portal. skillsetName: azureblob-skillset; contentField: merged_content; enrichmentGranularity: document; knowledgeStoreStorageAccount: confdemo;",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
            "name": "#1",
            "description": null,
            "context": "/document/merged_content",
            "categories": [
                "Person",
                "Quantity",
                "Organization",
                "URL",
                "Email",
                "Location",
                "DateTime"
            ],
            "defaultLanguageCode": "en",
            "minimumPrecision": null,
            "includeTypelessEntities": null,
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/merged_content"
                },
                {
                    "name": "languageCode",
                    "source": "/document/language"
                }
            ],
            "outputs": [
                {
                    "name": "persons",
                    "targetName": "people"
                },
                {
                    "name": "organizations",
                    "targetName": "organizations"
                },
                {
                    "name": "locations",
                    "targetName": "locations"
                },
                {
                    "name": "entities",
                    "targetName": "entities"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "name": "#2",
            "description": null,
            "context": "/document/merged_content",
            "defaultLanguageCode": "en",
            "maxKeyPhraseCount": null,
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/merged_content"
                },
                {
                    "name": "languageCode",
                    "source": "/document/language"
                }
            ],
            "outputs": [
                {
                    "name": "keyPhrases",
                    "targetName": "keyphrases"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "name": "#3",
            "description": null,
            "context": "/document",
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/merged_content"
                }
            ],
            "outputs": [
                {
                    "name": "languageCode",
                    "targetName": "language"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
            "name": "#4",
            "description": null,
            "context": "/document",
            "insertPreTag": " ",
            "insertPostTag": " ",
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/content"
                },
                {
                    "name": "itemsToInsert",
                    "source": "/document/normalized_images/*/text"
                },
                {
                    "name": "offsets",
                    "source": "/document/normalized_images/*/contentOffset"
                }
            ],
            "outputs": [
                {
                    "name": "mergedText",
                    "targetName": "merged_content"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "name": "#5",
            "description": null,
            "context": "/document/normalized_images/*",
            "textExtractionAlgorithm": "printed",
            "lineEnding": "Space",
            "defaultLanguageCode": "en",
            "detectOrientation": true,
            "inputs": [
                {
                    "name": "image",
                    "source": "/document/normalized_images/*"
                }
            ],
            "outputs": [
                {
                    "name": "text",
                    "targetName": "text"
                },
                {
                    "name": "layoutText",
                    "targetName": "layoutText"
                }
            ]
        }
    ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": "DemosCS",
        "key": "<COGNITIVE SERVICES KEY>"
    },
    "knowledgeStore": null
}

Using this skillset, with its null knowledgeStore as the basis, our first example fills in the knowledgeStore object, configured with projections that create tabular data structures we can use in other scenarios.

Projecting to tables

Projecting to tables in Azure Storage is useful for reporting and analysis using tools like Power BI. Power BI can read from tables and discover relationships based on the keys that are generated during projection. If you're trying to build a dashboard, having related data simplifies that task.

Let's build a dashboard to visualize the key phrases extracted from documents as a word cloud. To create the right data structure, add a Shaper skill to the skillset to create a custom shape that has the document-specific details and key phrases. The custom shape is called pbiShape and sits on the document root node.

Note

Table projections are Azure Storage tables, governed by the storage limits imposed by Azure Storage. For more information, see table storage limits. It is useful to know that the entity size cannot exceed 1 MB and a single property can be no bigger than 64 KB. These constraints make tables a good solution for storing a large number of small entities.

Using a Shaper skill to create a custom shape

Create a custom shape that you can project into table storage. Without a custom shape, a projection can only reference a single node (one projection per output). Creating a custom shape aggregates various elements into a new logical whole that can be projected as a single table, or sliced and distributed across a collection of tables.

In this example, the custom shape combines metadata with the identified entities and key phrases. The object is called pbiShape and is parented under /document.

Important

One purpose of shaping is to ensure that all enrichment nodes are expressed in well-formed JSON, which is required for projecting into a knowledge store. This is especially true when an enrichment tree contains nodes that are not well-formed JSON (for example, when an enrichment is parented to a primitive like a string).

Notice the last two nodes, keyPhrases and Entities. These are wrapped into a valid JSON object with the sourceContext. This is required because keyphrases and entities are enrichments on primitives and need to be converted to valid JSON before they can be projected.

{
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "name": "ShaperForTables",
    "description": null,
    "context": "/document",
    "inputs": [
        {
            "name": "metadata_storage_content_type",
            "source": "/document/metadata_storage_content_type",
            "sourceContext": null,
            "inputs": []
        },
        {
            "name": "metadata_storage_name",
            "source": "/document/metadata_storage_name",
            "sourceContext": null,
            "inputs": []
        },
        {
            "name": "metadata_storage_path",
            "source": "/document/metadata_storage_path",
            "sourceContext": null,
            "inputs": []
        },
        {
            "name": "metadata_content_type",
            "source": "/document/metadata_content_type",
            "sourceContext": null,
            "inputs": []
        },
        {
            "name": "keyPhrases",
            "source": null,
            "sourceContext": "/document/merged_content/keyphrases/*",
            "inputs": [
                {
                    "name": "KeyPhrases",
                    "source": "/document/merged_content/keyphrases/*"
                }

            ]
        },
        {
            "name": "Entities",
            "source": null,
            "sourceContext": "/document/merged_content/entities/*",
            "inputs": [
                {
                    "name": "Entities",
                    "source": "/document/merged_content/entities/*/name"
                }

            ]
        }
    ],
    "outputs": [
        {
            "name": "output",
            "targetName": "pbiShape"
        }
    ]
}
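
To make the result of this Shaper skill more concrete, the sketch below shows roughly what the pbiShape node could look like for one document. All values are hypothetical placeholders; only the property names are taken from the skill definition above. Each key phrase and entity is wrapped in its own small object, which is what makes the arrays projectable.

```json
{
    "metadata_storage_content_type": "application/pdf",
    "metadata_storage_name": "<hypothetical file name>.pdf",
    "metadata_storage_path": "<hypothetical blob path>",
    "metadata_content_type": "application/pdf",
    "keyPhrases": [
        { "KeyPhrases": "<first key phrase>" },
        { "KeyPhrases": "<second key phrase>" }
    ],
    "Entities": [
        { "Entities": "<first entity name>" },
        { "Entities": "<second entity name>" }
    ]
}
```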

Add the above Shaper skill to the skillset.

{
    "name": "azureblob-skillset",
    "description": "A friendly description of the skillset goes here.",
    "skills": [
        {
            Shaper skill goes here
        }
    ],
    "cognitiveServices":  "A key goes here",
    "knowledgeStore": {}
}

Now that we have all the data needed to project to tables, update the knowledgeStore object with the table definitions. In this example, we have three tables, defined by setting the tableName, source, and generatedKeyName properties.

"knowledgeStore" : {
    "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=<Acct Name>;AccountKey=<Acct Key>;",
    "projections": [
        {
            "tables": [
                {
                    "tableName": "pbiDocument",
                    "generatedKeyName": "Documentid",
                    "source": "/document/pbiShape"
                },
                {
                    "tableName": "pbiKeyPhrases",
                    "generatedKeyName": "KeyPhraseid",
                    "source": "/document/pbiShape/keyPhrases/*"
                },
                {
                    "tableName": "pbiEntities",
                    "generatedKeyName": "Entityid",
                    "source": "/document/pbiShape/Entities/*"
                }
            ],
            "objects": [],
            "files": []
        }
    ]
}

You can process your work by following these steps:

  1. Set the storageConnectionString property to a valid V2 general purpose storage account connection string.

  2. Update the skillset by issuing the PUT request.

  3. After updating the skillset, run the indexer.

You now have a working projection with three tables. Importing these tables into Power BI should result in Power BI auto-discovering the relationships.

Before moving on to the next example, let's revisit aspects of the table projection to understand the mechanics of slicing and relating data.

Slicing

Slicing is a technique that subdivides a whole consolidated shape into constituent parts. The outcome consists of separate but related tables that you can work with individually.

In the example, pbiShape is the consolidated shape (or enrichment node). In the projection definition, pbiShape is sliced into additional tables, which lets you pull out parts of the shape: keyPhrases and Entities. In Power BI, this is useful because multiple entities and key phrases are associated with each document, and you get more insights if you can see entities and key phrases as categorized data.

Slicing implicitly generates a relationship between the parent and child tables, using the generatedKeyName in the parent table to create a column with the same name in the child table.
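
As a rough illustration of the relationship that slicing creates, the sketch below shows how rows in the three tables might line up for a single document. It is a conceptual view, not the literal Azure Table storage layout: the key values and the file name are hypothetical placeholders, and the real tables also carry storage-specific columns.

```json
{
    "pbiDocument": [
        {
            "Documentid": "<generated key for document 1>",
            "metadata_storage_name": "<hypothetical file name>.pdf"
        }
    ],
    "pbiKeyPhrases": [
        {
            "KeyPhraseid": "<generated key for phrase 1>",
            "Documentid": "<generated key for document 1>",
            "KeyPhrases": "<first key phrase>"
        },
        {
            "KeyPhraseid": "<generated key for phrase 2>",
            "Documentid": "<generated key for document 1>",
            "KeyPhrases": "<second key phrase>"
        }
    ],
    "pbiEntities": [
        {
            "Entityid": "<generated key for entity 1>",
            "Documentid": "<generated key for document 1>",
            "Entities": "<first entity name>"
        }
    ]
}
```

The Documentid column repeated in the child rows is what a tool like Power BI can use to join key phrases and entities back to their document.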

Naming relationships

The generatedKeyName and referenceKeyName properties are used to relate data across tables or even across projection types. Each row in the child table or projection has a property pointing back to the parent. The name of the column or property in the child is the referenceKeyName from the parent. When the referenceKeyName is not provided, the service defaults it to the generatedKeyName from the parent.

Power BI relies on these generated keys to discover relationships within the tables. If you need the column in the child table named differently, set the referenceKeyName property on the parent table. One example would be to set the generatedKeyName to ID on the pbiDocument table and the referenceKeyName to DocumentID. This would result in the column that contains the document ID in the pbiEntities and pbiKeyPhrases tables being named DocumentID.
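
The naming example just described could be written along these lines. This is a sketch that reuses the three table definitions from earlier in the article and changes only the parent table's key names; the placement of referenceKeyName on the parent table follows the description above.

```json
"tables": [
    {
        "tableName": "pbiDocument",
        "generatedKeyName": "ID",
        "referenceKeyName": "DocumentID",
        "source": "/document/pbiShape"
    },
    {
        "tableName": "pbiKeyPhrases",
        "generatedKeyName": "KeyPhraseid",
        "source": "/document/pbiShape/keyPhrases/*"
    },
    {
        "tableName": "pbiEntities",
        "generatedKeyName": "Entityid",
        "source": "/document/pbiShape/Entities/*"
    }
]
```

With this definition, the pbiDocument table would carry an ID column, while the pbiKeyPhrases and pbiEntities tables would refer back to it through a column named DocumentID.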

Projecting to objects

Object projections do not have the same limitations as table projections and are better suited for projecting large documents. In this example, the entire document is sent as an object projection. Object projections are limited to a single projection in a container and cannot be sliced.

To define an object projection, use the objects array in the projections. You can generate a new shape using the Shaper skill or use inline shaping of the object projection. While the tables example demonstrated the approach of creating a shape and slicing, this example demonstrates the use of inline shaping.

Inline shaping is the ability to create a new shape in the definition of the inputs to a projection. Inline shaping creates an anonymous object that is identical to what a Shaper skill would produce (in our case, pbiShape). Inline shaping is useful if you are defining a shape that you do not plan to reuse.

The projections property is an array. This example adds a new projection instance to the array, where the knowledgeStore definition contains inline projections. When using inline projections, you can omit the Shaper skill.

"knowledgeStore" : {
        "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=<Acct Name>;AccountKey=<Acct Key>;",
        "projections": [
             {
                "tables": [ ],
                "objects": [
                    {
                        "storageContainer": "sampleobject",
                        "source": null,
                        "generatedKeyName": "myobject",
                        "sourceContext": "/document",
                        "inputs": [
                            {
                                "name": "metadata_storage_name",
                                "source": "/document/metadata_storage_name"
                            },
                            {
                                "name": "metadata_storage_path",
                                "source": "/document/metadata_storage_path"
                            },
                            {
                                "name": "content",
                                "source": "/document/content"
                            },
                            {
                                "name": "keyPhrases",
                                "source": "/document/merged_content/keyphrases/*"
                            },
                            {
                                "name": "entities",
                                "source": "/document/merged_content/entities/*/name"
                            },
                            {
                                "name": "ocrText",
                                "source": "/document/normalized_images/*/text"
                            },
                            {
                                "name": "ocrLayoutText",
                                "source": "/document/normalized_images/*/layoutText"
                            }
                        ]

                    }
                ],
                "files": []
            }
        ]
    }

Projecting to files

File projections are images that are either extracted from the source document or produced as outputs of enrichment that can be projected out of the enrichment process. File projections, similar to object projections, are implemented as blobs in Azure Storage, and contain the image.

To generate a file projection, use the files array in the projection object. This example projects all images extracted from the document to a container called samplefile.

"knowledgeStore" : {
        "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=<Acct Name>;AccountKey=<Acct Key>;",
        "projections": [
            {
                "tables": [ ],
                "objects": [ ],
                "files": [
                    {
                        "storageContainer": "samplefile",
                        "source": "/document/normalized_images/*"
                    }
                ]
            }
        ]
    }

Projecting to multiple types

A more complex scenario might require you to project content across projection types. For example, you might need to project some data, like key phrases and entities, to tables, save the OCR results of text and layout text as objects, and then project the images as files.

This example updates the skillset with the following changes:

  1. Create a table with a row for each document.
  2. Create a table related to the document table with each key phrase identified as a row in this table.
  3. Create a table related to the document table with each entity identified as a row in this table.
  4. Create an object projection with the layout text for each image.
  5. Create a file projection, projecting each extracted image.
  6. Create a cross-reference table that contains references to the document table, the object projection with the layout text, and the file projection.

These changes are reflected in the knowledgeStore definition further down.

Shape data for cross-projection

To get the shapes needed for these projections, start by adding a new Shaper skill that creates a shaped object called crossProjection.

{
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "name": "ShaperForCross",
    "description": null,
    "context": "/document",
    "inputs": [
        {
            "name": "metadata_storage_name",
            "source": "/document/metadata_storage_name",
            "sourceContext": null,
            "inputs": []
        },
        {
            "name": "keyPhrases",
            "source": null,
            "sourceContext": "/document/merged_content/keyphrases/*",
            "inputs": [
                {
                    "name": "KeyPhrases",
                    "source": "/document/merged_content/keyphrases/*"
                }

            ]
        },
        {
            "name": "entities",
            "source": null,
            "sourceContext": "/document/merged_content/entities/*",
            "inputs": [
                {
                    "name": "Entities",
                    "source": "/document/merged_content/entities/*/name"
                }

            ]
        },
        {
            "name": "images",
            "source": null,
            "sourceContext": "/document/normalized_images/*",
            "inputs": [
                {
                    "name": "image",
                    "source": "/document/normalized_images/*"
                },
                {
                    "name": "layoutText",
                    "source": "/document/normalized_images/*/layoutText"
                },
                {
                    "name": "ocrText",
                    "source": "/document/normalized_images/*/text"
                }
                ]
        }
 
    ],
    "outputs": [
        {
            "name": "output",
            "targetName": "crossProjection"
        }
    ]
}

Define table, object, and file projections

From the consolidated crossProjection object, slice the object into multiple tables, capture the OCR output as blobs, and then save the images as files (also in Blob storage).

"knowledgeStore" : {
        "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=<Acct Name>;AccountKey=<Acct Key>;",
        "projections": [
             {
                "tables": [
                    {
                        "tableName": "crossDocument",
                        "generatedKeyName": "Id",
                        "source": "/document/crossProjection"
                    },
                    {
                        "tableName": "crossEntities",
                        "generatedKeyName": "EntityId",
                        "source": "/document/crossProjection/entities/*"
                    },
                    {
                        "tableName": "crossKeyPhrases",
                        "generatedKeyName": "KeyPhraseId",
                        "source": "/document/crossProjection/keyPhrases/*"
                    },
                    {
                        "tableName": "crossReference",
                        "generatedKeyName": "CrossId",
                        "source": "/document/crossProjection/images/*"
                    }
                     
                ],
                "objects": [
                    {
                        "storageContainer": "crossobject",
                        "generatedKeyName": "crosslayout",
                        "source": null,
                        "sourceContext": "/document/crossProjection/images/*/layoutText",
                        "inputs": [
                            {
                                "name": "OcrLayoutText",
                                "source": "/document/crossProjection/images/*/layoutText"
                            }
                        ]
                    }
                ],
                "files": [
                     {
                        "storageContainer": "crossimages",
                        "generatedKeyName": "crossimages",
                        "source": "/document/crossProjection/images/*/image"
                    }
                    ]
                
            }
        ]
    }

Each object projection and file projection requires its own container name. Object projections and file projections cannot share a container.

Relationships among table, object, and file projections

This example also highlights another feature of projections. When you define multiple types of projections within the same projection object, a relationship is expressed within and across the different types (tables, objects, files). This allows you to start with a table row for a document and find all the OCR text for the images within that document in the object projection.

When building projections of different types, file and object projections are generated first, and the paths are added to the tables.

Projection groups are useful when you want to project the same data in different shapes for different needs. For example, you might have one projection group for the Power BI dashboard, and another projection group for capturing data used to train a machine learning model wrapped in a custom skill.

If you do not want the data related, define the projections in different projection objects. For example, the following snippet results in the tables being related to each other, but without relationships between the tables and the object (OCR text) projections.

"knowledgeStore" : {
        "storageConnectionString": "DefaultEndpointsProtocol=https;AccountName=<Acct Name>;AccountKey=<Acct Key>;",
        "projections": [
            {
                "tables": [
                    {
                        "tableName": "unrelatedDocument",
                        "generatedKeyName": "Documentid",
                        "source": "/document/pbiShape"
                    },
                    {
                        "tableName": "unrelatedKeyPhrases",
                        "generatedKeyName": "KeyPhraseid",
                        "source": "/document/pbiShape/keyPhrases/*"
                    }
                ],
                "objects": [
                    
                ],
                "files": []
            }, 
            {
                "tables": [],
                "objects": [
                    {
                        "storageContainer": "unrelatedocrtext",
                        "source": null,
                        "sourceContext": "/document/normalized_images/*/text",
                        "inputs": [
                            {
                                "name": "ocrText",
                                "source": "/document/normalized_images/*/text"
                            }
                        ]
                    },
                    {
                        "storageContainer": "unrelatedocrlayout",
                        "source": null,
                        "sourceContext": "/document/normalized_images/*/layoutText",
                        "inputs": [
                            {
                                "name": "ocrLayoutText",
                                "source": "/document/normalized_images/*/layoutText"
                            }
                        ]
                    }
                ],
                "files": []
            }
        ]
    }

Common issues

When defining a projection, there are a few common issues that can cause unanticipated results. Check for these issues if the output in the knowledge store isn't what you expect.

  • Not shaping string enrichments into valid JSON. When strings are enriched, for example merged_content enriched with key phrases, the enriched property is represented as a child of merged_content within the enrichment tree. The default representation is not well-formed JSON. So at projection time, make sure to transform the enrichment into a valid JSON object with a name and a value.

  • Omitting the /* at the end of a source path. If the source of a projection is /document/pbiShape/keyPhrases, the key phrases array is projected as a single object or row. Instead, set the source path to /document/pbiShape/keyPhrases/* to yield a single row or object for each of the key phrases.

  • Path syntax errors. Path selectors are case-sensitive and can lead to missing input warnings if you do not use the exact case for the selector.
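
The second issue in the list above can be seen by comparing two table definitions side by side. This is only a sketch for contrast: the table names are hypothetical and chosen to describe the expected outcome, and you would normally keep just the second definition.

```json
"tables": [
    {
        "tableName": "keyPhrasesAsSingleRow",
        "generatedKeyName": "KeyPhraseid",
        "source": "/document/pbiShape/keyPhrases"
    },
    {
        "tableName": "keyPhrasesOneRowEach",
        "generatedKeyName": "KeyPhraseid",
        "source": "/document/pbiShape/keyPhrases/*"
    }
]
```

The first source path points at the array node itself, so the whole array lands in one row. The second path ends in /*, so each element of the array becomes its own row.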

Next steps

The examples in this article demonstrate common patterns for creating projections. Now that you have a good understanding of the concepts, you are better equipped to build projections for your specific scenario.

As you explore new features, consider incremental enrichment as your next step. Incremental enrichment is based on caching, which lets you reuse any enrichments that are not otherwise affected by a skillset modification. This is especially useful for pipelines that include OCR and image analysis.

For an overview of projections, learn more about capabilities like groups and slicing, and how you define them in a skillset.