Document Extraction cognitive skill

Important

This skill is currently in public preview. Preview functionality is provided without a service level agreement, and is not recommended for production workloads. There is currently no portal or .NET SDK support.

The Document Extraction skill extracts content from a file within the enrichment pipeline. This lets you apply the document extraction step, which normally runs before skillset execution, to files that may be generated by other skills.

Note

As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to attach a billable Cognitive Services resource. Charges accrue when calling APIs in Cognitive Services, and for image extraction as part of the document-cracking stage in indexing. There are no charges for text extraction from documents.

Execution of built-in skills is charged at the existing Cognitive Services pay-in-advance price. Image extraction pricing is described on the pricing page.

@odata.type

Microsoft.Skills.Util.DocumentExtractionSkill

Skill parameters

Parameters are case-sensitive.

Inputs | Allowed values | Description
parsingMode | default, text, json | Set to default for document extraction from files that are not pure text or json. Set to text to improve performance on plain text files. Set to json to extract structured content from json files. If parsingMode is not defined explicitly, it is set to default.
dataToExtract | contentAndMetadata, allMetadata | Set to contentAndMetadata to extract all metadata and textual content from each file. Set to allMetadata to extract only the content-type specific metadata (for example, metadata unique to just .png files). If dataToExtract is not defined explicitly, it is set to contentAndMetadata.
configuration | See below. | A dictionary of optional parameters that adjust how the document extraction is performed. See the table below for descriptions of supported configuration properties.

Configuration parameter | Allowed values | Description
imageAction | none, generateNormalizedImages, generateNormalizedImagePerPage | Set to none to ignore embedded images or image files in the data set. This is the default. For image analysis using cognitive skills, set to generateNormalizedImages to have the skill create an array of normalized images as part of document cracking. This action requires that parsingMode is set to default and dataToExtract is set to contentAndMetadata. A normalized image refers to additional processing resulting in uniform image output, sized and rotated to promote consistent rendering when you include images in visual search results (for example, same-size photographs in a graph control as seen in the JFK demo). This information is generated for each image when you use this option. If you set imageAction to generateNormalizedImagePerPage, PDF files are treated differently: instead of extracting embedded images, each page is rendered as an image and normalized accordingly. Non-PDF file types are treated the same as if generateNormalizedImages were set.
normalizedImageMaxWidth | Any integer between 50 and 10000 | The maximum width (in pixels) for normalized images generated. The default is 2000.
normalizedImageMaxHeight | Any integer between 50 and 10000 | The maximum height (in pixels) for normalized images generated. The default is 2000.

Note

The default of 2000 pixels for the maximum width and height of normalized images is based on the maximum sizes supported by the OCR skill and the image analysis skill. The OCR skill supports a maximum width and height of 4200 for non-English languages, and 10000 for English. If you increase the maximum limits, processing could fail on larger images depending on your skillset definition and the language of the documents.
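The parameter rules in the tables above can be checked before a skill definition is submitted. The following is a minimal sketch in Python, not part of any SDK: `check_skill_definition` is a hypothetical helper, it encodes only the rules stated on this page, and it assumes the 50 to 10000 range is inclusive at both ends.

```python
# Hypothetical helper: checks a skill definition (as a dict) against the
# allowed values and defaults listed in the parameter tables.
ALLOWED_PARSING_MODES = {"default", "text", "json"}
ALLOWED_DATA_TO_EXTRACT = {"contentAndMetadata", "allMetadata"}
ALLOWED_IMAGE_ACTIONS = {
    "none",
    "generateNormalizedImages",
    "generateNormalizedImagePerPage",
}


def check_skill_definition(skill: dict) -> list:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    # Defaults follow the tables: default / contentAndMetadata / none / 2000.
    parsing_mode = skill.get("parsingMode", "default")
    data_to_extract = skill.get("dataToExtract", "contentAndMetadata")
    config = skill.get("configuration", {})
    image_action = config.get("imageAction", "none")

    if parsing_mode not in ALLOWED_PARSING_MODES:
        problems.append(f"parsingMode '{parsing_mode}' is not an allowed value")
    if data_to_extract not in ALLOWED_DATA_TO_EXTRACT:
        problems.append(f"dataToExtract '{data_to_extract}' is not an allowed value")
    if image_action not in ALLOWED_IMAGE_ACTIONS:
        problems.append(f"imageAction '{image_action}' is not an allowed value")

    # The page states this requirement for generateNormalizedImages only.
    if image_action == "generateNormalizedImages" and (
        parsing_mode != "default" or data_to_extract != "contentAndMetadata"
    ):
        problems.append(
            "generateNormalizedImages requires parsingMode 'default' "
            "and dataToExtract 'contentAndMetadata'"
        )

    # Assumption: the 50..10000 bounds are inclusive.
    for key in ("normalizedImageMaxWidth", "normalizedImageMaxHeight"):
        value = config.get(key, 2000)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 50 <= value <= 10000:
            problems.append(f"{key} must be an integer between 50 and 10000")

    return problems


if __name__ == "__main__":
    skill = {
        "parsingMode": "text",
        "configuration": {"imageAction": "generateNormalizedImages"},
    }
    for problem in check_skill_definition(skill):
        print(problem)
```

A check like this catches only structural mistakes. It does not predict whether a larger image limit will succeed, since that depends on the skillset and document language as described in the note.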

Skill inputs

Input name | Description
file_data | The file that content should be extracted from.

The "file_data" input must be an object defined as follows:

{
  "$type": "file",
  "data": "BASE64 encoded string of the file"
}

This file reference object can be generated in one of three ways:

  • Setting the allowSkillsetToReadFileData parameter on your indexer definition to "true". This creates a path /document/file_data that is an object representing the original file data downloaded from your blob data source. This parameter only applies to data in Blob storage.

  • Setting the imageAction parameter on your indexer definition to a value other than none. This creates an array of images that follows the required convention for input to this skill if passed individually (that is, /document/normalized_images/*).

  • Having a custom skill return a json object defined exactly as above. The $type parameter must be set to exactly file, and the data parameter must be the Base64-encoded byte array data of the file content.
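For the custom skill route, the object can be assembled from raw bytes with the standard library alone. This is a minimal sketch in Python; `make_file_reference` is a hypothetical name chosen for illustration, not an API of the service.

```python
import base64


def make_file_reference(raw_bytes: bytes) -> dict:
    """Wrap raw file bytes in the file reference shape described above."""
    return {
        # "$type" must be exactly "file".
        "$type": "file",
        # "data" is the Base64 encoding of the file's bytes, as a string.
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }


if __name__ == "__main__":
    print(make_file_reference(b"hello"))
    # prints {'$type': 'file', 'data': 'aGVsbG8='}
```

The encoded value for the bytes of "hello" is the same string used in the sample input later on this page.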

Skill outputs

Output name | Description
content | The textual content of the document.
normalized_images | When imageAction is set to a value other than none, the new normalized_images field contains an array of images. See the documentation for image extraction for more details on the output format of each image.

Sample definition

{
  "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
  "parsingMode": "default",
  "dataToExtract": "contentAndMetadata",
  "configuration": {
    "imageAction": "generateNormalizedImages",
    "normalizedImageMaxWidth": 2000,
    "normalizedImageMaxHeight": 2000
  },
  "context": "/document",
  "inputs": [
    {
      "name": "file_data",
      "source": "/document/file_data"
    }
  ],
  "outputs": [
    {
      "name": "content",
      "targetName": "content"
    },
    {
      "name": "normalized_images",
      "targetName": "normalized_images"
    }
  ]
}

Sample input

{
  "values": [
    {
      "recordId": "1",
      "data":
      {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8="
        }
      }
    }
  ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data": {
        "content": "hello",
        "normalized_images": []
      }
    }
  ]
}
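To make the shape of the sample input and output concrete, here is a toy stand-in written in Python. It is not the skill's implementation: `extract_plain_text` is a hypothetical function that only Base64-decodes each record's file_data and treats the bytes as UTF-8 text, whereas the actual skill handles many file formats, metadata, and images. It assumes every record carries a well-formed file reference object.

```python
import base64
import json


def extract_plain_text(request: dict) -> dict:
    """Toy stand-in: map each input record to an output record by recordId."""
    results = []
    for record in request["values"]:
        file_data = record["data"]["file_data"]
        if file_data.get("$type") != "file":
            raise ValueError("file_data must have $type set to exactly 'file'")
        # Decode the Base64 payload and interpret the bytes as UTF-8 text.
        text = base64.b64decode(file_data["data"]).decode("utf-8")
        results.append(
            {
                "recordId": record["recordId"],
                # No image handling in this sketch, so the array stays empty.
                "data": {"content": text, "normalized_images": []},
            }
        )
    return {"values": results}


sample_input = {
    "values": [
        {
            "recordId": "1",
            "data": {"file_data": {"$type": "file", "data": "aGVsbG8="}},
        }
    ]
}

if __name__ == "__main__":
    print(json.dumps(extract_plain_text(sample_input), indent=2))
```

For the sample input, the decoded content is "hello", which matches the sample output above.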

See also