How to process and extract information from images in AI enrichment scenarios

Azure Cognitive Search has several capabilities for working with images and image files. During document cracking, you can use the imageAction parameter to extract text from photos or pictures that contain alphanumeric text, such as the word "STOP" on a stop sign. Other scenarios include generating a text representation of an image, such as "dandelion" for a photo of a dandelion, or the color "yellow". You can also extract metadata about the image, such as its size.

This article covers image processing in more detail and provides guidance for working with images in an AI enrichment pipeline.

Get normalized images

As part of document cracking, there is a set of indexer configuration parameters for handling image files or images embedded in files. These parameters are used to normalize images for further downstream processing. Normalizing images makes them more uniform. Large images are resized to a maximum height and width to make them consumable. For images that provide orientation metadata, image rotation is adjusted for vertical loading. Metadata adjustments are captured in a complex type created for each image.

When image processing is enabled, you cannot turn off image normalization, because skills that iterate over images expect normalized images. Enabling image normalization on an indexer requires that a skillset be attached to that indexer.

Configuration parameter     Description
imageAction                 Set to "none" if no action should be taken when embedded images or image files are encountered.
                            Set to "generateNormalizedImages" to generate an array of normalized images as part of document cracking.
                            Set to "generateNormalizedImagePerPage" to generate an array of normalized images where, for PDFs in your data source, each page is rendered to one output image. For non-PDF file types, the behavior is the same as "generateNormalizedImages".
                            For any option other than "none", the images are exposed in the normalized_images field.
                            The default is "none". This configuration is only pertinent to blob data sources, when "dataToExtract" is set to "contentAndMetadata".
                            A maximum of 1000 images is extracted from a given document. If a document contains more than 1000 images, the first 1000 are extracted and a warning is generated.
normalizedImageMaxWidth     The maximum width (in pixels) for generated normalized images. The default is 2000. The maximum value allowed is 10000.
normalizedImageMaxHeight    The maximum height (in pixels) for generated normalized images. The default is 2000. The maximum value allowed is 10000.

Note

If you set the imageAction property to anything other than "none", you cannot set the parsingMode property to anything other than "default". You may only set one of these two properties to a non-default value in your indexer configuration.

Set the parsingMode parameter to json (to index each blob as a single document) or jsonArray (if your blobs contain JSON arrays and you need each element of an array to be treated as a separate document).

The default of 2000 pixels for the maximum width and height of normalized images is based on the maximum sizes supported by the OCR skill and the Image Analysis skill. The OCR skill supports a maximum width and height of 4200 for non-English languages, and 10000 for English. If you increase the maximum limits, processing could fail on larger images, depending on your skillset definition and the language of the documents.
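As a rough illustration of the limits described above, the following Python sketch checks whether a chosen pair of maximum dimensions stays within the OCR skill's stated limits for a given language. The function and constant names are hypothetical, and treating any language code that starts with "en" as English is an assumption made for this example.

```python
# Limits as stated in this article: 10000 for English, 4200 for other languages.
OCR_MAX_DIMENSION_ENGLISH = 10000
OCR_MAX_DIMENSION_OTHER = 4200


def ocr_dimension_limit(language_code: str) -> int:
    """Return the maximum width/height the OCR skill is documented to support."""
    # Assumption: codes such as "en" or "en-US" count as English.
    is_english = language_code.lower().split("-")[0] == "en"
    return OCR_MAX_DIMENSION_ENGLISH if is_english else OCR_MAX_DIMENSION_OTHER


def fits_ocr_limits(max_width: int, max_height: int, language_code: str) -> bool:
    """True if both normalized-image limits are within the OCR skill's limit."""
    limit = ocr_dimension_limit(language_code)
    return max_width <= limit and max_height <= limit


if __name__ == "__main__":
    print(fits_ocr_limits(2000, 2000, "fr"))  # defaults are safe for any language
    print(fits_ocr_limits(5000, 5000, "fr"))  # too large for non-English OCR
    print(fits_ocr_limits(5000, 5000, "en"))  # within the English limit
```

A check like this could run before you raise normalizedImageMaxWidth or normalizedImageMaxHeight above the defaults.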

You specify the imageAction in your indexer definition as follows:

{
  //...rest of your indexer definition goes here ...
  "parameters":
  {
    "configuration": 
    {
        "dataToExtract": "contentAndMetadata",
        "imageAction": "generateNormalizedImages"
    }
  }
}
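If you generate indexer definitions programmatically, a small helper can enforce the constraints described in this section before the request is sent. The Python sketch below is one possible approach, not an official client: the function name is hypothetical, and rejecting non-positive sizes is an assumption, since this article only states the upper bound of 10000.

```python
import json

VALID_IMAGE_ACTIONS = {
    "none",
    "generateNormalizedImages",
    "generateNormalizedImagePerPage",
}
MAX_ALLOWED_DIMENSION = 10000  # upper bound stated in this article


def build_indexer_parameters(image_action="generateNormalizedImages",
                             max_width=2000,
                             max_height=2000,
                             parsing_mode="default"):
    """Build the "parameters" fragment of an indexer definition."""
    if image_action not in VALID_IMAGE_ACTIONS:
        raise ValueError(f"Unknown imageAction: {image_action!r}")
    # Only one of imageAction and parsingMode may have a non-default value.
    if image_action != "none" and parsing_mode != "default":
        raise ValueError("imageAction and parsingMode cannot both be non-default")
    for name, value in (("normalizedImageMaxWidth", max_width),
                        ("normalizedImageMaxHeight", max_height)):
        # Assumption: sizes must be positive; only the maximum is documented here.
        if value <= 0 or value > MAX_ALLOWED_DIMENSION:
            raise ValueError(f"{name} must be between 1 and {MAX_ALLOWED_DIMENSION}")

    configuration = {
        "dataToExtract": "contentAndMetadata",
        "imageAction": image_action,
    }
    if image_action != "none":
        configuration["normalizedImageMaxWidth"] = max_width
        configuration["normalizedImageMaxHeight"] = max_height
    if parsing_mode != "default":
        configuration["parsingMode"] = parsing_mode
    return {"parameters": {"configuration": configuration}}


if __name__ == "__main__":
    print(json.dumps(build_indexer_parameters(), indent=2))
```

The returned dictionary would be merged into the rest of your indexer definition before it is serialized.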

When imageAction is set to a value other than "none", the new normalized_images field contains an array of images. Each image is a complex type that has the following members:

Image member            Description
data                    BASE64-encoded string of the normalized image in JPEG format.
width                   Width of the normalized image in pixels.
height                  Height of the normalized image in pixels.
originalWidth           The original width of the image before normalization.
originalHeight          The original height of the image before normalization.
rotationFromOriginal    Counter-clockwise rotation, in degrees, that was applied to create the normalized image. A value between 0 degrees and 360 degrees. This step reads the metadata that a camera or scanner wrote into the image. Usually a multiple of 90 degrees.
contentOffset           The character offset within the content field where the image was extracted from. This field is only applicable for files with embedded images.
pageNumber              If the image was extracted or rendered from a PDF, this field contains the page number in the PDF it was extracted or rendered from, starting from 1. If the image was not from a PDF, this field is 0.

Sample value of normalized_images:

[
  {
    "data": "BASE64 ENCODED STRING OF A JPEG IMAGE",
    "width": 500,
    "height": 300,
    "originalWidth": 5000,  
    "originalHeight": 3000,
    "rotationFromOriginal": 90,
    "contentOffset": 500,
    "pageNumber": 2
  }
]
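To show how a client might consume a value shaped like the sample above, here is a minimal Python sketch that decodes the data member and keeps a few of the other members. The class and function names are hypothetical, and the check on the leading JPEG marker bytes is an extra sanity check added for this example, not something the service requires.

```python
import base64
import json
from dataclasses import dataclass
from typing import List

JPEG_MAGIC = b"\xff\xd8"  # every JPEG stream starts with these two bytes


@dataclass
class NormalizedImage:
    jpeg_bytes: bytes
    width: int
    height: int
    rotation_from_original: int
    page_number: int


def parse_normalized_images(raw_json: str) -> List[NormalizedImage]:
    """Decode a normalized_images array into Python objects."""
    images = []
    for entry in json.loads(raw_json):
        jpeg_bytes = base64.b64decode(entry["data"])
        if not jpeg_bytes.startswith(JPEG_MAGIC):
            raise ValueError("data member does not decode to a JPEG image")
        images.append(NormalizedImage(
            jpeg_bytes=jpeg_bytes,
            width=entry["width"],
            height=entry["height"],
            rotation_from_original=entry.get("rotationFromOriginal", 0),
            page_number=entry.get("pageNumber", 0),
        ))
    return images


if __name__ == "__main__":
    # A stand-in payload: a few JPEG header bytes rather than a real picture.
    fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xe0demo").decode("ascii")
    sample = json.dumps([{"data": fake_jpeg, "width": 500, "height": 300,
                          "rotationFromOriginal": 90, "pageNumber": 2}])
    for image in parse_normalized_images(sample):
        print(image.width, image.height, image.page_number, len(image.jpeg_bytes))
```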

There are two built-in cognitive skills that take images as an input: OCR and Image Analysis.

Currently, these skills only work with images generated from the document cracking step. As such, the only supported input is "/document/normalized_images".

Image Analysis skill

The Image Analysis skill extracts a rich set of visual features based on the image content. For instance, you can generate a caption from an image, generate tags, or identify celebrities and landmarks.

OCR skill

The OCR skill extracts text from image files such as JPGs, PNGs, and bitmaps. It can extract text as well as layout information. The layout information provides bounding boxes for each of the strings identified.

Embedded image scenario

A common scenario involves creating a single string containing all file contents, both text and image-origin text, by performing the following steps:

  1. Extract normalized_images.
  2. Run the OCR skill using "/document/normalized_images" as input.
  3. Merge the text representation of those images with the raw text extracted from the file. You can use the Text Merge skill to consolidate both text chunks into a single large string.

The following example skillset creates a merged_text field that contains the textual content of your document. It also includes the OCRed text from each of the embedded images.

Request body syntax

{
  "description": "Extract text from images and merge with content text to produce merged_text",
  "skills":
  [
    {
        "description": "Extract text (plain and structured) from image.",
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": "/document/normalized_images/*",
        "defaultLanguageCode": "en",
        "detectOrientation": true,
        "inputs": [
          {
            "name": "image",
            "source": "/document/normalized_images/*"
          }
        ],
        "outputs": [
          {
            "name": "text"
          }
        ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name":"text", "source": "/document/content"
        },
        {
          "name": "itemsToInsert", "source": "/document/normalized_images/*/text"
        },
        {
          "name":"offsets", "source": "/document/normalized_images/*/contentOffset" 
        }
      ],
      "outputs": [
        {
          "name": "mergedText", "targetName" : "merged_text"
        }
      ]
    }
  ]
}

Now that you have a merged_text field, you can map it as a searchable field in your indexer definition. All of the content of your files, including the text of the images, will be searchable.
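To make the offset-based merge more concrete, the following Python sketch imitates the idea behind the skillset above: each OCR result is inserted into the content string at its contentOffset, wrapped in the pre and post tags. This is only an approximation written for illustration. The function name is hypothetical, and details such as clamping out-of-range offsets and the ordering of items that share an offset are assumptions, not documented behavior of the Text Merge skill.

```python
from typing import List


def merge_text(content: str,
               items_to_insert: List[str],
               offsets: List[int],
               insert_pre_tag: str = " ",
               insert_post_tag: str = " ") -> str:
    """Insert each item into content at its offset, wrapped in the given tags."""
    if len(items_to_insert) != len(offsets):
        raise ValueError("items_to_insert and offsets must have the same length")

    # Work from the highest offset down so that earlier offsets, which refer
    # to positions in the original content, remain valid after each insertion.
    ordered = sorted(enumerate(zip(offsets, items_to_insert)),
                     key=lambda pair: (pair[1][0], pair[0]),
                     reverse=True)

    merged = content
    for _, (offset, item) in ordered:
        # Assumption: offsets outside the content are clamped to its bounds.
        position = max(0, min(offset, len(content)))
        merged = (merged[:position] + insert_pre_tag + item
                  + insert_post_tag + merged[position:])
    return merged


if __name__ == "__main__":
    text = "The sign says and the car stops."
    print(merge_text(text, ["STOP"], [13]))
```

Processing the offsets in descending order is a design choice that avoids recomputing positions after every insertion.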

Visualize bounding boxes of extracted text

Another common scenario is visualizing the layout information of search results. For example, you might want to highlight where a piece of text was found in an image as part of your search results.

Because the OCR step is performed on the normalized images, the layout coordinates are in the normalized image space. When you display the normalized image, this is generally not a problem, but in some situations you might want to display the original image. In that case, convert each of the coordinate points in the layout to the original image coordinate system.

As a helper, if you need to transform normalized coordinates to the original coordinate space, you could use the following algorithm:

        /// <summary>
        ///  Converts a point in the normalized coordinate space to the original coordinate space.
        ///  This method assumes the rotation angles are multiples of 90 degrees.
        ///  Point is assumed here to be a type with integer X and Y members, such as System.Drawing.Point.
        /// </summary>
        public static Point GetOriginalCoordinates(Point normalized,
                                    int originalWidth,
                                    int originalHeight,
                                    int width,
                                    int height,
                                    double rotationFromOriginal)
        {
            Point original = new Point();
            double angle = rotationFromOriginal % 360;

            if (angle == 0 )
            {
                original.X = normalized.X;
                original.Y = normalized.Y;
            } else if (angle == 90)
            {
                original.X = normalized.Y;
                original.Y = (width - normalized.X);
            } else if (angle == 180)
            {
                original.X = (width -  normalized.X);
                original.Y = (height - normalized.Y);
            } else if (angle == 270)
            {
                original.X = height - normalized.Y;
                original.Y = normalized.X;
            }

            // Cast to double so the ratio is not truncated by integer division.
            double scalingFactor = (angle % 180 == 0) ? (double)originalHeight / height : (double)originalHeight / width;
            original.X = (int) (original.X * scalingFactor);
            original.Y = (int)(original.Y * scalingFactor);

            return original;
        }
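OCR layout information usually describes a bounding box as several points, so in practice the conversion is applied to each corner. The Python sketch below ports the mapping from the C# helper above and applies it to a whole box. The function names are hypothetical. Unlike the C# version, it raises an error for angles that are not multiples of 90 degrees, and it omits the originalWidth argument because the C# helper never reads it.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def to_original_coordinates(point: Point,
                            original_height: int,
                            width: int,
                            height: int,
                            rotation_from_original: float) -> Point:
    """Map one point from normalized image space to original image space."""
    x, y = point
    angle = rotation_from_original % 360

    # Same case analysis as the C# helper.
    if angle == 0:
        ox, oy = x, y
    elif angle == 90:
        ox, oy = y, width - x
    elif angle == 180:
        ox, oy = width - x, height - y
    elif angle == 270:
        ox, oy = height - y, x
    else:
        raise ValueError("rotation must be a multiple of 90 degrees")

    # True division keeps the fractional part of the scaling factor.
    if angle % 180 == 0:
        scale = original_height / height
    else:
        scale = original_height / width
    return int(ox * scale), int(oy * scale)


def box_to_original(box: List[Point],
                    original_height: int,
                    width: int,
                    height: int,
                    rotation_from_original: float) -> List[Point]:
    """Convert every corner of a bounding box."""
    return [to_original_coordinates(p, original_height, width, height,
                                    rotation_from_original) for p in box]


if __name__ == "__main__":
    corners = [(100, 50), (200, 50), (200, 80), (100, 80)]
    print(box_to_original(corners, 3000, 500, 300, 0))
```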

See also