Azure Vision multimodal embeddings skill

Important

This skill is in public preview under Supplemental Terms of Use. The 2024-05-01-Preview REST API and newer preview APIs support this feature.

The Azure Vision multimodal embeddings skill uses the multimodal embeddings API from Azure Vision in Foundry Tools to generate embeddings for text or image input.

For transactions that exceed 20 documents per indexer per day, this skill requires that you attach a billable Azure Foundry resource to your skillset. Execution of built-in skills is charged at the existing Foundry Tools Standard price. Image extraction is also billable by Azure AI Search.

Location of resources is a consideration for billing. Because you're using a preview REST API version to create a skillset that contains preview skills, you can use a keyless connection to bypass the same-region requirement. However, for key-based connections, Azure AI Search and Foundry must be in the same region. To ensure region compatibility:

Find a supported region for multimodal embeddings.
Verify the region provides AI enrichment.

The Foundry resource is used for billing purposes only. Content processing occurs on separate resources managed and maintained by Azure AI Search within the same geo. Your data is processed in the geo where your resource is deployed.

@odata.type

Microsoft.Skills.Vision.VectorizeSkill

Data limits

The input limits for the skill can be found in the Azure Vision documentation for images and text. Consider using the Text Split skill if you need data chunking for text inputs.

Applicable inputs include:

Image input file size must be less than 20 megabytes (MB). Image size must be greater than 10 x 10 pixels and less than 16,000 x 16,000 pixels.
Text input must contain between 1 and 70 words, inclusive.
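The limits above can be checked before content reaches the skill. The following Python sketch is illustrative only: the helper names are hypothetical, it assumes 1 MB means 1,048,576 bytes, and it approximates word counting by splitting on whitespace, which might not match how the service counts words.

```python
# Limits quoted from this section. The helper names are hypothetical.
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes
MIN_SIDE_PX = 10
MAX_SIDE_PX = 16_000
MAX_WORDS = 70


def image_within_limits(size_bytes: int, width: int, height: int) -> bool:
    """Return True if the image satisfies the file size and pixel limits."""
    return (
        size_bytes < MAX_IMAGE_BYTES
        and MIN_SIDE_PX < width < MAX_SIDE_PX
        and MIN_SIDE_PX < height < MAX_SIDE_PX
    )


def text_within_limits(text: str) -> bool:
    """Return True if the text has 1 to 70 whitespace-separated words."""
    # Assumption: whitespace splitting approximates the service's word count.
    return 1 <= len(text.split()) <= MAX_WORDS
```

Text that fails the check is a candidate for chunking with the Text Split skill before vectorization.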

Skill parameters

Parameters are case sensitive.

Parameter	Description
`modelVersion`	(Required) The model version (`2023-04-15`) to be passed to the Azure Vision multimodal embeddings API for generating embeddings. Vector embeddings can only be compared and matched if they're from the same model type. Images vectorized by one model won't be searchable through a different model. The latest Image Analysis API offers two models: The `2023-04-15` version, which supports text search in many languages. Azure AI Search uses this version. The legacy `2022-04-11` model, which supports only English.

Skill inputs

Skill definition inputs include name, source, and inputs. The following table provides valid values for name of the input. You can also specify recursive inputs. For more information, see the REST API reference and Create a skillset.

Input	Description
`text`	The input text to be vectorized. If you're using data chunking, the source might be `/document/pages/*`.
`image`	Complex type. Currently works only with the `/document/normalized_images` field, which the Azure blob indexer produces when `imageAction` is set to a value other than `none`.
`url`	The URL to download the image to be vectorized.
`queryString`	The query string of the URL to download the image to be vectorized. Useful if you store the URL and SAS token in separate paths.

Only one of text, image, or url/queryString can be configured for a single instance of the skill. To vectorize both images and text within the same skillset, include two instances of this skill in the skillset definition, one for each input type.
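The one-input-type-per-instance rule can be checked on a skill definition before you submit the skillset. The following Python sketch is a hypothetical client-side check, not part of any SDK. It also assumes that queryString is meaningful only alongside url, which the input descriptions imply but don't state outright.

```python
def input_mode(skill: dict) -> str:
    """Classify a skill definition as 'text', 'image', or 'url'.

    Raises ValueError when the inputs mix types or name none of them.
    """
    names = {item["name"] for item in skill.get("inputs", [])}

    modes = set()
    if "text" in names:
        modes.add("text")
    if "image" in names:
        modes.add("image")
    if names & {"url", "queryString"}:
        modes.add("url")

    if len(modes) != 1:
        raise ValueError(f"expected exactly one input type, found {sorted(modes)}")
    # Assumption: a query string without a URL has nothing to attach to.
    if modes == {"url"} and "url" not in names:
        raise ValueError("queryString requires a url input")
    return modes.pop()
```

For example, a definition whose inputs are url and queryString is classified as 'url', while one that lists both text and image raises ValueError.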

Skill outputs

Output	Description
`vector`	Output embedding array of floats for the input text or image.

Sample definition

For text input, consider a blob that has the following content:

{
    "content": "Forests, grasslands, deserts, and mountains are all part of the Patagonian landscape that spans more than a million square kilometers of South America."
}

For text inputs, your skill definition might look like this:

{ 
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill", 
    "context": "/document", 
    "modelVersion": "2023-04-15", 
    "inputs": [ 
        { 
            "name": "text", 
            "source": "/document/content" 
        } 
    ], 
    "outputs": [ 
        { 
            "name": "vector",
            "targetName": "text_vector"
        } 
    ] 
}

For image input, a second skill definition in the same skillset might look like this:

{
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
    "context": "/document/normalized_images/*",
    "modelVersion": "2023-04-15", 
    "inputs": [
        {
            "name": "image",
            "source": "/document/normalized_images/*"
        }
    ],
    "outputs": [
        {
            "name": "vector",
            "targetName": "image_vector"
        }
    ]
}

If you want to vectorize images directly from your blob storage data source rather than extract images during indexing, your skill definition should specify a URL, and perhaps a SAS token depending on storage security. For this scenario, your skill definition might look like this:

{
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
    "context": "/document",
    "modelVersion": "2023-04-15", 
    "inputs": [
        {
            "name": "url",
            "source": "/document/metadata_storage_path"
        },
        {
            "name": "queryString",
            "source": "/document/metadata_storage_sas_token"
        }
    ],
    "outputs": [
        {
            "name": "vector",
            "targetName": "image_vector"
        }
    ]
}

Sample output

For the given input, the skill produces a vectorized embedding. The output has 1,024 dimensions, which is the number of dimensions supported by the Azure Vision multimodal API.

{
  "text_vector": [
    0.018990106880664825,
    -0.0073809814639389515,
    ....
    0.021276434883475304
  ]
}
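Because the output always has 1,024 dimensions, the target vector field in the index needs a matching dimension count. The following Python sketch shows a hypothetical check on an embedding, along with an illustrative field definition. The field name content_vector and the profile name my-vector-profile are placeholders, and the property names reflect the Azure AI Search index schema as understood here, so verify them against the REST API reference for your API version.

```python
import math

EXPECTED_DIMENSIONS = 1024  # dimension count stated in this section


def is_valid_embedding(vector: list) -> bool:
    """Return True for a list of exactly 1,024 finite numbers."""
    return len(vector) == EXPECTED_DIMENSIONS and all(
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
        for value in vector
    )


# Illustrative index field sized to match the skill output.
# Assumption: names are placeholders; confirm properties in the API reference.
content_vector_field = {
    "name": "content_vector",
    "type": "Collection(Edm.Single)",
    "searchable": True,
    "dimensions": EXPECTED_DIMENSIONS,
    "vectorSearchProfile": "my-vector-profile",  # hypothetical profile name
}
```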

The output resides in memory. To send this output to a field in the search index, you must define an outputFieldMapping that maps the vectorized embedding output (which is an array) to a vector field. Assuming the skill output resides in the document's vector node, and content_vector is the field in the search index, the outputFieldMapping in the indexer should look like:

  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/vector/*",
      "targetFieldName": "content_vector"
    }
  ]
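The source path in an output field mapping depends on where the skill writes its output, which is determined by the skill's context and the output's targetName. The following Python sketch derives the mapping from a skill definition. The helper is hypothetical, it assumes the output node is created under the skill's context (with /document as the default when context is omitted), and it follows the /* suffix used in the example above. For the text sample earlier in this article, which sets targetName to text_vector, the derived path is /document/text_vector/* rather than /document/vector/*.

```python
def output_field_mapping(skill: dict, index_field: str) -> dict:
    """Build an outputFieldMappings entry for the skill's vector output."""
    output = next(o for o in skill["outputs"] if o["name"] == "vector")
    # When targetName is omitted, the node takes the output's own name.
    node = output.get("targetName", output["name"])
    # Assumption: context defaults to /document when not specified.
    context = skill.get("context", "/document").rstrip("/")
    return {
        "sourceFieldName": f"{context}/{node}/*",
        "targetFieldName": index_field,
    }


text_skill = {
    "@odata.type": "#Microsoft.Skills.Vision.VectorizeSkill",
    "context": "/document",
    "modelVersion": "2023-04-15",
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "vector", "targetName": "text_vector"}],
}

mapping = output_field_mapping(text_skill, "content_vector")
```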

To map image embeddings to the index, use index projections. The indexProjections payload might look like the following example, where image_content_vector is a field in the index that's populated with the vector produced for each item in the normalized_images array.

"indexProjections": {
    "selectors": [
        {
            "targetIndexName": "myTargetIndex",
            "parentKeyFieldName": "ParentKey",
            "sourceContext": "/document/normalized_images/*",
            "mappings": [
                {
                    "name": "image_content_vector",
                    "source": "/document/normalized_images/*/vector"
                }
            ]
        }
    ]
}