Analyze video and audio files with Azure Media Services

Warning

Azure Media Services will be retired June 30th, 2024. For more information, see the AMS Retirement Guide.

Important

As outlined in Microsoft's Responsible AI Standards, Microsoft is committed to fairness, privacy, security, and transparency with respect to AI systems. To align with these standards, Azure Media Services is retiring the Video Analyzer preset on September 14, 2023. This preset currently allows you to extract multiple video and audio insights from a video file. Customers can replace their current workflows using the more advanced feature set offered by Azure Video Indexer.

Media Services lets you extract insights from your video and audio files using the audio and video analyzer presets. This article describes the analyzer presets used to extract insights.

The Audio Analyzer preset has two modes, basic and standard. The table under Built-in presets describes the differences.

To analyze your content using Media Services v3 presets, you create a Transform and submit a Job that uses one of these presets: VideoAnalyzerPreset or AudioAnalyzerPreset.
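As a rough sketch of what such a Transform might contain, the snippet below builds a request body with a single analyzer preset output. The property names (`@odata.type`, `audioLanguage`, `mode`, `insightsToExtract`) are assumptions based on the Media Services v3 REST API and should be checked against the API reference. The helper `build_analyzer_transform` is hypothetical, and no request is sent.

```python
import json


def build_analyzer_transform(preset, language=None, mode=None, insights=None):
    """Build a Transform request body with a single analyzer preset output.

    Assumption: the property names mirror the Media Services v3 REST API;
    verify them against the API reference before relying on this.
    """
    if preset not in ("AudioAnalyzerPreset", "VideoAnalyzerPreset"):
        raise ValueError(f"unsupported preset: {preset}")
    preset_body = {"@odata.type": f"#Microsoft.Media.{preset}"}
    if language is not None:
        preset_body["audioLanguage"] = language      # BCP-47 code such as 'en-US'
    if mode is not None:
        preset_body["mode"] = mode                   # assumed values: 'Standard', 'Basic'
    if insights is not None:
        preset_body["insightsToExtract"] = insights  # assumed VideoAnalyzerPreset option
    return {"properties": {"outputs": [{"preset": preset_body}]}}


body = build_analyzer_transform("AudioAnalyzerPreset", language="en-US", mode="Basic")
print(json.dumps(body, indent=2))
```

Leaving `language` unset corresponds to the automatic language detection case described for the standard mode.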

Note

AudioAnalyzerPreset is not supported if the storage account does not have public network access.

Compliance, Privacy, and Security

You must comply with all applicable laws in your use of Video Analyzer for Media, and you may not use Video Analyzer for Media or any other Azure service in a manner that violates the rights of others or may be harmful to others. Before uploading any videos, including any biometric data, to the Video Analyzer for Media service for processing and storage, you must have all the proper rights, including all appropriate consents, from the individual(s) in the video. For Azure's privacy obligations and handling of your data, review Azure's Privacy Statement.

Built-in presets

Media Services currently supports the following built-in analyzer presets:

Preset name | Scenario / Mode | Details
AudioAnalyzerPreset | Analyzing audio, Standard mode | The preset applies a predefined set of AI-based analysis operations, including speech transcription. Currently, the preset supports processing content with a single audio track that contains speech in a single language. Specify the language for the audio payload in the input using the BCP-47 format of 'language tag-region'. See the supported languages list below for available language codes. If the language isn't set, or is set to null, automatic language detection chooses the first language detected and uses it for the whole file. The automatic language detection feature currently supports English, Chinese, French, German, Italian, Japanese, Spanish, Russian, and Brazilian Portuguese. It doesn't support dynamically switching between languages after the first language is detected. Automatic language detection works best with audio recordings that contain clearly discernible speech. If it fails to find the language, the transcription falls back to English.
AudioAnalyzerPreset | Analyzing audio, Basic mode | This preset mode performs speech-to-text transcription and generates a VTT subtitle/caption file. The output of this mode includes an Insights JSON file that contains only the keywords, transcription, and timing information. Automatic language detection and speaker diarization are not included in this mode. The list of supported languages is identical to the Standard mode above.
VideoAnalyzerPreset | Analyzing audio and video | Extracts insights (rich metadata) from both audio and video, and outputs a JSON format file. You can specify whether you only want to extract audio insights when processing a video file.
FaceDetectorPreset | Detecting faces present in video | Describes the settings to be used when analyzing a video to detect all the faces present.

Supported languages

  • Arabic ('ar-BH', 'ar-EG', 'ar-IQ', 'ar-JO', 'ar-KW', 'ar-LB', 'ar-OM', 'ar-QA', 'ar-SA' and 'ar-SY')
  • Brazilian Portuguese ('pt-BR')
  • Chinese ('zh-CN')
  • Danish ('da-DK')
  • English ('en-US', 'en-GB' and 'en-AU')
  • Finnish ('fi-FI')
  • French ('fr-FR' and 'fr-CA')
  • German ('de-DE')
  • Hebrew ('he-IL')
  • Hindi ('hi-IN')
  • Italian ('it-IT')
  • Japanese ('ja-JP')
  • Korean ('ko-KR')
  • Norwegian ('nb-NO')
  • Persian ('fa-IR')
  • Portugal Portuguese ('pt-PT')
  • Russian ('ru-RU')
  • Spanish ('es-ES' and 'es-MX')
  • Swedish ('sv-SE')
  • Thai ('th-TH')
  • Turkish ('tr-TR')
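
The list above can be turned into a small lookup for checking a language code before submitting a job. This is only a client-side sketch: the helper name `normalize_language` is hypothetical, and the service itself remains the authority on which codes it accepts.

```python
# Language codes copied from the supported languages list above.
SUPPORTED_LANGUAGES = frozenset({
    "ar-BH", "ar-EG", "ar-IQ", "ar-JO", "ar-KW", "ar-LB", "ar-OM", "ar-QA",
    "ar-SA", "ar-SY", "pt-BR", "zh-CN", "da-DK", "en-US", "en-GB", "en-AU",
    "fi-FI", "fr-FR", "fr-CA", "de-DE", "he-IL", "hi-IN", "ko-KR", "it-IT",
    "ja-JP", "nb-NO", "fa-IR", "pt-PT", "ru-RU", "es-ES", "es-MX", "sv-SE",
    "th-TH", "tr-TR",
})


def normalize_language(code):
    """Return the 'll-RR' form of a code if it is in the list, else None."""
    parts = code.strip().split("-")
    if len(parts) != 2:
        return None
    canonical = f"{parts[0].lower()}-{parts[1].upper()}"
    return canonical if canonical in SUPPORTED_LANGUAGES else None


print(normalize_language("en-us"))  # normalized to the listed casing
print(normalize_language("nl-NL"))  # not in the list
```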

AudioAnalyzerPreset standard mode

The preset enables you to extract multiple audio insights from an audio or video file.

The output includes a JSON file (with all the insights) and a VTT file for the audio transcript. This preset accepts a property that specifies the language of the input file as a BCP-47 string. The audio insights include:

  • Audio transcription: A transcript of the spoken words with timestamps. Multiple languages are supported.
  • Keywords: Keywords that are extracted from the audio transcription.

AudioAnalyzerPreset basic mode

The preset enables you to extract multiple audio insights from an audio or video file.

The output includes a JSON file and a VTT file for the audio transcript. This preset accepts a property that specifies the language of the input file as a BCP-47 string. The output includes:

  • Audio transcription: A transcript of the spoken words with timestamps. Multiple languages are supported, but automatic language detection and speaker diarization are not included.
  • Keywords: Keywords that are extracted from the audio transcription.

VideoAnalyzerPreset

The preset enables you to extract multiple audio and video insights from a video file. The output includes a JSON file (with all the insights), a VTT file for the video transcript, and a collection of thumbnails. This preset also accepts a BCP-47 string (representing the language of the video) as a property. The video insights include all the audio insights mentioned above and the following extra items:

  • Face tracking: The time during which faces are present in the video. Each face has a face ID and a corresponding collection of thumbnails.
  • Visual text: The text that's detected via optical character recognition. The text is time stamped and also used to extract keywords (in addition to the audio transcript).
  • Keyframes: A collection of keyframes extracted from the video.
  • Visual content moderation: The portion of the videos flagged as adult or racy in nature.
  • Annotation: A result of annotating the videos based on a pre-defined object model.

insights.json elements

The output includes a JSON file (insights.json) with all the insights found in the video or audio. The JSON may contain the following elements:

transcript

Name Description
id The line ID.
text The transcript itself.
language The transcript language. Intended to support transcripts in which each line can have a different language.
instances A list of time ranges where this line appeared. A transcript line has only one instance.

Example:

"transcript": [
{
    "id": 0,
    "text": "Hi I'm Doug from office.",
    "language": "en-US",
    "instances": [
    {
        "start": "00:00:00.5100000",
        "end": "00:00:02.7200000"
    }
    ]
},
{
    "id": 1,
    "text": "I have a guest. It's Michelle.",
    "language": "en-US",
    "instances": [
    {
        "start": "00:00:02.7200000",
        "end": "00:00:03.9600000"
    }
    ]
}
]
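
To work with these time ranges numerically, the timestamps need to be converted first. The sketch below assumes the `hh:mm:ss.fffffff` layout shown in the example. Python's `datetime.strptime` accepts at most six fractional digits, so the string is split by hand instead. The helper name `to_seconds` is hypothetical.

```python
import json


def to_seconds(timestamp):
    """Convert 'hh:mm:ss' or 'hh:mm:ss.fffffff' to seconds as a float."""
    hours, minutes, seconds = (part.strip() for part in timestamp.split(":"))
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# The transcript example from above, loaded as it would be from insights.json.
insights = json.loads("""
{"transcript": [
  {"id": 0, "text": "Hi I'm Doug from office.", "language": "en-US",
   "instances": [{"start": "00:00:00.5100000", "end": "00:00:02.7200000"}]},
  {"id": 1, "text": "I have a guest. It's Michelle.", "language": "en-US",
   "instances": [{"start": "00:00:02.7200000", "end": "00:00:03.9600000"}]}
]}
""")

for line in insights["transcript"]:
    for instance in line["instances"]:
        start = to_seconds(instance["start"])
        end = to_seconds(instance["end"])
        print(f"[{start:6.2f}s - {end:6.2f}s] {line['text']}")
```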

ocr

Name Description
id The OCR line ID.
text The OCR text.
confidence The recognition confidence.
language The OCR language.
instances A list of time ranges where this OCR appeared (the same OCR can appear multiple times).

Example:

"ocr": [
    {
      "id": 0,
      "text": "LIVE FROM NEW YORK",
      "confidence": 0.91,
      "language": "en-US",
      "instances": [
        {
          "start": "00:00:26",
          "end": "00:00:52"
        }
      ]
    },
    {
      "id": 1,
      "text": "NOTICIAS EN VIVO",
      "confidence": 0.9,
      "language": "es-ES",
      "instances": [
        {
          "start": "00:00:26",
          "end": "00:00:28"
        },
        {
          "start": "00:00:32",
          "end": "00:00:38"
        }
      ]
    }
  ]
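
One possible use of the `confidence` field is to drop low-confidence OCR lines and total the time each remaining line is on screen. The sketch below uses the sample data above. The function name `confident_text` and the threshold value are illustrative assumptions, not part of the service.

```python
def to_seconds(timestamp):
    """Convert 'hh:mm:ss' or 'hh:mm:ss.fffffff' to seconds as a float."""
    hours, minutes, seconds = (part.strip() for part in timestamp.split(":"))
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def confident_text(ocr_lines, min_confidence):
    """Return (text, language, seconds on screen) for lines at or above the threshold."""
    results = []
    for line in ocr_lines:
        if line["confidence"] < min_confidence:
            continue
        visible = sum(
            to_seconds(i["end"]) - to_seconds(i["start"]) for i in line["instances"]
        )
        results.append((line["text"], line["language"], visible))
    return results


ocr = [
    {"id": 0, "text": "LIVE FROM NEW YORK", "confidence": 0.91, "language": "en-US",
     "instances": [{"start": "00:00:26", "end": "00:00:52"}]},
    {"id": 1, "text": "NOTICIAS EN VIVO", "confidence": 0.9, "language": "es-ES",
     "instances": [{"start": "00:00:26", "end": "00:00:28"},
                   {"start": "00:00:32", "end": "00:00:38"}]},
]

for text, language, seconds in confident_text(ocr, min_confidence=0.9):
    print(f"{text} ({language}): {seconds:.0f} s on screen")
```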

faces

Name Description
id The face ID.
name The face name. It can be 'Unknown #0', an identified celebrity, or a customer-trained person.
confidence The face identification confidence.
description A description of the celebrity.
thumbnailId The ID of the thumbnail of that face.
knownPersonId The internal ID (if it's a known person).
referenceId The Bing ID (if it's a Bing celebrity).
referenceType Currently just Bing.
title The title (if it's a celebrity—for example, "Microsoft's CEO").
imageUrl The image URL, if it's a celebrity.
instances Instances where the face appeared in the given time range. Each instance also has a thumbnailsIds list.

Example:

"faces": [{
	"id": 2002,
	"name": "Xam 007",
	"confidence": 0.93844,
	"description": null,
	"thumbnailId": "00000000-aee4-4be2-a4d5-d01817c07955",
	"knownPersonId": "8340004b-5cf5-4611-9cc4-3b13cca10634",
	"referenceId": null,
	"title": null,
	"imageUrl": null,
	"instances": [{
		"thumbnailsIds": ["00000000-9f68-4bb2-ab27-3b4d9f2d998e",
		"cef03f24-b0c7-4145-94d4-a84f81bb588c"],
		"adjustedStart": "00:00:07.2400000",
		"adjustedEnd": "00:00:45.6780000",
		"start": "00:00:07.2400000",
		"end": "00:00:45.6780000"
	},
	{
		"thumbnailsIds": ["00000000-51e5-4260-91a5-890fa05c68b0"],
		"adjustedStart": "00:10:23.9570000",
		"adjustedEnd": "00:10:39.2390000",
		"start": "00:10:23.9570000",
		"end": "00:10:39.2390000"
	}]
}]
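
Because each face carries a list of instances, total screen time per face can be derived by summing the instance durations. The sketch below uses only the `start` and `end` fields from the example above. The helper names are hypothetical.

```python
def to_seconds(timestamp):
    """Convert 'hh:mm:ss' or 'hh:mm:ss.fffffff' to seconds as a float."""
    hours, minutes, seconds = (part.strip() for part in timestamp.split(":"))
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def screen_time(face):
    """Total seconds a face is on screen, summed over its instances."""
    return sum(
        to_seconds(i["end"]) - to_seconds(i["start"]) for i in face["instances"]
    )


# Abbreviated copy of the faces example: only the fields used here.
faces = [{
    "id": 2002,
    "name": "Xam 007",
    "instances": [
        {"start": "00:00:07.2400000", "end": "00:00:45.6780000"},
        {"start": "00:10:23.9570000", "end": "00:10:39.2390000"},
    ],
}]

for face in faces:
    print(f"{face['name']} (id {face['id']}): {screen_time(face):.3f} s")
```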

shots

Name Description
id The shot ID.
keyFrames A list of key frames within the shot (each has an ID and a list of instance time ranges). Key frame instances have a thumbnailId field with the key frame's thumbnail ID.
instances A list of time ranges of this shot (shots have only one instance).

Example:

"Shots": [
    {
      "id": 0,
      "keyFrames": [
        {
          "id": 0,
          "instances": [
            {
              "thumbnailId": "00000000-0000-0000-0000-000000000000",
              "start": "00:00:00.1670000",
              "end": "00:00:00.2000000"
            }
          ]
        }
      ],
      "instances": [
        {
          "thumbnailId": "00000000-0000-0000-0000-000000000000",
          "start": "00:00:00.2000000",
          "end": "00:00:05.0330000"
        }
      ]
    },
    {
      "id": 1,
      "keyFrames": [
        {
          "id": 1,
          "instances": [
            {
              "thumbnailId": "00000000-0000-0000-0000-000000000000",
              "start": "00:00:05.2670000",
              "end": "00:00:05.3000000"
            }
          ]
        }
      ],
      "instances": [
        {
          "thumbnailId": "00000000-0000-0000-0000-000000000000",
          "start": "00:00:05.2670000",
          "end": "00:00:10.3000000"
        }
      ]
    }
  ]

statistics

Name Description
CorrespondenceCount Number of correspondences in the video.
WordCount The number of words per speaker.
SpeakerNumberOfFragments The number of fragments the speaker has in a video.
SpeakerLongestMonolog The speaker's longest monolog. Silences inside the monolog are included. Silence at the beginning and the end of the monolog is removed.
SpeakerTalkToListenRatio The time spent on the speaker's monolog (without the silence in between) divided by the total time of the video. The time is rounded to three decimal places.
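
The SpeakerTalkToListenRatio description above amounts to a single division. The sketch below is a worked example with made-up numbers. It assumes the rounding applies to the resulting ratio, which the description leaves open, and the function name is hypothetical.

```python
def talk_to_listen_ratio(speaking_seconds, video_seconds):
    """Speaking time (silences excluded) divided by total video time.

    Assumption: the result, not the inputs, is rounded to three decimals.
    """
    if video_seconds <= 0:
        raise ValueError("video duration must be positive")
    return round(speaking_seconds / video_seconds, 3)


# Hypothetical numbers: a speaker talks for 42.5 s in a 120 s video.
ratio = talk_to_listen_ratio(42.5, 120.0)
print(ratio)  # 42.5 / 120 = 0.35416..., rounded to 0.354
```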

labels

Name Description
id The label ID.
name The label name (for example, 'Computer', 'TV').
language The language of the label name (when translated), as a BCP-47 code.
instances A list of time ranges where this label appeared (a label can appear multiple times). Each instance has a confidence field.

Example:

"labels": [
    {
      "id": 0,
      "name": "person",
      "language": "en-US",
      "instances": [
        {
          "confidence": 1.0,
          "start": "00:00:00.0000000",
          "end": "00:00:25.6000000"
        },
        {
          "confidence": 1.0,
          "start": "00:01:33.8670000",
          "end": "00:01:39.2000000"
        }
      ]
    },
    {
      "id": 1,
      "name": "indoor",
      "language": "en-US",
      "instances": [
        {
          "confidence": 1.0,
          "start": "00:00:06.4000000",
          "end": "00:00:07.4670000"
        },
        {
          "confidence": 1.0,
          "start": "00:00:09.6000000",
          "end": "00:00:10.6670000"
        },
        {
          "confidence": 1.0,
          "start": "00:00:11.7330000",
          "end": "00:00:20.2670000"
        },
        {
          "confidence": 1.0,
          "start": "00:00:21.3330000",
          "end": "00:00:25.6000000"
        }
      ]
    }
  ]
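
Since a label can appear in several time ranges, a common question is which labels are active at a given moment. The sketch below answers that for an abbreviated copy of the labels example. The function name `labels_at` is hypothetical, and the timestamp parser strips stray whitespace so it tolerates loosely formatted values.

```python
def to_seconds(timestamp):
    """Convert 'hh:mm:ss.fffffff' to seconds, ignoring stray spaces."""
    hours, minutes, seconds = (part.strip() for part in timestamp.split(":"))
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def labels_at(labels, second):
    """Names of labels that have an instance covering the given second."""
    active = []
    for label in labels:
        for inst in label["instances"]:
            if to_seconds(inst["start"]) <= second <= to_seconds(inst["end"]):
                active.append(label["name"])
                break  # one matching instance is enough for this label
    return active


# Abbreviated from the example above (first two 'indoor' instances only).
labels = [
    {"id": 0, "name": "person", "instances": [
        {"confidence": 1.0, "start": "00:00:00.0000000", "end": "00:00:25.6000000"},
        {"confidence": 1.0, "start": "00:01:33.8670000", "end": "00:01:39.2000000"}]},
    {"id": 1, "name": "indoor", "instances": [
        {"confidence": 1.0, "start": "00:00:06.4000000", "end": "00:00:07.4670000"},
        {"confidence": 1.0, "start": "00:00:09.6000000", "end": "00:00:10.6670000"}]},
]

print(labels_at(labels, 7.0))
print(labels_at(labels, 95.0))
```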

keywords

Name Description
id The keyword ID.
text The keyword text.
confidence The keyword's recognition confidence.
language The keyword language (when translated).
instances A list of time ranges where this keyword appeared (a keyword can appear multiple times).

Example:

"keywords": [
{
    "id": 0,
    "text": "office",
    "confidence": 1.6666666666666667,
    "language": "en-US",
    "instances": [
    {
        "start": "00:00:00.5100000",
        "end": "00:00:02.7200000"
    },
    {
        "start": "00:00:03.9600000",
        "end": "00:00:12.2700000"
    }
    ]
},
{
    "id": 1,
    "text": "icons",
    "confidence": 1.4,
    "language": "en-US",
    "instances": [
    {
        "start": "00:00:03.9600000",
        "end": "00:00:12.2700000"
    },
    {
        "start": "00:00:13.9900000",
        "end": "00:00:15.6100000"
    }
    ]
}
]

visualContentModeration

The visualContentModeration block contains time ranges that Video Analyzer for Media found to potentially contain adult content. If visualContentModeration is empty, no adult content was identified.

Videos that are found to contain adult or racy content might be available for private view only. Users can submit a request for a human review of the content, in which case the IsAdult attribute will contain the result of the human review.

Name Description
id The visual content moderation ID.
adultScore The adult score (from content moderation).
racyScore The racy score (from content moderation).
instances A list of time ranges where this visual content moderation appeared.

Example:

"VisualContentModeration": [
{
    "id": 0,
    "adultScore": 0.00069,
    "racyScore": 0.91129,
    "instances": [
    {
        "start": "00:00:25.4840000",
        "end": "00:00:25.5260000"
    }
    ]
},
{
    "id": 1,
    "adultScore": 0.99231,
    "racyScore": 0.99912,
    "instances": [
    {
        "start": "00:00:35.5360000",
        "end": "00:00:35.5780000"
    }
    ]
}
]
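
A consumer of this block will typically compare the scores against its own cut-offs. The sketch below flags time ranges from the example above. The 0.5 thresholds and the function name `flag_moderation` are illustrative assumptions and not values defined by the service.

```python
def flag_moderation(entries, adult_threshold=0.5, racy_threshold=0.5):
    """Return (start, end, reasons) for every instance of a flagged entry."""
    flagged = []
    for entry in entries:
        reasons = []
        if entry["adultScore"] >= adult_threshold:
            reasons.append("adult")
        if entry["racyScore"] >= racy_threshold:
            reasons.append("racy")
        if reasons:
            for inst in entry["instances"]:
                flagged.append((inst["start"], inst["end"], reasons))
    return flagged


moderation = [
    {"id": 0, "adultScore": 0.00069, "racyScore": 0.91129,
     "instances": [{"start": "00:00:25.4840000", "end": "00:00:25.5260000"}]},
    {"id": 1, "adultScore": 0.99231, "racyScore": 0.99912,
     "instances": [{"start": "00:00:35.5360000", "end": "00:00:35.5780000"}]},
]

for start, end, reasons in flag_moderation(moderation):
    print(f"{start} to {end}: {', '.join(reasons)}")
```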