Analyze video and audio files with Azure Media Services
Warning
Azure Media Services will be retired on June 30, 2024. For more information, see the AMS Retirement Guide.
Important
As Microsoft's Responsible AI Standards outline, Microsoft is committed to fairness, privacy, security, and transparency with respect to AI systems. To align with these standards, Azure Media Services is retiring the Video Analyzer preset on September 14, 2023. This preset currently allows you to extract multiple video and audio insights from a video file. Customers can replace their current workflows using the more advanced feature set offered by Azure Video Indexer.
Media Services lets you extract insights from your video and audio files using the audio and video analyzer presets. This article describes those presets.
There are two modes for the Audio Analyzer preset: basic and standard. See the table below for a description of the differences.
To analyze your content using Media Services v3 presets, you create a Transform and submit a Job that uses one of these presets: VideoAnalyzerPreset or AudioAnalyzerPreset.
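For example, with the Python SDK (azure-mgmt-media), a minimal sketch looks like the following. The resource group, account, transform, job, and asset names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.media import AzureMediaServices
from azure.mgmt.media.models import (
    AudioAnalyzerPreset, Job, JobInputAsset, JobOutputAsset,
    Transform, TransformOutput,
)

client = AzureMediaServices(DefaultAzureCredential(), "<subscription-id>")

# Create a Transform that runs the standard Audio Analyzer preset.
client.transforms.create_or_update(
    "myResourceGroup", "myMediaAccount", "myAudioTransform",
    Transform(outputs=[TransformOutput(
        preset=AudioAnalyzerPreset(audio_language="en-US"))]),
)

# Submit a Job against that Transform; the insights land in the output asset.
client.jobs.create(
    "myResourceGroup", "myMediaAccount", "myAudioTransform", "myJob",
    Job(input=JobInputAsset(asset_name="input-asset"),
        outputs=[JobOutputAsset(asset_name="output-asset")]),
)
```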
Note
AudioAnalyzerPreset is not supported if the storage account does not have public network access.
Compliance, Privacy, and Security
You must comply with all applicable laws in your use of Video Analyzer for Media, and you may not use Video Analyzer for Media or any other Azure service in a manner that violates the rights of others or may be harmful to others. Before uploading any videos, including any biometric data, to the Video Analyzer for Media service for processing and storage, you must have all the proper rights, including all appropriate consents, from the individual(s) in the video. For Azure's privacy obligations and handling of your data, review Azure's Privacy Statement.
Built-in presets
Media Services currently supports the following built-in analyzer presets:
Preset name | Scenario / Mode | Details |
---|---|---|
AudioAnalyzerPreset | Analyzing audio, Standard mode | The preset applies a predefined set of AI-based analysis operations, including speech transcription. Currently, the preset supports processing content with a single audio track that contains speech in a single language. Specify the language for the audio payload in the input using the BCP-47 format of 'language tag-region'. See the supported languages list below for available language codes. If the language isn't specified or is set to null, automatic language detection chooses the first language detected and continues with it for the whole file. The automatic language detection feature currently supports English, Chinese, French, German, Italian, Japanese, Spanish, Russian, and Brazilian Portuguese. It doesn't support dynamically switching between languages after the first language is detected. Automatic language detection works best with audio recordings that have clearly discernible speech. If automatic language detection fails to find the language, the transcription falls back to English. |
AudioAnalyzerPreset | Analyzing audio, Basic mode | This preset mode performs speech-to-text transcription and generates a VTT subtitle/caption file. The output of this mode includes an Insights JSON file with only the keywords, transcription, and timing information. Automatic language detection and speaker diarization are not included in this mode. The list of supported languages is identical to the Standard mode above. |
VideoAnalyzerPreset | Analyzing audio and video | Extracts insights (rich metadata) from both audio and video, and outputs a JSON format file. You can specify whether you only want to extract audio insights when processing a video file. |
FaceDetectorPreset | Detecting faces present in video | Describes the settings to be used when analyzing a video to detect all the faces present. |
Supported languages
- Arabic ('ar-BH', 'ar-EG', 'ar-IQ', 'ar-JO', 'ar-KW', 'ar-LB', 'ar-OM', 'ar-QA', 'ar-SA' and 'ar-SY')
- Brazilian Portuguese ('pt-BR')
- Chinese ('zh-CN')
- Danish ('da-DK')
- English ('en-US', 'en-GB' and 'en-AU')
- Finnish ('fi-FI')
- French ('fr-FR' and 'fr-CA')
- German ('de-DE')
- Hebrew ('he-IL')
- Hindi ('hi-IN')
- Italian ('it-IT')
- Japanese ('ja-JP')
- Korean ('ko-KR')
- Norwegian ('nb-NO')
- Persian ('fa-IR')
- Portugal Portuguese ('pt-PT')
- Russian ('ru-RU')
- Spanish ('es-ES' and 'es-MX')
- Swedish ('sv-SE')
- Thai ('th-TH')
- Turkish ('tr-TR')
AudioAnalyzerPreset standard mode
The preset enables you to extract multiple audio insights from an audio or video file.
The output includes a JSON file (with all the insights) and a VTT file for the audio transcript. This preset accepts a property that specifies the language of the input file in the form of a BCP-47 string. The audio insights include the following (a configuration sketch follows the list):
- Audio transcription: A transcript of the spoken words with timestamps. Multiple languages are supported.
- Keywords: Keywords that are extracted from the audio transcription.
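As a sketch of the language property, assuming the same azure-mgmt-media Python models as above:

```python
from azure.mgmt.media.models import AudioAnalyzerPreset

# Standard mode with the language pinned to US English (a BCP-47 tag).
preset = AudioAnalyzerPreset(audio_language="en-US", mode="Standard")

# Leaving audio_language unset falls back to automatic language detection.
auto_preset = AudioAnalyzerPreset(mode="Standard")
```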
AudioAnalyzerPreset basic mode
The preset enables you to extract multiple audio insights from an audio or video file.
The output includes a JSON file and a VTT file for the audio transcript. This preset accepts a property that specifies the language of the input file in the form of a BCP-47 string (a configuration sketch follows the list). The output includes:
- Audio transcription: A transcript of the spoken words with timestamps. Multiple languages are supported, but automatic language detection and speaker diarization are not included.
- Keywords: Keywords that are extracted from the audio transcription.
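A minimal basic-mode sketch (same assumed Python models); the language must be supplied because this mode doesn't detect it:

```python
from azure.mgmt.media.models import AudioAnalyzerPreset

# Basic mode: transcription and keywords only, no detection or diarization.
preset = AudioAnalyzerPreset(audio_language="en-US", mode="Basic")
```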
VideoAnalyzerPreset
The preset enables you to extract multiple audio and video insights from a video file. The output includes a JSON file (with all the insights), a VTT file for the video transcript, and a collection of thumbnails. This preset also accepts a BCP-47 string (representing the language of the video) as a property. The video insights include all the audio insights mentioned above and the following extra items (a configuration sketch follows the list):
- Face tracking: The time during which faces are present in the video. Each face has a face ID and a corresponding collection of thumbnails.
- Visual text: The text that's detected via optical character recognition. The text is time stamped and also used to extract keywords (in addition to the audio transcript).
- Keyframes: A collection of keyframes extracted from the video.
- Visual content moderation: The portion of the videos flagged as adult or racy in nature.
- Annotation: A result of annotating the video based on a predefined object model.
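A sketch of this preset's configuration, assuming the same Python models as above; insights_to_extract is the property that limits processing to audio-only (or video-only) insights:

```python
from azure.mgmt.media.models import VideoAnalyzerPreset

preset = VideoAnalyzerPreset(
    audio_language="en-US",
    insights_to_extract="AllInsights",  # or "AudioInsightsOnly" / "VideoInsightsOnly"
)
```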
insights.json elements
The output includes a JSON file (insights.json) with all the insights found in the video or audio. The JSON may contain the following elements:
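For instance, a short sketch that loads a downloaded copy of insights.json and reports which of the elements described below are present (the flat top-level layout and key casing follow the examples in this section):

```python
import json

with open("insights.json", encoding="utf-8") as f:
    insights = json.load(f)

for element in ("transcript", "ocr", "faces", "Shots", "labels",
                "keywords", "VisualContentModeration"):
    print(f"{element}: {len(insights.get(element, []))} item(s)")
```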
transcript
Name | Description |
---|---|
id | The line ID. |
text | The transcript itself. |
language | The transcript language. Intended to support transcripts where each line can have a different language. |
instances | A list of time ranges where this line appeared. For transcript lines, there is only one instance per line. |
Example:
"transcript": [
{
"id": 0,
"text": "Hi I'm Doug from office.",
"language": "en-US",
"instances": [
{
"start": "00:00:00.5100000",
"end": "00:00:02.7200000"
}
]
},
{
"id": 1,
"text": "I have a guest. It's Michelle.",
"language": "en-US",
"instances": [
{
"start": "00:00:02.7200000",
"end": "00:00:03.9600000"
}
]
}
]
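The preset already emits a VTT file, so this is only an illustration of how the transcript fields fit together: a hypothetical sketch that rebuilds WebVTT cues, truncating the seven-digit fractional seconds to milliseconds:

```python
def to_vtt_time(ts: str) -> str:
    # "00:00:00.5100000" -> "00:00:00.510"
    base, _, frac = ts.partition(".")
    return f"{base}.{frac[:3]:0<3}"

def transcript_to_vtt(transcript: list) -> str:
    cues = ["WEBVTT", ""]
    for line in transcript:
        for inst in line["instances"]:
            cues.append(f"{to_vtt_time(inst['start'])} --> {to_vtt_time(inst['end'])}")
            cues.append(line["text"])
            cues.append("")
    return "\n".join(cues)
```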
ocr
Name | Description |
---|---|
id | The OCR line ID. |
text | The OCR text. |
confidence | The recognition confidence. |
language | The OCR language. |
instances | A list of time ranges where this OCR appeared (the same OCR can appear multiple times). |
"ocr": [
{
"id": 0,
"text": "LIVE FROM NEW YORK",
"confidence": 0.91,
"language": "en-US",
"instances": [
{
"start": "00:00:26",
"end": "00:00:52"
}
]
},
{
"id": 1,
"text": "NOTICIAS EN VIVO",
"confidence": 0.9,
"language": "es-ES",
"instances": [
{
"start": "00:00:26",
"end": "00:00:28"
},
{
"start": "00:00:32",
"end": "00:00:38"
}
]
}
],
faces
Name | Description |
---|---|
id | The face ID. |
name | The face name. It can be 'Unknown #0', an identified celebrity, or a customer-trained person. |
confidence | The face identification confidence. |
description | A description of the celebrity. |
thumbnailId | The ID of the thumbnail of that face. |
knownPersonId | The internal ID (if it's a known person). |
referenceId | The Bing ID (if it's a Bing celebrity). |
referenceType | Currently just Bing. |
title | The title, if it's a celebrity (for example, "Microsoft's CEO"). |
imageUrl | The image URL, if it's a celebrity. |
instances | Instances where the face appeared in the given time range. Each instance also has a thumbnailsIds field. |
"faces": [{
"id": 2002,
"name": "Xam 007",
"confidence": 0.93844,
"description": null,
"thumbnailId": "00000000-aee4-4be2-a4d5-d01817c07955",
"knownPersonId": "8340004b-5cf5-4611-9cc4-3b13cca10634",
"referenceId": null,
"title": null,
"imageUrl": null,
"instances": [{
"thumbnailsIds": ["00000000-9f68-4bb2-ab27-3b4d9f2d998e",
"cef03f24-b0c7-4145-94d4-a84f81bb588c"],
"adjustedStart": "00:00:07.2400000",
"adjustedEnd": "00:00:45.6780000",
"start": "00:00:07.2400000",
"end": "00:00:45.6780000"
},
{
"thumbnailsIds": ["00000000-51e5-4260-91a5-890fa05c68b0"],
"adjustedStart": "00:10:23.9570000",
"adjustedEnd": "00:10:39.2390000",
"start": "00:10:23.9570000",
"end": "00:10:39.2390000"
}]
}]
shots
Name | Description |
---|---|
id | The shot ID. |
keyFrames | A list of key frames within the shot (each has an ID and a list of instance time ranges). Key frame instances have a thumbnailId field with the key frame's thumbnail ID. |
instances | A list of time ranges of this shot (shots have only one instance). |
"Shots": [
{
"id": 0,
"keyFrames": [
{
"id": 0,
"instances": [
{
"thumbnailId": "00000000-0000-0000-0000-000000000000",
"start": "00: 00: 00.1670000",
"end": "00: 00: 00.2000000"
}
]
}
],
"instances": [
{
"thumbnailId": "00000000-0000-0000-0000-000000000000",
"start": "00: 00: 00.2000000",
"end": "00: 00: 05.0330000"
}
]
},
{
"id": 1,
"keyFrames": [
{
"id": 1,
"instances": [
{
"thumbnailId": "00000000-0000-0000-0000-000000000000",
"start": "00: 00: 05.2670000",
"end": "00: 00: 05.3000000"
}
]
}
],
"instances": [
{
"thumbnailId": "00000000-0000-0000-0000-000000000000",
"start": "00: 00: 05.2670000",
"end": "00: 00: 10.3000000"
}
]
}
]
statistics
Name | Description |
---|---|
CorrespondenceCount | Number of correspondences in the video. |
WordCount | The number of words per speaker. |
SpeakerNumberOfFragments | The number of fragments the speaker has in a video. |
SpeakerLongestMonolog | The speaker's longest monolog. If the speaker has silences inside the monolog, they're included. Silence at the beginning and the end of the monolog is removed. |
SpeakerTalkToListenRatio | The calculation is based on the time spent on the speaker's monolog (without the silence in between) divided by the total time of the video. The time is rounded to the third decimal point (see the sketch after this table). |
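As a worked illustration of that ratio (plain Python, hypothetical numbers):

```python
def talk_to_listen_ratio(monolog_seconds: float, video_seconds: float) -> float:
    # Monolog time (internal silences excluded) divided by the total video
    # duration, rounded to the third decimal place.
    return round(monolog_seconds / video_seconds, 3)

print(talk_to_listen_ratio(95.25, 600.0))  # -> 0.159
```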
labels
Name | Description |
---|---|
id | The label ID. |
name | The label name (for example, 'Computer', 'TV'). |
language | The label name language (when translated). BCP-47 |
instances | A list of time ranges where this label appeared (a label can appear multiple times). Each instance has a confidence field. |
"labels": [
{
"id": 0,
"name": "person",
"language": "en-US",
"instances": [
{
"confidence": 1.0,
"start": "00: 00: 00.0000000",
"end": "00: 00: 25.6000000"
},
{
"confidence": 1.0,
"start": "00: 01: 33.8670000",
"end": "00: 01: 39.2000000"
}
]
},
{
"name": "indoor",
"language": "en-US",
"id": 1,
"instances": [
{
"confidence": 1.0,
"start": "00: 00: 06.4000000",
"end": "00: 00: 07.4670000"
},
{
"confidence": 1.0,
"start": "00: 00: 09.6000000",
"end": "00: 00: 10.6670000"
},
{
"confidence": 1.0,
"start": "00: 00: 11.7330000",
"end": "00: 00: 20.2670000"
},
{
"confidence": 1.0,
"start": "00: 00: 21.3330000",
"end": "00: 00: 25.6000000"
}
]
}
]
keywords
Name | Description |
---|---|
id | The keyword ID. |
text | The keyword text. |
confidence | The keyword's recognition confidence. |
language | The keyword language (when translated). |
instances | A list of time ranges where this keyword appeared (a keyword can appear multiple times). |
"keywords": [
{
"id": 0,
"text": "office",
"confidence": 1.6666666666666667,
"language": "en-US",
"instances": [
{
"start": "00:00:00.5100000",
"end": "00:00:02.7200000"
},
{
"start": "00:00:03.9600000",
"end": "00:00:12.2700000"
}
]
},
{
"id": 1,
"text": "icons",
"confidence": 1.4,
"language": "en-US",
"instances": [
{
"start": "00:00:03.9600000",
"end": "00:00:12.2700000"
},
{
"start": "00:00:13.9900000",
"end": "00:00:15.6100000"
}
]
}
]
visualContentModeration
The visualContentModeration block contains time ranges that Video Analyzer for Media found to potentially have adult content. If visualContentModeration is empty, no adult content was identified; a filtering sketch follows the JSON example below.
Videos that are found to contain adult or racy content might be available for private view only. Users can submit a request for a human review of the content, in which case the IsAdult attribute will contain the result of the human review.
Name | Description |
---|---|
id | The visual content moderation ID. |
adultScore | The adult score (from Content Moderator). |
racyScore | The racy score (from Content Moderator). |
instances | A list of time ranges where this visual content moderation appeared. |
"VisualContentModeration": [
{
"id": 0,
"adultScore": 0.00069,
"racyScore": 0.91129,
"instances": [
{
"start": "00:00:25.4840000",
"end": "00:00:25.5260000"
}
]
},
{
"id": 1,
"adultScore": 0.99231,
"racyScore": 0.99912,
"instances": [
{
"start": "00:00:35.5360000",
"end": "00:00:35.5780000"
}
]
}
]
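Finally, a hypothetical sketch that surfaces flagged ranges for review from the element above; the 0.9 cutoff is an arbitrary example, not a service default, and insights is the parsed insights.json from the earlier sketch:

```python
for item in insights.get("VisualContentModeration", []):
    if max(item["adultScore"], item["racyScore"]) >= 0.9:
        for inst in item["instances"]:
            print(f"Review {inst['start']}-{inst['end']}: "
                  f"adult={item['adultScore']}, racy={item['racyScore']}")
```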