Create a batch transcription

With batch transcriptions, you submit audio data in a batch. The service transcribes the audio data and stores the results in a storage container. You can then retrieve the results from the storage container.

Batch transcription completion can take several minutes to hours, depending on the size of the audio data and the number of files submitted. Even the same size of audio data can take different amounts of time to transcribe, depending on service load and other factors. The service doesn't provide a way to estimate the time it takes to transcribe a batch of audio data.

Tip

If you need consistently fast transcription for audio files less than 2 hours long and less than 300 MB in size, consider using the fast transcription API instead.

Prerequisites

You need an Azure Speech resource.

Create a transcription job

To create a batch transcription job, use the Transcriptions - Submit operation of the speech to text REST API. Construct the request body according to the following instructions:

You must set either the contentContainerUrl or contentUrls property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
Set the required locale property. This value should match the expected locale of the audio data to transcribe. You can't change the locale later.
Set the required displayName property. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later.
Set the required timeToLiveHours property. This property specifies how long the transcription is kept in the system after it completes. The shortest supported duration is 6 hours; the longest is 31 days. The recommended value is 48 hours (two days) when data is consumed directly.
Optionally, to use a model other than the base model, set the model property to the model ID. For more information, see Use a custom model.
Optionally, set the wordLevelTimestampsEnabled property to true to enable word-level timestamps in the transcription results. The default value is false.
Optionally, set the languageIdentification property. Language identification is used to identify languages spoken in audio when compared against a list of supported languages. If you set the languageIdentification property, then you must also set languageIdentification.candidateLocales with candidate locales.

For more information, see Request configuration options.
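The rules above can be sketched as a small Python helper that assembles the request body and rejects common mistakes before anything is sent. This is an illustration only: the function name build_transcription_body and its parameters are hypothetical and not part of any SDK. Treating "both content properties set" as an error is an assumption based on the "either ... or" wording above. The 6-hour and 31-day bounds come from the timeToLiveHours description.

```python
MIN_TTL_HOURS = 6          # shortest supported duration
MAX_TTL_HOURS = 31 * 24    # longest supported duration (31 days)


def build_transcription_body(display_name, locale, time_to_live_hours,
                             content_urls=None, content_container_url=None,
                             model_uri=None, word_level_timestamps=False):
    """Assemble a Transcriptions - Submit request body (hypothetical helper)."""
    # Assumption: exactly one audio source is expected, never both or neither.
    if bool(content_urls) == bool(content_container_url):
        raise ValueError("Set exactly one of contentUrls or contentContainerUrl.")
    if not display_name or not locale:
        raise ValueError("displayName and locale are required.")
    if not MIN_TTL_HOURS <= time_to_live_hours <= MAX_TTL_HOURS:
        raise ValueError("timeToLiveHours must be between 6 hours and 31 days.")

    # Root level: metadata about the job itself.
    body = {"locale": locale, "displayName": display_name}
    if content_urls:
        body["contentUrls"] = list(content_urls)
    else:
        body["contentContainerUrl"] = content_container_url
    if model_uri:
        body["model"] = {"self": model_uri}

    # Inside "properties": options that control transcription behavior.
    body["properties"] = {
        "wordLevelTimestampsEnabled": word_level_timestamps,
        "timeToLiveHours": time_to_live_hours,
    }
    return body
```

For example, build_transcription_body("My Transcription", "en-US", 48, content_urls=["https://crbn.us/hello.wav"]) returns a dictionary that can be serialized with json.dumps and used as the request body.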

Make an HTTP POST request that uses the URI as shown in the following Transcriptions - Submit example.

Replace YourSpeechResourceKey with your Speech resource key.
Replace YourResourceName with your Speech resource name.
Set the request body properties as previously described.

curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": null,
  "properties": {
    "wordLevelTimestampsEnabled": true,
    "languageIdentification": {
      "candidateLocales": [
        "en-US", "de-DE", "es-ES"
      ],
      "mode": "Continuous"
    },
    "timeToLiveHours": 48
  }
}'  "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/transcriptions:submit?api-version=2024-11-15"
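If you prefer to submit the same request from Python, the following sketch uses only the standard library. It mirrors the curl example: same endpoint, same headers, same API version. The function names build_submit_request and submit are illustrative, and the host suffix is copied from the example URL above. Adjust both for your environment.

```python
import json
import urllib.request

API_VERSION = "2024-11-15"


def build_submit_request(resource_name, speech_key, body):
    """Build (but don't send) the Transcriptions - Submit POST request."""
    url = (
        f"https://{resource_name}.cognitiveservices.azure.cn"
        f"/speechtotext/transcriptions:submit?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": speech_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request):
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending makes it possible to inspect the URL, headers, and payload before any network call is made.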

You should receive a response body in the following format:

{
  "self": "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/transcriptions/788a1f24-f980-4809-8978-e5cf41f77b35?api-version=2024-11-15",
  "displayName": "My Transcription",
  "locale": "en-US",
  "createdDateTime": "2025-05-24T03:20:39Z",
  "lastActionDateTime": "2025-05-24T03:20:39Z",
  "links": {
    "files": "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/transcriptions/788a1f24-f980-4809-8978-e5cf41f77b35/files?api-version=2024-11-15"
  },
  "properties": {
    "wordLevelTimestampsEnabled": true,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "languageIdentification": {
      "candidateLocales": [
        "en-US",
        "de-DE",
        "es-ES"
      ],
      "mode": "Continuous"
    }
  },
  "status": "NotStarted"
}

The top-level self property in the response body is the transcription's URI. Use this URI to get details such as the URIs of the transcription result files and transcription report files. You also use this URI to update or delete a transcription.

You can query the status of your transcriptions with the Transcriptions - Get operation.
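Because completion time can't be estimated, a client typically polls the transcription's URI until the job finishes. The following is a minimal sketch of such a loop. It assumes that Succeeded and Failed are the terminal status values (only NotStarted appears in the example response above), and it takes a caller-supplied fetch_transcription function that performs the Transcriptions - Get call and returns the parsed JSON. All names here are illustrative.

```python
import time

# Assumption: these are the terminal values of the "status" property.
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_transcription(fetch_transcription, poll_seconds=30,
                           max_polls=120, sleep=time.sleep):
    """Poll until the transcription reaches a terminal status.

    fetch_transcription: zero-argument callable that returns the parsed
    JSON body of a Transcriptions - Get response.
    """
    for _ in range(max_polls):
        transcription = fetch_transcription()
        if transcription.get("status") in TERMINAL_STATUSES:
            return transcription
        # Not finished yet: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError("Transcription did not finish within the polling budget.")
```

Passing the fetch and sleep functions as arguments keeps the loop independent of any particular HTTP client and makes it straightforward to test without network access.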

After you retrieve the results, call Transcriptions - Delete regularly to remove completed transcriptions from the service. Alternatively, set the timeToLiveHours property to ensure the eventual deletion of the results.

Tip

You can also try the Batch Transcription API using Python, C#, or Node.js on GitHub.

To create a transcription, use the spx batch transcription create command. Construct the request parameters according to the following instructions:

Set the required content parameter. You can specify a comma delimited list of individual files or the URL for an entire container. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
Set the required language parameter. This value should match the expected locale of the audio data to transcribe. You can't change the locale later. The Speech CLI language parameter corresponds to the locale property in the JSON request and response.
Set the required name parameter. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later. The Speech CLI name parameter corresponds to the displayName property in the JSON request and response.
Set the required api-version parameter to v3.2. The Speech CLI doesn't support version 2024-11-15 or later yet, so you must use v3.2 for now.
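If you drive the Speech CLI from a script, the parameter mapping above can be sketched as a helper that builds the argument list. The function name build_spx_create_args is hypothetical. It uses only the flags named in this section (--api-version, --name, --language, and --content) and assumes that spx is installed and on the PATH when the command is eventually run.

```python
def build_spx_create_args(name, language, content_urls, api_version="v3.2"):
    """Return the argv list for 'spx batch transcription create'."""
    if not content_urls:
        raise ValueError("At least one content URL is required.")
    return [
        "spx", "batch", "transcription", "create",
        "--api-version", api_version,        # the CLI requires v3.2 for now
        "--name", name,                      # maps to displayName in the JSON
        "--language", language,              # maps to locale in the JSON
        "--content", ",".join(content_urls), # comma-delimited list of files
    ]
```

The resulting list can be passed to subprocess.run(args, check=True). Passing a list rather than a single string avoids shell quoting problems with names that contain spaces.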

Here's an example Speech CLI command that creates a transcription job:

spx batch transcription create --api-version v3.2 --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav

You should receive a response body in the following format:

{
  "self": "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/v3.2/transcriptions/bbbbcccc-1111-dddd-2222-eeee3333ffff",
  "model": {
    "self": "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/v3.2/models/base/ccccdddd-2222-eeee-3333-ffff4444aaaa"
  },
  "links": {
    "files": "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/v3.2/transcriptions/bbbbcccc-1111-dddd-2222-eeee3333ffff/files"
  },
  "properties": {
    "diarizationEnabled": false,
    "wordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  },
  "lastActionDateTime": "2025-05-24T03:20:39Z",
  "status": "NotStarted",
  "createdDateTime": "2025-05-24T03:20:39Z",
  "locale": "en-US",
  "displayName": "My Transcription",
  "description": ""
}

For Speech CLI help with transcriptions, run the following command:

spx help batch transcription

Request configuration options

Here are some property options to configure a transcription when you call the Transcriptions - Submit operation. You can find more examples on the same page, such as creating a transcription with language identification.

The request body has two distinct levels. Misplacing a property causes the service to silently ignore it or return a validation error.

Root level: Metadata that describes the transcription job itself (displayName, locale, model, contentUrls, contentContainerUrl).
Inside properties: Options that control transcription behavior. Wrap these in a "properties": { } object.

Important

destinationContainerUrl belongs inside the properties object, not at the root level of the request body. Placing it at the root causes the service to ignore it, and transcription results are silently written to the Azure-managed container instead.

The following example shows the correct structure:

{
  "contentUrls": ["https://..."],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": null,
  "properties": {
    "destinationContainerUrl": "https://<storage>.blob.core.chinacloudapi.cn/<container>?<SAS>",
    "wordLevelTimestampsEnabled": true,
    "timeToLiveHours": 48
  }
}
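Because a misplaced property can be silently ignored, it can help to check placement before you submit. The following sketch flags properties that sit at the wrong level. The two key sets are taken from the property table in this section, and the function name find_misplaced_keys is illustrative.

```python
# Properties that belong at the root of the request body.
ROOT_KEYS = {
    "contentContainerUrl", "contentUrls", "displayName", "locale", "model",
}

# Properties that belong inside the "properties" object.
PROPERTIES_KEYS = {
    "channels", "destinationContainerUrl", "diarization",
    "diarizationEnabled", "displayFormWordLevelTimestampsEnabled",
    "languageIdentification", "profanityFilterMode", "punctuationMode",
    "timeToLiveHours", "wordLevelTimestampsEnabled",
}


def find_misplaced_keys(body):
    """Report known properties that appear at the wrong level."""
    nested = body.get("properties") or {}
    return {
        "move_into_properties": sorted(k for k in body if k in PROPERTIES_KEYS),
        "move_to_root": sorted(k for k in nested if k in ROOT_KEYS),
    }
```

A body with destinationContainerUrl at the root is reported under move_into_properties. A correctly structured body yields two empty lists.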

Property	Location in request body	Description
`contentContainerUrl`	Root level	You can submit individual audio files or a whole storage container. You must specify the audio data location by using either the `contentContainerUrl` or `contentUrls` property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription. This property isn't returned in the response.
`contentUrls`	Root level	You can submit individual audio files or a whole storage container. You must specify the audio data location by using either the `contentContainerUrl` or `contentUrls` property. For more information, see Locate audio files for batch transcription. This property isn't returned in the response.
`displayName`	Root level	The name of the batch transcription. Choose a name that you can refer to later. The display name doesn't have to be unique. This property is required.
`locale`	Root level	The locale of the batch transcription. This value should match the expected locale of the audio data to transcribe. The locale can't be changed later. This property is required.
`model`	Root level	You can set the `model` property to use a specific base model or custom speech model. If you don't specify the `model`, the default base model for the locale is used. For more information, see Use a custom model and Use a Whisper model.
`channels`	Inside `properties`	An array of channel numbers to process. Channels `0` and `1` are transcribed by default.
`destinationContainerUrl`	Inside `properties`	The result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. When the transcription job is deleted, the transcription result data is also deleted. For more information, such as the supported security scenarios, see Specify a destination container URL.
`diarization`	Inside `properties`	Indicates that the Speech service should attempt diarization analysis on the input, which is expected to be a mono channel that contains multiple voices. Diarization is the process of separating speakers in audio data; the batch pipeline can recognize and separate multiple speakers on mono channel recordings. The feature isn't available with stereo recordings. Use this property when you expect three or more speakers, and specify the minimum and maximum number of people who might be speaking. For two speakers, setting the `diarizationEnabled` property to `true` is enough. When you use `diarization`, you must also set the `diarizationEnabled` property to `true`. The maximum number of speakers must be less than 36 and greater than or equal to the `minCount` property. The transcription file contains a `speaker` entry for each transcribed phrase. When this property is selected, source audio length can't exceed 240 minutes per file. For an example of the property usage, see Transcriptions - Submit. Note: This property is only available with Speech to text REST API version 3.1 and later. If you set this property with any previous version, such as version 3.0, it's ignored and only two speakers are identified.
`diarizationEnabled`	Inside `properties`	Specifies that the Speech service should attempt diarization analysis on the input, which is expected to be a mono channel that contains two voices. The default value is `false`. For three or more voices, you also need to use the `diarization` property. Use only with Speech to text REST API version 3.1 and later. When this property is selected, source audio length can't exceed 240 minutes per file.
`displayFormWordLevelTimestampsEnabled`	Inside `properties`	Specifies whether to include word-level timestamps on the display form of the transcription results. The results are returned in the `displayWords` property of the transcription file. The default value is `false`. Note: This property is only available with Speech to text REST API version 3.1 and later.
`languageIdentification`	Inside `properties`	Language identification is used to identify languages spoken in audio when compared against a list of supported languages. If you set the `languageIdentification` property, then you must also set its enclosed `candidateLocales` property.
`languageIdentification.candidateLocales`	Inside `properties`	The candidate locales for language identification, such as `"properties": { "languageIdentification": { "candidateLocales": ["en-US", "de-DE", "es-ES"]}}`. A minimum of two and a maximum of ten candidate locales, including the main locale for the transcription, is supported.
`profanityFilterMode`	Inside `properties`	Specifies how to handle profanity in recognition results. Accepted values are `None` to disable profanity filtering, `Masked` to replace profanity with asterisks, `Removed` to remove all profanity from the result, or `Tags` to add profanity tags. The default value is `Masked`.
`punctuationMode`	Inside `properties`	Specifies how to handle punctuation in recognition results. Accepted values are `None` to disable punctuation, `Dictated` to imply explicit (spoken) punctuation, `Automatic` to let the decoder deal with punctuation, or `DictatedAndAutomatic` to use dictated and automatic punctuation. The default value is `DictatedAndAutomatic`. This property isn't applicable for Whisper models.
`timeToLiveHours`	Inside `properties`	This required property specifies how long the transcription is kept in the system after it completes. Once the transcription reaches the time to live after completion (successful or failed), it's automatically deleted. The shortest supported duration is 6 hours; the longest is 31 days. The recommended value is 48 hours (two days) when data is consumed directly. As an alternative, you can call Transcriptions - Delete regularly after you retrieve the transcription results.
`wordLevelTimestampsEnabled`	Inside `properties`	Specifies whether word-level timestamps should be included in the output. The default value is `false`. This property isn't applicable for Whisper models. Whisper is a display-only model, so the lexical field isn't populated in the transcription.
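Two of the numeric limits in the table are easy to get wrong: the number of candidate locales for language identification and the speaker counts for diarization. The following sketch checks both before a request is built. The function names are illustrative, and the checks encode only the limits stated in the table (two to ten candidate locales; a maximum speaker count that is less than 36 and at least the minimum count).

```python
def check_candidate_locales(candidate_locales):
    """Validate the languageIdentification candidate locale list."""
    # Drop duplicates while preserving order.
    unique = list(dict.fromkeys(candidate_locales))
    if not 2 <= len(unique) <= 10:
        raise ValueError("Provide between two and ten candidate locales.")
    return unique


def check_speaker_counts(min_count, max_count):
    """Validate the minimum and maximum speaker counts for diarization."""
    if max_count >= 36:
        raise ValueError("The maximum number of speakers must be less than 36.")
    if max_count < min_count:
        raise ValueError("The maximum must be greater than or equal to minCount.")
    return min_count, max_count
```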

For Speech CLI help with transcription configuration options, run the following command:

spx help batch transcription create advanced

Use a custom model

Batch transcription uses the default base model for the locale that you specify. You don't need to set any properties to use the default base model.

Optionally, you can modify the previous create transcription example by setting the model property to use a specific base model or custom speech model.

curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": {
    "self": "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/models/base/ccccdddd-2222-eeee-3333-ffff4444aaaa"
  },
  "properties": {
    "wordLevelTimestampsEnabled": true,
    "timeToLiveHours": 48
  }
}'  "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/transcriptions:submit?api-version=2024-11-15"

spx batch transcription create --api-version v3.2 --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav --model "https://YourResourceName.cognitiveservices.azure.cn/speechtotext/v3.2/models/base/ccccdddd-2222-eeee-3333-ffff4444aaaa"

To use a custom speech model for batch transcription, you need the model's URI. You can retrieve the model location when you create or get a model: the top-level self property in that response body is the model's URI. For more information, see the JSON response example in Create a model.

Tip

A hosted deployment endpoint isn't required to use custom speech with the batch transcription service. You can conserve resources if you use the custom speech model only for batch transcription.

Batch transcription requests for expired models fail with a 4xx error. Set the model property to a base model or custom model that isn't expired. Otherwise, omit the model property so that the latest base model is always used. For more information, see Choose a model and Custom speech model lifecycle.
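The two choices described above (pin a specific model, or omit the property to get the latest base model) can be sketched as a helper that sets or removes the model property on an existing request body. The function name with_model is hypothetical. The {"self": ...} shape follows the request example in this section.

```python
import copy


def with_model(body, model_uri=None):
    """Return a copy of the request body with the model set or removed."""
    updated = copy.deepcopy(body)
    if model_uri is None:
        # No model property: the service uses the default base model.
        updated.pop("model", None)
    else:
        # Pin a specific base model or custom speech model by its URI.
        updated["model"] = {"self": model_uri}
    return updated
```

Returning a copy leaves the original body untouched, so the same template can be reused for several submissions with different models.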

Language identification

To identify languages with the batch transcription REST API, use the languageIdentification property in the body of your Transcriptions - Submit request.

Warning

Batch transcription only supports language identification for default base models. If both language identification and a custom model are specified in the transcription request, the service falls back to use the base models for the specified candidate languages. This might result in unexpected recognition results.

If your speech to text scenario requires both language identification and custom models, use real-time speech to text instead of batch transcription.

The following example shows the usage of the languageIdentification property with four candidate languages. For more information about request properties, see Create a batch transcription.

{
    <...>

    "properties": {
        <...>

        "languageIdentification": {
            "candidateLocales": [
                "en-US",
                "ja-JP",
                "zh-CN",
                "hi-IN"
            ]
        },
        <...>
    }
}
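Because language identification falls back to base models when a custom model is also specified, a client may want to surface that conflict before it submits. The following sketch adds the languageIdentification property to a request body and emits a warning when a model property is present. The function name with_language_identification is illustrative, and the warning text is not a service message.

```python
import copy
import warnings


def with_language_identification(body, candidate_locales):
    """Return a copy of the body with languageIdentification configured."""
    updated = copy.deepcopy(body)
    if updated.get("model"):
        # The service falls back to base models for the candidate locales.
        warnings.warn(
            "languageIdentification is set together with a model; "
            "batch transcription falls back to base models.",
            stacklevel=2,
        )
    # languageIdentification belongs inside the "properties" object.
    properties = updated.setdefault("properties", {})
    properties["languageIdentification"] = {
        "candidateLocales": list(candidate_locales),
    }
    return updated
```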

Specify a destination container URL

The transcription result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. In that case, when the transcription job is deleted, the transcription result data is also deleted.

You can store the results of a batch transcription in a writable Azure Blob storage container by using the destinationContainerUrl option in the batch transcription creation request. This option supports only an ad hoc SAS URI. It doesn't support the Trusted Azure services security mechanism or access policy based SAS. The storage account resource of the destination container must allow all external traffic.

If you want to store the transcription results in an Azure Blob storage container by using the Trusted Azure services security mechanism, consider using Bring-your-own-storage (BYOS). For more information, see Use the Bring your own storage (BYOS) Speech resource for speech to text.
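As a rough pre-flight check for the ad hoc SAS requirement described above, the following sketch inspects the query string of a destination URL. It's a heuristic built on an assumption about Azure Storage SAS conventions: the sig parameter carries the signature, and the si parameter references a stored access policy. It doesn't verify that the token is valid, unexpired, or writable. The function name looks_like_ad_hoc_sas is illustrative.

```python
from urllib.parse import parse_qs, urlsplit


def looks_like_ad_hoc_sas(url):
    """Heuristically check that a URL carries an ad hoc SAS token."""
    query = parse_qs(urlsplit(url).query)
    # Assumption: "sig" is the SAS signature; "si" names a stored access
    # policy, which the destinationContainerUrl option doesn't support.
    return "sig" in query and "si" not in query
```

A URL without a query string, or one that includes an si parameter, returns False and can be rejected before the transcription request is submitted.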

Last updated on 2026-06-09