What is batch transcription?

Use batch transcription to transcribe a large amount of audio data in storage. Both the Speech to text REST API and Speech CLI support batch transcription.

You should provide multiple files per request or point to an Azure Blob Storage container with the audio files to transcribe. The batch transcription service can handle a large number of submitted transcriptions. The service transcribes the files concurrently, which reduces the turnaround time.

How does it work?

With batch transcriptions, you submit the audio data, and then retrieve transcription results asynchronously. The service transcribes the audio data and stores the results in a storage container. You can then retrieve the results from the storage container.

To use the batch transcription REST API:

Locate audio files for batch transcription - You can upload your own data or use existing audio files via public URI or shared access signature (SAS) URI.
Create a batch transcription - Submit the transcription job with parameters such as the audio files, the transcription language, and the transcription model.
Get batch transcription results - Check transcription status and retrieve transcription results asynchronously.
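The create step above can be sketched in Python. This minimal example only builds the request and doesn't send it. The endpoint path, the v3.2 API version, the Ocp-Apim-Subscription-Key header, and the body fields (contentUrls, locale, displayName) aren't stated in this article, so treat them as assumptions to verify against the current REST API reference. The region, key, and audio URI are placeholders.

```python
import json
import urllib.request


def build_create_request(region, key, content_urls, locale="en-US",
                         display_name="My batch transcription"):
    """Build (but do not send) a request that creates a batch transcription.

    Assumed, not taken from this article: the v3.2 endpoint path,
    the header name, and the body field names.
    """
    url = (f"https://{region}.api.cognitive.microsoft.com"
           "/speechtotext/v3.2/transcriptions")
    body = {
        "contentUrls": list(content_urls),
        "locale": locale,
        "displayName": display_name,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_request(
    region="eastus",                                   # placeholder region
    key="YOUR_SPEECH_RESOURCE_KEY",                    # placeholder key
    content_urls=["https://example.com/audio1.wav"],   # placeholder URI
)
```

To submit the job, you would pass the request to urllib.request.urlopen and keep the returned job reference for the status checks in the next step.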

Important

The service schedules batch transcription jobs on a best-effort basis. At peak hours, it might take up to 30 minutes for a transcription job to start processing and up to 24 hours to complete. To check the current status of a batch transcription job, see the get batch transcription results step.

Best practices for improving performance

Request size: Batch transcription is asynchronous, and each region processes requests one at a time. Submitting jobs at a higher rate doesn't speed up processing. For example, sending 600 or 6,000 requests per minute has no effect on throughput. Submit about 1,000 files in a single Transcription_Create request to send fewer requests overall.
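One way to follow the request-size guidance is to group audio URIs before you submit them. The following sketch assumes you already have a flat list of URIs; the helper name chunk_files and the example URIs are hypothetical.

```python
def chunk_files(file_urls, max_per_request=1000):
    """Split a list of audio URIs into groups of at most max_per_request."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [file_urls[i:i + max_per_request]
            for i in range(0, len(file_urls), max_per_request)]


# 2,500 hypothetical files become three requests instead of 2,500.
urls = [f"https://example.com/audio/{n}.wav" for n in range(2500)]
batches = chunk_files(urls)
print([len(b) for b in batches])  # [1000, 1000, 500]
```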

Time distribution: Distribute your requests over time. Submit them across several hours rather than sending them all within a few minutes. Backend processing maintains a stable performance level due to fixed bandwidth, so sending requests too quickly doesn't improve performance.
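As a rough illustration of spreading requests over time, the following sketch computes evenly spaced submission times. The four-hour window and the helper name submission_schedule are assumptions chosen for the example, not values prescribed by the service.

```python
from datetime import datetime, timedelta


def submission_schedule(num_requests, start, window_hours=4.0):
    """Return evenly spaced submission times across a window."""
    if num_requests <= 0:
        return []
    step = timedelta(hours=window_hours) / num_requests
    return [start + i * step for i in range(num_requests)]


# Four requests over four hours are submitted one hour apart.
schedule = submission_schedule(4, datetime(2025, 1, 1, 0, 0))
for when in schedule:
    print(when.isoformat())
```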

Job monitoring: When monitoring job status, polling every few seconds is unnecessary. If you submit multiple jobs, the service processes only the first job initially; subsequent jobs wait until the first job completes. Polling all jobs frequently increases system load without benefit. Checking the status every 10 minutes is sufficient, and polling more often than once per minute isn't recommended.

Because of the sequential processing, you can get job status by checking only a subset of the files: check the first 100 files, and if they're not completed, later batches are likely not completed either. Wait at least one minute (ideally five minutes) before checking again.
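The monitoring guidance above can be sketched as a polling loop with a long interval. In this example, get_status is a hypothetical callable that returns the job's status string, and the status names used here (NotStarted, Running, Succeeded, Failed) are assumptions to check against the API reference. The sleep function is injectable so that the demonstration doesn't actually wait.

```python
import time


def wait_for_job(get_status, poll_interval_s=600, max_polls=144,
                 sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal status.

    The default interval is 10 minutes. Intervals under one minute are
    rejected because polling that often isn't recommended.
    """
    if poll_interval_s < 60:
        raise ValueError("poll_interval_s must be at least 60 seconds")
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed"):  # assumed terminal statuses
            return status
        sleep(poll_interval_s)
    return "TimedOut"  # local marker, not a service status


# Demonstration with a fake status source and a fake sleep.
fake_statuses = iter(["NotStarted", "Running", "Succeeded"])
waits = []
result = wait_for_job(lambda: next(fake_statuses), sleep=waits.append)
print(result, waits)  # Succeeded [600, 600]
```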

Avoid peak traffic for API calls: Minimize the ListFiles, Update, and Get API calls during peak traffic times. These calls behave similarly to the Create call.

Load balancing: To optimize throughput for large-scale batch transcription, consider distributing your jobs across multiple supported Azure regions. This approach can help balance load and reduce overall processing time, provided your data and compliance requirements allow for multiregion usage. Review region availability and ensure your storage and resources are accessible from each region you plan to use.
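If multiregion usage is allowed for your data, a simple round-robin assignment is one way to distribute jobs. The region names below are placeholders, and the helper assign_regions is hypothetical; confirm region availability before relying on any of them.

```python
from itertools import cycle


def assign_regions(batches, regions):
    """Pair each batch with a region in round-robin order."""
    if not regions:
        raise ValueError("at least one region is required")
    return list(zip(cycle(regions), batches))


# Five hypothetical batches spread over two placeholder regions.
plan = assign_regions(["b1", "b2", "b3", "b4", "b5"], ["eastus", "westeurope"])
for region, batch in plan:
    print(region, batch)
```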

Roughly estimate the latency

The latency is the end-to-end time it takes to transcribe a batch of audio data. It's hard to estimate because it's determined by multiple factors:

The total audio length included in the request.
Whether additional features are required, such as diarization and language identification.
The length of the job queue in the system at that time.
The system's hardware resource configuration.

But we are confident that the 90th percentile latency is less than 6 hours. We define a normalized latency calculation method as follows:

Normalized Latency = ProcessDuration - AudioLength/5

For example, when the AudioLength is 120 minutes and the end-to-end ProcessDuration is 30 minutes, the normalized latency is 30 - 120/5 = 6 minutes. The divisor of 5 means the system processes audio at roughly 5x real-time speed, so AudioLength/5 approximates the processing time. The remaining 6 minutes represents the queue wait time. Under normal circumstances, the queue wait time doesn't exceed six hours. In extreme cases, it might reach 24 hours or longer, because a very long pending queue increases the end-to-end ProcessDuration.
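The normalized latency formula can be expressed as a small helper. This is a sketch of the calculation only; the function and parameter names are hypothetical, and all values are in minutes.

```python
def normalized_latency(process_duration_min, audio_length_min, speed_factor=5):
    """Return ProcessDuration - AudioLength / speed_factor, in minutes."""
    return process_duration_min - audio_length_min / speed_factor


# 120 minutes of audio finished in 30 minutes end to end.
print(normalized_latency(30, 120))  # 6.0
```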

Last updated on 2026-07-17
