The fast transcription API transcribes audio files and returns results synchronously, faster than real time. Use fast transcription in scenarios where you need the transcript of an audio recording as quickly as possible with predictable latency, such as:
- Quick audio or video transcription, subtitles, and editing.
- Meeting notes
- Voicemail
Unlike the batch transcription API, the fast transcription API produces transcriptions only in the display (not lexical) form. The display form is a more human-readable form of the transcription that includes punctuation and capitalization. For example, the lexical form "i owe you twenty dollars" appears in the display form as "I owe you $20."
Prerequisites
- An Azure Speech resource in one of the regions where the fast transcription API is available. For the current list of supported regions, see the Speech service regions table.
- An audio file (less than 2 hours long and less than 300 MB in size) in one of the formats and codecs supported by the batch transcription API: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in WAV container, MULAW in WAV container, AMR, WebM, and SPEEX. For more information, see supported audio formats.
Upload audio
You can provide audio data to fast transcription in the following ways:
- Inline audio upload
--form 'audio=@"YourAudioFile"'
- Audio from a public URL
--form 'definition="{"audioUrl": "https://crbn.us/hello.wav"}"'
In the sections below, inline audio upload is used as an example.
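For completeness, a full request that pulls audio from a public URL instead of uploading it inline might look like the following sketch (same endpoint, API version, and key placeholder as the inline example later in this article):

```bash
curl --location 'https://YourServiceRegion.api.cognitive.azure.cn/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' \
--form 'definition="{
    "audioUrl": "https://crbn.us/hello.wav",
    "locales":["en-US"]}"'
```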
Use the fast transcription API
This guide shows how to use the fast transcription API (via Transcriptions - Transcribe) in the following scenarios:
- Known locale specified: Transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize the latency.
- Language identification on: Transcribe an audio file with language identification on. If you're not sure about the locale of the audio file, you can turn on language identification to let the Speech service identify the locale (one locale per audio).
- Multi-lingual transcription (preview): Transcribe an audio file with the latest multi-lingual speech transcription model. If your audio contains multi-lingual contents that you want to transcribe continuously and accurately, you can use the latest multi-lingual speech transcription model without specifying the locale codes.
- Diarization on: Transcribe an audio file with diarization on. Diarization distinguishes between different speakers in the conversation. The Speech service provides information about which speaker was speaking a particular part of the transcribed speech.
- Multi-channel on: Transcribe an audio file that has one or two channels. Multi-channel transcriptions are useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging.
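Each scenario comes down to what you put in the `definition` form body. The following sketches are illustrative only, based on the properties described in the request configuration options section later in this guide; the locale values are placeholders:

```jsonc
// Known locale specified
{ "locales": ["en-US"] }

// Language identification on: list the candidate locales
{ "locales": ["en-US", "ja-JP"] }

// Multi-lingual transcription (preview): omit the locales property entirely
{ }

// Diarization on
{ "locales": ["en-US"], "diarization": { "enabled": true, "maxSpeakers": 2 } }

// Multi-channel on: transcribe stereo channels independently
{ "locales": ["en-US"], "channels": [0, 1] }
```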
Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.
The following example shows how to transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize the latency.
- Replace
YourSpeechResourceKeywith your Speech resource key. - Replace
YourServiceRegionwith your Speech resource region. - Replace
YourAudioFilewith the path to your audio file.
Important
For the recommended keyless authentication with Microsoft Entra ID, replace `--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey'` with `--header "Authorization: Bearer YourAccessToken"`. For more information about keyless authentication, see the role-based access control how-to guide.
```bash
curl --location 'https://YourServiceRegion.api.cognitive.azure.cn/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"]}"'
```
Construct the form definition according to the following instructions:
- Set the optional (but recommended) `locales` property, which should match the expected locale of the audio data to transcribe. In this example, the locale is set to `en-US`. For more information about the supported locales, see speech to text supported languages.
For more information about locales and other properties for the fast transcription API, see the request configuration options section later in this guide.
The response includes `durationMilliseconds`, `offsetMilliseconds`, and more. The `combinedPhrases` property contains the full transcriptions for all speakers.
```json
{
"durationMilliseconds": 182439,
"combinedPhrases": [
{
"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And you're currently paying those loans off monthly, is that right? Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
}
],
"phrases": [
{
"offsetMilliseconds": 960,
"durationMilliseconds": 640,
"text": "Good afternoon.",
"words": [
{
"text": "Good",
"offsetMilliseconds": 960,
"durationMilliseconds": 240
},
{
"text": "afternoon.",
"offsetMilliseconds": 1200,
"durationMilliseconds": 400
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 1600,
"durationMilliseconds": 640,
"text": "This is Sam.",
"words": [
{
"text": "This",
"offsetMilliseconds": 1600,
"durationMilliseconds": 240
},
{
"text": "is",
"offsetMilliseconds": 1840,
"durationMilliseconds": 120
},
{
"text": "Sam.",
"offsetMilliseconds": 1960,
"durationMilliseconds": 280
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 2240,
"durationMilliseconds": 1040,
"text": "Thank you for calling Contoso.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 2240,
"durationMilliseconds": 200
},
{
"text": "you",
"offsetMilliseconds": 2440,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 2520,
"durationMilliseconds": 120
},
{
"text": "calling",
"offsetMilliseconds": 2640,
"durationMilliseconds": 200
},
{
"text": "Contoso.",
"offsetMilliseconds": 2840,
"durationMilliseconds": 440
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 3280,
"durationMilliseconds": 640,
"text": "How can I help?",
"words": [
{
"text": "How",
"offsetMilliseconds": 3280,
"durationMilliseconds": 120
},
{
"text": "can",
"offsetMilliseconds": 3440,
"durationMilliseconds": 120
},
{
"text": "I",
"offsetMilliseconds": 3560,
"durationMilliseconds": 40
},
{
"text": "help?",
"offsetMilliseconds": 3600,
"durationMilliseconds": 320
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 5040,
"durationMilliseconds": 400,
"text": "Hi there.",
"words": [
{
"text": "Hi",
"offsetMilliseconds": 5040,
"durationMilliseconds": 240
},
{
"text": "there.",
"offsetMilliseconds": 5280,
"durationMilliseconds": 160
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 5440,
"durationMilliseconds": 800,
"text": "My name is Mary.",
"words": [
{
"text": "My",
"offsetMilliseconds": 5440,
"durationMilliseconds": 80
},
{
"text": "name",
"offsetMilliseconds": 5520,
"durationMilliseconds": 120
},
{
"text": "is",
"offsetMilliseconds": 5640,
"durationMilliseconds": 80
},
{
"text": "Mary.",
"offsetMilliseconds": 5720,
"durationMilliseconds": 520
}
],
"locale": "en-US",
"confidence": 0.93554276
},
// More transcription results...
// Redacted for brevity
{
"offsetMilliseconds": 180320,
"durationMilliseconds": 680,
"text": "Thank you for your help.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 180320,
"durationMilliseconds": 160
},
{
"text": "you",
"offsetMilliseconds": 180480,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 180560,
"durationMilliseconds": 120
},
{
"text": "your",
"offsetMilliseconds": 180680,
"durationMilliseconds": 120
},
{
"text": "help.",
"offsetMilliseconds": 180800,
"durationMilliseconds": 200
}
],
"locale": "en-US",
"confidence": 0.92022026
},
{
"offsetMilliseconds": 181960,
"durationMilliseconds": 280,
"text": "Thank you.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 181960,
"durationMilliseconds": 200
},
{
"text": "you.",
"offsetMilliseconds": 182160,
"durationMilliseconds": 80
}
],
"locale": "en-US",
"confidence": 0.92022026
}
]
}
```
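If you call the REST API directly, you can pull the transcript out of the JSON response with a few lines of code. A minimal sketch, assuming the response shown above is saved as `response.json`:

```python
import json

with open("response.json", encoding="utf-8") as f:
    result = json.load(f)

# Full transcript across all speakers.
print(result["combinedPhrases"][0]["text"])

# Per-phrase timing details.
for phrase in result["phrases"]:
    end_ms = phrase["offsetMilliseconds"] + phrase["durationMilliseconds"]
    print(f'[{phrase["offsetMilliseconds"]}ms - {end_ms}ms] {phrase["text"]}')
```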
Note
Speech service is an elastic service. If you receive a 429 error code (too many requests), follow the best practices to mitigate throttling during autoscaling.
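One common mitigation is to retry with exponential backoff. A minimal sketch in Python, assuming the third-party `requests` library and the same URL, headers, and form fields as the curl example above:

```python
import time

import requests


def transcribe_with_backoff(url, headers, files, max_retries=5):
    """POST the transcription request, retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, files=files)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Honor the Retry-After header if the service provides one;
        # otherwise back off exponentially (1s, 2s, 4s, ...).
        delay = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("Request was still throttled after all retries")
```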
Request configuration options
Here are some property options to configure a transcription when you call the Transcriptions - Transcribe operation.
| Property | Description | Required or optional |
|---|---|---|
| `channels` | The list of zero-based indices of the channels to transcribe separately. Up to two channels are supported unless diarization is enabled. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If this isn't desirable, channels can be transcribed independently without merging. To transcribe the channels of a stereo audio file separately, specify `[0,1]`, `[0]`, or `[1]`; otherwise, stereo audio is merged to mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, you can't set the channels property to `[0,1]`, because the Speech service doesn't support diarization of multiple channels. For mono audio, the channels property is ignored, and the audio is always transcribed as a single channel. | Optional |
| `diarization` | The diarization configuration. Diarization is the process of recognizing and separating multiple speakers in one audio channel. For example, specify `"diarization": {"maxSpeakers": 2, "enabled": true}`. The transcription file then contains speaker entries (such as `"speaker": 0` or `"speaker": 1`) for each transcribed phrase. | Optional |
| `locales` | The list of locales that should match the expected locale of the audio data to transcribe. If you know the locale of the audio file, specifying it improves transcription accuracy and minimizes latency. If a single locale is specified, that locale is used for transcription. If you're not sure about the locale, you can specify multiple locales to use language identification; language identification might be more accurate with a more precise list of candidate locales. If you don't specify any locale, the Speech service uses the latest multi-lingual model to identify the locale and transcribe continuously. You can get the latest supported languages via the Transcriptions - List Supported Locales REST API (API version 2024-11-15 or later). For more information about locales, see the Speech service language support documentation. | Optional, but recommended if you know the expected locale. |
| `phraseList` | A list of words or phrases provided ahead of time to help improve recognition accuracy. Adding a phrase to a phrase list increases its importance, making it more likely to be recognized. For example, specify `"phraseList": {"phrases": ["Contoso", "Jessie", "Rehaan"]}`. Phrase list is supported via API version 2025-10-15. For more information, see Improve recognition accuracy with phrase list. | Optional |
| `profanityFilterMode` | Specifies how to handle profanity in recognition results. Accepted values are `None` to disable profanity filtering, `Masked` to replace profanity with asterisks, `Removed` to remove all profanity from the result, or `Tags` to add profanity tags. The default value is `Masked`. | Optional |
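Several of these properties can be combined in a single `definition`. For example, an illustrative form body that requests diarization, masked profanity, and a phrase list (values are placeholders; remember that `channels` set to `[0,1]` can't be combined with diarization):

```bash
--form 'definition="{
    "locales": ["en-US"],
    "diarization": {"maxSpeakers": 2, "enabled": true},
    "profanityFilterMode": "Masked",
    "phraseList": {"phrases": ["Contoso", "Jessie", "Rehaan"]}}"'
```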
Reference documentation | Package (PyPI) | GitHub Samples
Prerequisites
- An Azure subscription. Create one for trial.
- Python 3.9 or later. If you don't have a suitable version of Python installed, follow the instructions in the VS Code Python Tutorial for the easiest way to install Python on your operating system.
- An AI services resource created in one of the supported regions. For more information about region availability, see Region support.
- A sample `.wav` audio file to transcribe.
Microsoft Entra ID prerequisites
For the recommended keyless authentication with Microsoft Entra ID, you need to:
- Install the Azure CLI used for keyless authentication with Microsoft Entra ID.
- Assign the `Cognitive Services User` role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.
Setup
Create a new folder named `transcription-quickstart` and go to the quickstart folder with the following command:

```console
mkdir transcription-quickstart && cd transcription-quickstart
```

Create and activate a virtual Python environment to install the packages you need for this tutorial. We recommend that you always use a virtual or conda environment when installing Python packages; otherwise, you can break your global installation of Python. If you already have Python 3.9 or higher installed, create a virtual environment by using the following commands:
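```console
python -m venv .venv
source .venv/bin/activate
```

On Windows, activate with `.venv\Scripts\activate` instead of the `source` command.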
When you activate the Python environment, running `python` or `pip` from the command line uses the Python interpreter in the `.venv` folder of your application. Use the `deactivate` command to exit the Python virtual environment. You can reactivate it later when needed.

Create a file named `requirements.txt` and add the following packages to it:

```text
azure-ai-transcription
azure-identity
```

Install the packages:

```console
pip install -r requirements.txt
```
Note
For Microsoft Entra ID authentication (recommended for production), install `azure-identity` and configure authentication as described in the Microsoft Entra ID prerequisites section.
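As a sketch of what keyless authentication could look like in this quickstart's code, assuming the transcription client accepts an azure-identity token credential the same way the Java client shown later in this article does:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.transcription import TranscriptionClient

# DefaultAzureCredential automatically picks up Azure CLI, managed identity,
# or environment-variable credentials.
client = TranscriptionClient(
    endpoint=endpoint,  # the same AZURE_SPEECH_ENDPOINT value as in the key-based example
    credential=DefaultAzureCredential(),
)
```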
Code
Create a file named `transcribe_audio_file.py` with the following code:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import TranscriptionContent, TranscriptionOptions

# Get configuration from environment variables
endpoint = os.environ["AZURE_SPEECH_ENDPOINT"]
api_key = os.environ["AZURE_SPEECH_API_KEY"]

# Create the transcription client
client = TranscriptionClient(endpoint=endpoint, credential=AzureKeyCredential(api_key))

# Path to your audio file (replace with your own file path)
audio_file_path = "<path-to-your-audio-file.wav>"

# Open and read the audio file
with open(audio_file_path, "rb") as audio_file:
    # Create transcription options
    options = TranscriptionOptions(locales=["en-US"])  # Specify the language

    # Create the request content
    request_content = TranscriptionContent(definition=options, audio=audio_file)

    # Transcribe the audio
    result = client.transcribe(request_content)

# Print the transcription result
print(f"Transcription: {result.combined_phrases[0].text}")

# Print detailed phrase information
if result.phrases:
    print("\nDetailed phrases:")
    for phrase in result.phrases:
        print(
            f"  [{phrase.offset_milliseconds}ms - "
            f"{phrase.offset_milliseconds + phrase.duration_milliseconds}ms]: "
            f"{phrase.text}"
        )
```

Reference: TranscriptionClient | TranscriptionContent | TranscriptionOptions | AzureKeyCredential
Replace `<path-to-your-audio-file.wav>` with the path to your audio file. The service supports WAV, MP3, FLAC, OGG, and other common audio formats.

Run the Python script:

```console
python transcribe_audio_file.py
```
Output
The script prints the transcription result to the console:
```text
Transcription: Hi there! This is a sample voice recording created for speech synthesis testing. The quick brown fox jumps over the lazy dog. Just a fun way to include every letter of the alphabet. Numbers, like 1, 2, 3, are spoken clearly. Let's see how well this voice captures tone, timing, and natural rhythm. This audio is provided by samplefiles.com.

Detailed phrases:
  [40ms - 4880ms]: Hi there! This is a sample voice recording created for speech synthesis testing.
  [5440ms - 8400ms]: The quick brown fox jumps over the lazy dog.
  [9040ms - 12240ms]: Just a fun way to include every letter of the alphabet.
  [12720ms - 16720ms]: Numbers, like 1, 2, 3, are spoken clearly.
  [17200ms - 22000ms]: Let's see how well this voice captures tone, timing, and natural rhythm.
  [22480ms - 25920ms]: This audio is provided by samplefiles.com.
```
Reference documentation | Package (Maven) | GitHub Samples
Prerequisites
- An Azure subscription. Create one for trial.
- Java Development Kit (JDK) 8 or later.
- Apache Maven for dependency management and building the project.
- An AI services resource in one of the supported regions. For more information about region availability, see Speech service supported regions.
- A sample `.wav` audio file to transcribe.
Set up the environment
Create a new folder named `transcription-quickstart` and navigate to it:

```console
mkdir transcription-quickstart && cd transcription-quickstart
```

Create a `pom.xml` file in the root of your project directory with the following content:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>transcription-quickstart</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <name>Speech Transcription Quickstart</name>
    <description>Quickstart sample for Azure Speech Transcription client library.</description>
    <url>https://github.com/Azure/azure-sdk-for-java</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-ai-speech-transcription</artifactId>
            <version>1.0.0-beta.2</version>
        </dependency>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-identity</artifactId>
            <version>1.18.1</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>.</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <mainClass>TranscriptionQuickstart</mainClass>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
```

Note

The `<sourceDirectory>.</sourceDirectory>` configuration tells Maven to look for Java source files in the current directory instead of the default `src/main/java` structure. This configuration change allows for a simpler flat project structure.

Install the dependencies:

```console
mvn clean install
```
Set environment variables
Your application must be authenticated to access the Speech service. The SDK supports both API key and Microsoft Entra ID authentication. It automatically detects which method to use based on the environment variables you set.
First, set the endpoint for your Speech resource. Replace `<your-speech-endpoint>` with your actual endpoint:
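For example, in a bash shell (the variable name matches what the quickstart code reads; adapt the syntax for PowerShell or cmd):

```bash
export AZURE_SPEECH_ENDPOINT="<your-speech-endpoint>"
```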
Then, choose one of the following authentication methods:
Option 1: API key authentication (recommended for getting started)
Set the API key environment variable:
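For example, in bash (`<your-speech-api-key>` is a placeholder for one of your resource's keys):

```bash
export AZURE_SPEECH_API_KEY="<your-speech-api-key>"
```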
Option 2: Microsoft Entra ID authentication (recommended for production)
Instead of setting `AZURE_SPEECH_API_KEY`, configure one of the following credential sources:

- Azure CLI: Run `az login` on your development machine.
- Managed identity: For apps running in Azure (App Service, Azure Functions, VMs).
- Environment variables: Set `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, and `AZURE_CLIENT_SECRET`.
- Visual Studio Code or IntelliJ: Sign in through your IDE.
You also need to assign the Cognitive Services User role to your identity:
```azurecli
az role assignment create --assignee <your-identity> \
    --role "Cognitive Services User" \
    --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<speech-resource-name>
```
Note
After setting environment variables on Windows, restart any running programs that need to read them, including the console window. On Linux or macOS, run source ~/.bashrc (or your equivalent shell configuration file) to make the changes effective.
Create the application
Create a file named `TranscriptionQuickstart.java` in your project directory with the following code:

```java
import com.azure.ai.speech.transcription.TranscriptionClient;
import com.azure.ai.speech.transcription.TranscriptionClientBuilder;
import com.azure.ai.speech.transcription.models.AudioFileDetails;
import com.azure.ai.speech.transcription.models.TranscriptionOptions;
import com.azure.ai.speech.transcription.models.TranscriptionResult;
import com.azure.core.credential.KeyCredential;
import com.azure.core.util.BinaryData;
import com.azure.identity.DefaultAzureCredentialBuilder;

import java.nio.file.Files;
import java.nio.file.Paths;

public class TranscriptionQuickstart {
    public static void main(String[] args) {
        try {
            // Get credentials from environment variables
            String endpoint = System.getenv("AZURE_SPEECH_ENDPOINT");
            String apiKey = System.getenv("AZURE_SPEECH_API_KEY");

            // Create client with API key or Entra ID authentication
            TranscriptionClientBuilder builder = new TranscriptionClientBuilder()
                .endpoint(endpoint);

            TranscriptionClient client;
            if (apiKey != null && !apiKey.isEmpty()) {
                // Use API key authentication
                client = builder.credential(new KeyCredential(apiKey)).buildClient();
            } else {
                // Use Entra ID authentication
                client = builder.credential(new DefaultAzureCredentialBuilder().build()).buildClient();
            }

            // Load audio file
            String audioFilePath = "<path-to-your-audio-file.wav>";
            byte[] audioData = Files.readAllBytes(Paths.get(audioFilePath));

            // Create audio file details
            AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

            // Transcribe
            TranscriptionOptions options = new TranscriptionOptions(audioFileDetails);
            TranscriptionResult result = client.transcribe(options);

            // Print result
            System.out.println("Transcription:");
            result.getCombinedPhrases().forEach(phrase ->
                System.out.println(phrase.getText())
            );
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
```
Replace `<path-to-your-audio-file.wav>` with the path to your audio file.
Run the application
Run the application using Maven:

```console
mvn compile exec:java
```
Clean up resources
When you're done with the quickstart, you can delete the project folder:

```console
rm -rf transcription-quickstart
```