Text to speech REST API

The Speech service allows you to convert text into synthesized speech and get a list of supported voices for a region by using a REST API. In this article, you learn about authorization options, query options, how to structure a request, and how to interpret a response.

Tip

Use cases for the text to speech REST API are limited. Use it only in cases where you can't use the Speech SDK. For example, with the Speech SDK you can subscribe to events for more insights about the text to speech processing and results.

The text to speech REST API supports neural text to speech voices in many locales. Each available endpoint is associated with a region. An API key for the endpoint or region that you plan to use is required. Here are links to more information:

For a complete list of voices, see Language and voice support for the Speech service.
For information about regional availability, see Speech service supported regions.
For Microsoft Azure operated by 21Vianet endpoints, see this article about sovereign clouds.

Important

Costs vary for standard voices. For more information, see text to speech pricing.

Prerequisites

To use the text to speech REST API, you need:

An Azure account. If you don't have one, create a trial account.
A Speech resource in the Azure portal.
The resource key and endpoint from your Speech resource's Keys and Endpoint page.

Authentication

Each request requires an authorization header. This table illustrates which headers are supported for each feature:

Supported authorization header	Speech to text	Text to speech
`Ocp-Apim-Subscription-Key`	Yes	Yes
`Authorization: Bearer`	Yes	Yes

When you're using the Ocp-Apim-Subscription-Key header, only your resource key must be provided. For example:

'Ocp-Apim-Subscription-Key': 'YourSpeechResourceKey'

If you use the STS bearer-token flow with Authorization: Bearer, first make a request to the issueToken endpoint. In this request, you exchange your resource key for an access token that's valid for 10 minutes.

Another option is to use Microsoft Entra authentication that also uses the Authorization: Bearer header, but with a token issued via Microsoft Entra ID. See Use Microsoft Entra authentication.

How to get an STS access token

To get an STS access token, make a request to the issueToken endpoint by using Ocp-Apim-Subscription-Key and your resource key.

The issueToken endpoint has this format:

https://YourResourceName.cognitiveservices.azure.cn/sts/v1.0/issueToken

Replace YourResourceName with the name of your Speech resource.

Note

This endpoint requires your resource to have a custom subdomain configured. For resources without a custom domain, use the regional endpoint instead: https://<region>.api.cognitive.azure.cn/sts/v1.0/issueToken. Replace <region> with your resource's Azure region (for example, chinanorth2).
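
The choice between the two issueToken endpoint forms can be sketched as a small helper. This is a minimal sketch: the function name `issue_token_url` and its parameters are illustrative assumptions rather than part of the Speech service, and only the two URL patterns come from the text above.

```python
from typing import Optional


def issue_token_url(resource_name: Optional[str] = None,
                    region: Optional[str] = None) -> str:
    """Return the issueToken URL for a custom subdomain or for a region.

    Hypothetical helper: pass resource_name when the Speech resource has a
    custom subdomain; otherwise pass the Azure region (for example,
    "chinanorth2").
    """
    if resource_name:
        return f"https://{resource_name}.cognitiveservices.azure.cn/sts/v1.0/issueToken"
    if region:
        return f"https://{region}.api.cognitive.azure.cn/sts/v1.0/issueToken"
    raise ValueError("Provide either resource_name or region")
```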

Use the following samples to create your access token request.

HTTP sample

This example is a simple HTTP request to get a token. Replace YourSpeechResourceKey with your resource key for the Speech service. Replace YourResourceName with the name of your Speech resource.

POST /sts/v1.0/issueToken HTTP/1.1
Ocp-Apim-Subscription-Key: YourSpeechResourceKey
Host: YourResourceName.cognitiveservices.azure.cn
Content-type: application/x-www-form-urlencoded
Content-Length: 0

The body of the response contains the access token in JSON Web Token (JWT) format.

PowerShell sample

This example is a simple PowerShell script to get an access token. Replace YourSpeechResourceKey with your resource key for the Speech service. Replace YourResourceName with the name of your Speech resource.

$FetchTokenHeader = @{
  'Content-type'='application/x-www-form-urlencoded';
  'Content-Length'= '0';
  'Ocp-Apim-Subscription-Key' = 'YourSpeechResourceKey'
}

$OAuthToken = Invoke-RestMethod -Method POST `
    -Uri https://YourResourceName.cognitiveservices.azure.cn/sts/v1.0/issueToken `
    -Headers $FetchTokenHeader

# show the token received
$OAuthToken

cURL sample

cURL is a command-line tool available in Linux (and in the Windows Subsystem for Linux). This cURL command illustrates how to get an access token. Replace YourSpeechResourceKey with your resource key for the Speech service. Replace YourResourceName with the name of your Speech resource.

curl -v -X POST \
 "https://YourResourceName.cognitiveservices.azure.cn/sts/v1.0/issueToken" \
 -H "Content-type: application/x-www-form-urlencoded" \
 -H "Content-Length: 0" \
 -H "Ocp-Apim-Subscription-Key: YourSpeechResourceKey"

C# sample

This C# class illustrates how to get an access token. Pass your resource key for the Speech service when you instantiate the class. Replace YourResourceName with the name of your Speech resource.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class Authentication
{
    public static readonly string FetchTokenUri =
        "https://YourResourceName.cognitiveservices.azure.cn/sts/v1.0/issueToken";
    private string subscriptionKey;
    private string token;

    public Authentication(string subscriptionKey)
    {
        this.subscriptionKey = subscriptionKey;
        this.token = FetchTokenAsync(FetchTokenUri, subscriptionKey).Result;
    }

    public string GetAccessToken()
    {
        return this.token;
    }

    private async Task<string> FetchTokenAsync(string fetchUri, string subscriptionKey)
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
            UriBuilder uriBuilder = new UriBuilder(fetchUri);

            var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);
            Console.WriteLine("Token Uri: {0}", uriBuilder.Uri.AbsoluteUri);
            return await result.Content.ReadAsStringAsync();
        }
    }
}

Python sample

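This example is a simple Python script to get an access token. Replace REPLACE_WITH_YOUR_KEY with your resource key for the Speech service. Replace YourResourceName with the name of your Speech resource.

# The requests module must be installed.
# Run pip install requests if necessary.
import requests

subscription_key = 'REPLACE_WITH_YOUR_KEY'


def get_token(subscription_key):
    """Exchange the resource key for an access token (JWT) and return it."""
    fetch_token_url = 'https://YourResourceName.cognitiveservices.azure.cn/sts/v1.0/issueToken'
    headers = {
        'Ocp-Apim-Subscription-Key': subscription_key
    }
    response = requests.post(fetch_token_url, headers=headers)
    # Raise an exception on an HTTP error instead of returning an error body.
    response.raise_for_status()
    return response.text


if __name__ == '__main__':
    # Show the token received.
    print(get_token(subscription_key))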

How to use an access token

The access token should be sent to the service as the Authorization: Bearer <TOKEN> header. Each access token is valid for 10 minutes. You can get a new token at any time, but to minimize network traffic and latency, we recommend using the same token for nine minutes.
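
The nine-minute reuse recommendation can be sketched as a small cache. This is a minimal sketch under stated assumptions: the `TokenCache` class, its method names, and the injectable `clock` are hypothetical, and `fetch_token` stands for whatever function you use to call the issueToken endpoint.

```python
import time
from typing import Callable, Dict, Optional


class TokenCache:
    """Reuse an access token for nine minutes before fetching a new one."""

    def __init__(self, fetch_token: Callable[[], str],
                 max_age_seconds: float = 9 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_token = fetch_token  # caller-supplied; calls issueToken
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the cache can be tested
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached token, refreshing it once it is too old."""
        now = self._clock()
        if self._token is None or now - self._fetched_at >= self._max_age:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token

    def authorization_header(self) -> Dict[str, str]:
        """Build the Authorization: Bearer <TOKEN> header."""
        return {"Authorization": f"Bearer {self.get()}"}
```

Refreshing at nine minutes rather than ten leaves a margin, so a request that starts just before the refresh is less likely to carry an expired token.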

Important

Bearer tokens are scoped to the endpoint that issued them. A token obtained from YourResourceName.cognitiveservices.azure.cn works only for requests to that same host. A token from <region>.api.cognitive.azure.cn works only against regional Speech endpoints. If you receive a 401 error when using a Bearer token, use Ocp-Apim-Subscription-Key with your resource key instead, which works with all endpoint formats.

Here's a sample HTTP request to the text to speech REST API:

POST /cognitiveservices/v1 HTTP/1.1
Authorization: Bearer YOUR_ACCESS_TOKEN
Host: YourResourceName.cognitiveservices.azure.cn
Content-type: application/ssml+xml
Content-Length: 199
Connection: Keep-Alive

// Message body here...

Use Microsoft Entra authentication

To use Microsoft Entra authentication with the text to speech REST API, you need to create an access token. The steps to obtain the access token, which consists of the resource ID and a Microsoft Entra access token, are the same as when you use the Speech SDK. Follow the steps in Use Microsoft Entra authentication:

Create an AI Services resource for Speech
Configure the Speech resource for Microsoft Entra authentication
Get a Microsoft Entra access token
Get the Speech resource ID

After you obtain the resource ID and the Microsoft Entra access token, construct the actual access token in this format:

aad#YOUR_RESOURCE_ID#YOUR_MICROSOFT_ENTRA_ACCESS_TOKEN

You need to include the "aad#" prefix and the "#" (hash) separator between resource ID and the access token.
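
As a minimal sketch, the format above can be assembled in Python as follows. The function name `entra_authorization_header` is a hypothetical helper, and the two arguments are assumed to be the values obtained in the earlier steps.

```python
from typing import Dict


def entra_authorization_header(resource_id: str,
                               entra_access_token: str) -> Dict[str, str]:
    """Combine the resource ID and the Microsoft Entra token into a Bearer header."""
    if not resource_id or not entra_access_token:
        raise ValueError("Both the resource ID and the access token are required")
    # The "aad#" prefix and the "#" separator are both mandatory.
    return {"Authorization": f"Bearer aad#{resource_id}#{entra_access_token}"}
```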

Here's a sample HTTP request to the text to speech REST API:

POST /cognitiveservices/v1 HTTP/1.1
Authorization: Bearer YOUR_ACCESS_TOKEN
Host: YourResourceName.cognitiveservices.azure.cn
Content-type: application/ssml+xml
Content-Length: 199
Connection: Keep-Alive

// Message body here...

To learn more about Microsoft Entra access tokens, including token lifetime, visit Access tokens in the Microsoft identity platform.

Get a list of voices

You can use your Speech resource endpoint to get a full list of voices. Use the /tts/cognitiveservices/voices/list path with your resource endpoint. For example, use the https://YourResourceName.cognitiveservices.azure.cn/tts/cognitiveservices/voices/list endpoint. For a list of all supported regions, see the regions documentation.

Note

Voices and styles in preview are only available in a subset of regions. For the current list of regions that support voices and styles in public preview, see the Speech service regions table.

Request headers

This table lists required and optional headers for text to speech requests:

Header	Description	Required or optional
`Ocp-Apim-Subscription-Key`	Your Speech resource key.	Either this header or `Authorization` is required.
`Authorization`	An authorization token preceded by the word `Bearer`. For more information, see Authentication.	Either this header or `Ocp-Apim-Subscription-Key` is required.

Request body

A body isn't required for GET requests to this endpoint.

Sample request

This request requires only an authorization header:

GET /tts/cognitiveservices/voices/list HTTP/1.1
Host: YourResourceName.cognitiveservices.azure.cn
Ocp-Apim-Subscription-Key: YOUR_RESOURCE_KEY

Here's an example curl command:

curl --location --request GET 'https://YourResourceName.cognitiveservices.azure.cn/tts/cognitiveservices/voices/list' \
--header 'Ocp-Apim-Subscription-Key: YOUR_RESOURCE_KEY'

Sample response

You should receive a response with a JSON body that includes all supported locales, voices, gender, styles, and other details. The WordsPerMinute property for each voice can be used to estimate the length of the output speech. This JSON example shows partial results to illustrate the structure of a response:

[
    // Redacted for brevity
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
        "DisplayName": "Jenny",
        "LocalName": "Jenny",
        "ShortName": "en-US-JennyNeural",
        "Gender": "Female",
        "Locale": "en-US",
        "LocaleName": "English (United States)",
        "StyleList": [
          "assistant",
          "chat",
          "customerservice",
          "newscast",
          "angry",
          "cheerful",
          "sad",
          "excited",
          "friendly",
          "terrified",
          "shouting",
          "unfriendly",
          "whispering",
          "hopeful"
        ],
        "SampleRateHertz": "48000",
        "VoiceType": "Neural",
        "Status": "GA",
        "WordsPerMinute": "152"
    },
    // Redacted for brevity
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-US, JennyMultilingualNeural)",
        "DisplayName": "Jenny Multilingual",
        "LocalName": "Jenny Multilingual",
        "ShortName": "en-US-JennyMultilingualNeural",
        "Gender": "Female",
        "Locale": "en-US",
        "LocaleName": "English (United States)",
        "SecondaryLocaleList": [
          "de-DE",
          "en-AU",
          "en-CA",
          "en-GB",
          "es-ES",
          "es-MX",
          "fr-CA",
          "fr-FR",
          "it-IT",
          "ja-JP",
          "ko-KR",
          "pt-BR",
          "zh-CN"
        ],
        "SampleRateHertz": "48000",
        "VoiceType": "Neural",
        "Status": "GA",
        "WordsPerMinute": "190"
    },
    // Redacted for brevity
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (ga-IE, OrlaNeural)",
        "DisplayName": "Orla",
        "LocalName": "Orla",
        "ShortName": "ga-IE-OrlaNeural",
        "Gender": "Female",
        "Locale": "ga-IE",
        "LocaleName": "Irish (Ireland)",
        "SampleRateHertz": "48000",
        "VoiceType": "Neural",
        "Status": "GA",
        "WordsPerMinute": "139"
    },
    // Redacted for brevity
    {
        "Name": "Microsoft Server Speech Text to Speech Voice (zh-CN, YunxiNeural)",
        "DisplayName": "Yunxi",
        "LocalName": "云希",
        "ShortName": "zh-CN-YunxiNeural",
        "Gender": "Male",
        "Locale": "zh-CN",
        "LocaleName": "Chinese (Mandarin, Simplified)",
        "StyleList": [
          "narration-relaxed",
          "embarrassed",
          "fearful",
          "cheerful",
          "disgruntled",
          "serious",
          "angry",
          "sad",
          "depressed",
          "chat",
          "assistant",
          "newscast"
        ],
        "SampleRateHertz": "48000",
        "VoiceType": "Neural",
        "Status": "GA",
        "RolePlayList": [
          "Narrator",
          "YoungAdultMale",
          "Boy"
        ],
        "WordsPerMinute": "293"
    },
    // Redacted for brevity
]
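
Once the response is decoded into a list of dictionaries, the fields can be used as in the following sketch. It assumes the property names shown in the sample response; the function names are hypothetical, and the duration estimate is a rough approximation that counts whitespace-separated words, so it doesn't suit languages written without spaces.

```python
from typing import Dict, List


def voices_for_locale(voices: List[Dict], locale: str) -> List[str]:
    """Return ShortName values for voices whose primary or secondary locale matches."""
    wanted = locale.lower()
    names = []
    for voice in voices:
        locales = [voice["Locale"]] + voice.get("SecondaryLocaleList", [])
        if any(item.lower() == wanted for item in locales):
            names.append(voice["ShortName"])
    return names


def estimate_seconds(text: str, voice: Dict) -> float:
    """Rough speech length: word count divided by the voice's WordsPerMinute."""
    words = len(text.split())
    return words / int(voice["WordsPerMinute"]) * 60


# Trimmed entries in the shape of the sample response.
voices = [
    {"ShortName": "en-US-JennyNeural", "Locale": "en-US", "WordsPerMinute": "152"},
    {"ShortName": "en-US-JennyMultilingualNeural", "Locale": "en-US",
     "SecondaryLocaleList": ["de-DE", "fr-FR"], "WordsPerMinute": "190"},
    {"ShortName": "ga-IE-OrlaNeural", "Locale": "ga-IE", "WordsPerMinute": "139"},
]

print(voices_for_locale(voices, "de-DE"))  # ['en-US-JennyMultilingualNeural']
print(estimate_seconds("word " * 76, voices[0]))  # 30.0
```

WordsPerMinute arrives as a string in the response, which is why the sketch converts it with `int` before dividing.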

HTTP status codes

The HTTP status code for each response indicates success or common errors.

HTTP status code	Description	Possible reason
200	OK	The request was successful.
400	Bad request	A required parameter is missing, empty, or null. Or, the value passed to either a required or optional parameter is invalid. A common reason is a header that's too long.
401	Unauthorized	The request isn't authorized. Make sure your resource key or token is valid and in the correct region.
429	Too many requests	You exceeded the quota or rate of requests allowed for your resource.
502	Bad gateway	There's a network or server-side problem. This status might also indicate invalid headers.

Convert text to speech

The cognitiveservices/v1 endpoint allows you to convert text to speech by using Speech Synthesis Markup Language (SSML).

Regions and endpoints

These regions are supported for text to speech through the REST API. Be sure to select the endpoint that matches your Speech resource region.

Standard voices

Use this table to determine availability of neural voices by region or endpoint:

Region	Endpoint
China East 2	`https://chinaeast2.tts.speech.azure.cn/cognitiveservices/v1`
China North 2	`https://chinanorth2.tts.speech.azure.cn/cognitiveservices/v1`
China North 3	`https://chinanorth3.tts.speech.azure.cn/cognitiveservices/v1`

Request headers

This table lists required and optional headers for text to speech requests:

Header	Description	Required or optional
`Authorization`	An authorization token preceded by the word `Bearer`. For more information, see Authentication.	Required
`Content-Type`	Specifies the content type for the provided text. Accepted value: `application/ssml+xml`.	Required
`X-Microsoft-OutputFormat`	Specifies the audio output format. For a complete list of accepted values, see Audio outputs.	Required
`User-Agent`	The application name. The provided value must be fewer than 255 characters.	Required

Request body

The body of each POST request is sent as SSML. SSML allows you to choose the voice and language of the synthesized speech that the text-to-speech feature returns. For a complete list of supported voices, see Language and voice support for the Speech service.

Sample request

This HTTP request uses SSML to specify the voice and language. If the body is long and the resulting audio exceeds 10 minutes, the audio is truncated to 10 minutes.

POST /cognitiveservices/v1 HTTP/1.1
X-Microsoft-OutputFormat: riff-24khz-16bit-mono-pcm
Content-Type: application/ssml+xml
Host: YourResourceName.cognitiveservices.azure.cn
Content-Length: <Length>
Authorization: Bearer [Base64 access_token]
User-Agent: <Your application name>

<speak version='1.0' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Male'
    name='en-US-ChristopherNeural'>
        I'm excited to try text to speech!
</voice></speak>

For the Content-Length header, use your own content length. In most cases, this value is calculated automatically.
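
The request above can be assembled in Python along these lines. This is a sketch, not a definitive client: `build_tts_request` and its parameter names are hypothetical, the URL follows the regional endpoint pattern from the Regions and endpoints table, and the access token is assumed to come from an endpoint that the synthesis endpoint accepts.

```python
from typing import Dict, Tuple
from xml.sax.saxutils import escape, quoteattr


def build_tts_request(region: str, access_token: str, text: str,
                      voice: str = "en-US-ChristopherNeural",
                      lang: str = "en-US",
                      output_format: str = "riff-24khz-16bit-mono-pcm",
                      user_agent: str = "YourApplicationName"
                      ) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for a POST to the cognitiveservices/v1 endpoint."""
    url = f"https://{region}.tts.speech.azure.cn/cognitiveservices/v1"
    # Escape the text so characters such as & and < keep the SSML well formed.
    ssml = (
        f"<speak version='1.0' xml:lang={quoteattr(lang)}>"
        f"<voice xml:lang={quoteattr(lang)} name={quoteattr(voice)}>"
        f"{escape(text)}"
        "</voice></speak>"
    )
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": user_agent,
    }
    return url, headers, ssml.encode("utf-8")
```

One way to send the result with the standard library is `urllib.request.urlopen(urllib.request.Request(url, data=body, headers=headers, method="POST"))`, which also sets Content-Length from the body.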

HTTP status codes

The HTTP status code for each response indicates success or common errors:

HTTP status code	Description	Possible reason
200	OK	The request was successful. The response body is an audio file.
400	Bad request	A required parameter is missing, empty, or null. Or, the value passed to either a required or optional parameter is invalid. A common reason is a header that's too long.
401	Unauthorized	The request isn't authorized. Make sure your Speech resource key or token is valid and in the correct region.
415	Unsupported media type	It's possible that the wrong `Content-Type` value was provided. `Content-Type` should be set to `application/ssml+xml`.
429	Too many requests	You exceeded the quota or rate of requests allowed for your resource.
502	Bad gateway	There's a network or server-side problem. This status might also indicate invalid headers.
503	Service Unavailable	There's a server-side problem for various reasons.

If the HTTP status is 200 OK, the body of the response contains an audio file in the requested format. This file can be played as it's transferred, saved to a buffer, or saved to a file.
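
A client might act on the status codes in the table as in the following sketch. The retry policy here is an assumption, not documented service behavior: it treats 429, 502, and 503 as possibly transient and everything else other than 200 as a problem with the request. Because 502 can also indicate invalid headers, a retry isn't guaranteed to help. The names `classify_status` and `backoff_delays` are hypothetical.

```python
from typing import List

# Assumption: throttling and gateway or server errors are worth retrying, while
# 400, 401, and 415 point to a problem in the request that a retry won't fix.
RETRYABLE_STATUS = {429, 502, 503}


def classify_status(status: int) -> str:
    """Map a status code from the table to 'ok', 'retry', or 'fail'."""
    if status == 200:
        return "ok"
    if status in RETRYABLE_STATUS:
        return "retry"
    return "fail"


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Exponential wait times in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```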

Audio outputs

The supported streaming and nonstreaming audio formats are sent in each request as the X-Microsoft-OutputFormat header. Each format incorporates a bit rate and encoding type. The Speech service supports 48-kHz, 24-kHz, 16-kHz, and 8-kHz audio outputs. Each standard voice model is available at 24kHz and high-fidelity 48kHz.

Streaming

amr-wb-16000hz
audio-16khz-16bit-32kbps-mono-opus
audio-16khz-32kbitrate-mono-mp3
audio-16khz-64kbitrate-mono-mp3
audio-16khz-128kbitrate-mono-mp3
audio-24khz-16bit-24kbps-mono-opus
audio-24khz-16bit-48kbps-mono-opus
audio-24khz-48kbitrate-mono-mp3
audio-24khz-96kbitrate-mono-mp3
audio-24khz-160kbitrate-mono-mp3
audio-48khz-96kbitrate-mono-mp3
audio-48khz-192kbitrate-mono-mp3
g722-16khz-64kbps
ogg-16khz-16bit-mono-opus
ogg-24khz-16bit-mono-opus
ogg-48khz-16bit-mono-opus
raw-8khz-8bit-mono-alaw
raw-8khz-8bit-mono-mulaw
raw-8khz-16bit-mono-pcm
raw-16khz-16bit-mono-pcm
raw-16khz-16bit-mono-truesilk
raw-22050hz-16bit-mono-pcm
raw-24khz-16bit-mono-pcm
raw-24khz-16bit-mono-truesilk
raw-44100hz-16bit-mono-pcm
raw-48khz-16bit-mono-pcm
webm-16khz-16bit-mono-opus
webm-24khz-16bit-24kbps-mono-opus
webm-24khz-16bit-mono-opus

NonStreaming

riff-8khz-8bit-mono-alaw
riff-8khz-8bit-mono-mulaw
riff-8khz-16bit-mono-pcm
riff-22050hz-16bit-mono-pcm
riff-24khz-16bit-mono-pcm
riff-44100hz-16bit-mono-pcm
riff-48khz-16bit-mono-pcm

Note

If you select a 48-kHz output format, the high-fidelity 48-kHz voice model is invoked. Sample rates other than 24 kHz and 48 kHz are obtained through upsampling or downsampling during synthesis. For example, 44.1 kHz is downsampled from 48 kHz.

If your selected voice and output format have different bit rates, the audio is resampled as necessary. You can decode the ogg-24khz-16bit-mono-opus format by using the Opus codec.
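
The format names listed in this section follow a visible naming pattern, which the following sketch relies on. It's an assumption drawn from that list rather than a documented contract: every nonstreaming name starts with `riff-`, and each name embeds its sample rate as either `<n>khz` or `<n>hz`. The function names are hypothetical.

```python
import re


def is_streaming(output_format: str) -> bool:
    """In the lists in this section, every nonstreaming format starts with 'riff-'."""
    return not output_format.startswith("riff-")


def sample_rate_hz(output_format: str) -> int:
    """Read the sample rate from a format name, such as '24khz' or '22050hz'."""
    match = re.search(r"(\d+)(k?)hz", output_format)
    if match is None:
        raise ValueError(f"No sample rate found in {output_format!r}")
    value = int(match.group(1))
    # A 'k' before 'hz' means the number is in kilohertz.
    return value * 1000 if match.group(2) else value


print(sample_rate_hz("audio-24khz-16bit-48kbps-mono-opus"))  # 24000
print(is_streaming("riff-24khz-16bit-mono-pcm"))  # False
```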

Next steps

Create an Azure account

Last updated on 2026-06-09
