Text to speech FAQ

This article answers commonly asked questions about the text to speech (TTS) capability. If you can't find answers to your questions here, check out other support options.

How does the billing work for text to speech?

Text to speech usage is billed per character. Check the definition of billable characters in the pricing note.

What is the rate limit for the text to speech synthesis requests?

The text to speech synthesis capacity scales automatically as it receives more requests, but a default rate limit is set per Speech resource. The limit is adjustable with business justification, and no extra charges are incurred for a rate limit increase. For more details, see Speech service quotas and limits.

How can I reduce the latency for my voice app?

We provide several tips to help you lower latency and deliver the best performance to your users. See Lower speech synthesis latency using Speech SDK.
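One of the most effective techniques is to stream the synthesized audio and start consuming it before synthesis finishes, rather than waiting for the complete result. The following minimal Python sketch assumes the Speech SDK (azure-cognitiveservices-speech) with placeholder key, region, and voice values:

```python
# A minimal sketch of streaming synthesis with the Speech SDK for Python.
# The subscription key, region, and voice name below are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# audio_config=None keeps the audio in memory instead of routing it to a device,
# so the app can consume the stream as soon as the first chunks arrive.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

# start_speaking_text_async returns once the first audio chunk is available,
# which lowers perceived latency compared with waiting for the whole result.
result = synthesizer.start_speaking_text_async("Hello, welcome to the demo.").get()

# Read the audio incrementally from the result stream.
audio_stream = speechsdk.AudioDataStream(result)
audio_buffer = bytes(16000)
filled_size = audio_stream.read_data(audio_buffer)
while filled_size > 0:
    # Feed each chunk to your player or network transport here.
    print(f"Received {filled_size} bytes of audio.")
    filled_size = audio_stream.read_data(audio_buffer)
```

Starting playback on the first chunk, instead of after the final byte, is what makes the response feel immediate to the user.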

What output audio formats does text to speech support?

Azure AI text to speech supports a variety of streaming and non-streaming audio formats at the commonly used sampling rates. All prebuilt neural voices are created to produce high-fidelity audio output at 48 kHz and 24 kHz, and the audio can be resampled to other rates as needed. See Audio outputs.
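As a brief sketch (assuming the Speech SDK for Python and placeholder credentials), you can select a specific output format on the speech configuration before synthesis:

```python
# Selecting an output audio format before synthesis (placeholder key/region).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")

# Request 24 kHz, 16-bit mono PCM in a RIFF (WAV) container; other members of
# SpeechSynthesisOutputFormat cover MP3, Ogg/Opus, and additional rates.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm)

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = synthesizer.speak_text_async("This sample is rendered as 24 kHz WAV.").get()

# result.audio_data holds the synthesized bytes in the requested format.
with open("output.wav", "wb") as f:
    f.write(result.audio_data)
```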

Can the voice be customized to stress specific words?

Adjusting the emphasis is supported for some voices depending on the locale. See the emphasis tag.
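The sketch below (assuming the Speech SDK for Python, placeholder credentials, and an illustrative voice that supports emphasis in your locale) shows the tag used in SSML:

```python
# A sketch of using the emphasis tag via SSML. The voice name is a placeholder;
# emphasis is only honored by voices that support it in your locale.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-GuyNeural">
    I can help you <emphasis level="strong">right now</emphasis> if you like.
  </voice>
</speak>
"""

# speak_ssml_async renders the SSML; the wrapped words are stressed more strongly.
# Supported emphasis levels are reduced, none, moderate, and strong.
result = synthesizer.speak_ssml_async(ssml).get()
```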

Can we have multiple strengths for each emotion, such as sad and slightly sad?

Adjusting the style degree is supported for some voices depending on the locale. See the mstts:express-as tag.
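For instance, the sketch below (again with placeholder credentials and an illustrative voice; style and style degree support vary by voice and locale) scales a sad style up with the styledegree attribute:

```python
# A sketch of adjusting style intensity with mstts:express-as and styledegree.
# The voice and style are illustrative; support depends on the voice and locale.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSpeechKey", region="YourRegion")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="sad" styledegree="2">
      I'm sorry to hear that your order was delayed.
    </mstts:express-as>
  </voice>
</speak>
"""

# styledegree scales the style intensity (roughly 0.01 to 2, default 1), so values
# below 1 soften the emotion and values above 1 intensify it.
result = synthesizer.speak_ssml_async(ssml).get()
```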

Is there a mapping between Viseme IDs and mouth shape?