What is text-to-speech?


TLS 1.2 is now enforced for all HTTP requests to this service.

In this overview, you learn about the benefits and capabilities of the text-to-speech service, which enables your applications, tools, or devices to convert text into human-like synthesized speech. Use human-like neural voices, or create a custom voice unique to your product or brand. For a full list of supported voices, languages, and locales, see supported languages.

This documentation contains the following article types:

  • Quickstarts are getting-started instructions to guide you through making requests to the service.
  • How-to guides contain instructions for using the service in more specific or customized ways.
  • Concepts provide in-depth explanations of the service functionality and features.
  • Tutorials are longer guides that show you how to use the service as a component in broader business solutions.

Core features

  • Speech synthesis - Use the Speech SDK or REST API to convert text-to-speech using standard, neural, or custom voices.
  • Neural voices - Deep neural networks are used to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis are performed simultaneously, which results in more fluid and natural-sounding outputs. Neural voices can be used to make interactions with chatbots and voice assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems. With the human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when you interact with AI systems. For a full list of neural voices, see supported languages.

  • Adjust speaking styles with SSML - Speech Synthesis Markup Language (SSML) is an XML-based markup language used to customize speech-to-text outputs. With SSML, you can adjust pitch, add pauses, improve pronunciation, change speaking rate, adjust volume, and attribute multiple voices to a single document. With the multi-lingual voices, you can also adjust the speaking languages via SSML.See the how-to for adjusting speaking styles.

  • Visemes - Visemes are the key poses in observed speech, including the position of the lips, jaw and tongue when producing a particular phoneme. Visemes have a strong correlation with voices and phonemes. Using viseme events in Speech SDK, you can generate facial animation data, which can be used to animate faces in lip-reading communication, education, entertainment, and customer service. Viseme is currently only supported for the en-US English (United States) neural voices.

Get started

See the quickstart to get started with text-to-speech. The text-to-speech service is available via the Speech SDK, the REST API, and the Speech CLI

Sample code

Sample code for text-to-speech is available on GitHub. These samples cover text-to-speech conversion in most popular programming languages.


In addition to neural voices, you can create and fine-tune custom voices unique to your product or brand. All it takes to get started are a handful of audio files and the associated transcriptions.

Pricing note

When using the text-to-speech service, you are billed for each character that is converted to speech, including punctuation. While the SSML document itself is not billable, optional elements that are used to adjust how the text is converted to speech, like phonemes and pitch, are counted as billable characters. Here's a list of what's billable:

  • Text passed to the text-to-speech service in the SSML body of the request
  • All markup within the text field of the request body in the SSML format, except for <speak> and <voice> tags
  • Letters, punctuation, spaces, tabs, markup, and all white-space characters
  • Every code point defined in Unicode

For detailed information, see Pricing.


Each Chinese, Japanese, and Korean language character is counted as two characters for billing.

Reference docs

Next steps