Azure Speech enhances various gaming scenarios, both in-game and out-of-game.
Speech features to consider for flexible and interactive game experiences include:
- Synthesize audio from text or display text from audio to make conversations accessible to all players.
- Improve accessibility for players who can't read text in a particular language, including young players who don't yet read or write. Players can listen to storylines and instructions in their preferred language.
- Create game avatars and nonplayable characters (NPCs) that can initiate or participate in conversations during gameplay.
- Use standard voices for highly natural, out-of-the-box speech across a large portfolio of languages and voices.
- Prototype game dialogue to reduce the time and cost of production and get the game to market sooner. You can rapidly swap lines of dialogue and listen to variations in real time to iterate on game content.
You can use the Speech SDK or Speech CLI for real-time, low-latency speech to text, text to speech, language identification, and speech translation. You can also use the Batch transcription API to transcribe prerecorded speech to text.
For information about locale and regional availability, see Language and voice support and Region support.
Text to speech
Convert text messages to audio using text to speech for scenarios such as game dialogue prototyping, greater accessibility, or nonplayable character (NPC) voices. Text to speech includes standard voices, which provide highly natural out-of-the-box speech across a large portfolio of languages and voices.
Consider the following capabilities when you enable text to speech in your game:
- Voices and languages - A large portfolio of locales and voices is supported. You can also specify multiple languages for text to speech output.
- Emotional styles - Apply emotional tones such as cheerful, angry, sad, excited, hopeful, friendly, unfriendly, terrified, shouting, and whispering. You can adjust the speaking style, style degree, and role at the sentence level.
- Visemes - You can use visemes during real-time synthesis to control the movement of 2D and 3D avatar models so that mouth movements match synthetic speech precisely. For more information, see Get facial position with viseme.
- SSML fine-tuning - With Speech Synthesis Markup Language (SSML), you can customize text to speech output with richer voice tuning options. For more information, see Speech Synthesis Markup Language (SSML) overview.
- Audio outputs - Each standard voice model is available at 24 kHz and high-fidelity 48 kHz. If you select a 48-kHz output format, the high-fidelity voice model at 48 kHz is invoked accordingly. Other sample rates can be obtained through upsampling or downsampling during synthesis. For example, 44.1 kHz is downsampled from 48 kHz. Each audio format incorporates a bitrate and encoding type. For more information, see the supported audio formats. For more information on 48-kHz high-quality voices, see this blog post.
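The voice selection, emotional styles, and SSML tuning above all come together in a single SSML payload. The following sketch builds one in plain Python; the voice name `en-US-JennyNeural` and the `cheerful` style are illustrative choices, not requirements:

```python
# Build an SSML payload that selects a voice, applies an emotional style,
# and sets a style degree. The voice and style names are illustrative;
# substitute any supported voice and style from the Speech documentation.

def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               style: str = "cheerful", degree: str = "1.5") -> str:
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}" styledegree="{degree}">'
        f"{text}"
        "</mstts:express-as></voice></speak>"
    )

ssml = build_ssml("Welcome back, adventurer!")
```

Pass the resulting string to the synthesizer's SSML entry point (for example, `speak_ssml_async` in the Speech SDK) instead of plain text.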
For an example, see the text to speech quickstart.
Speech to text
You can use speech to text to display text from the spoken audio in your game. For an example, see the Speech to text quickstart.
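For a sense of the underlying call, short audio clips can also be transcribed through the speech to text REST endpoint. This sketch builds the request with the standard library; the region, the `SPEECH_KEY` environment-variable name, and the `player_clip.wav` file are illustrative:

```python
import json
import os
import urllib.request

def build_stt_request(region: str, key: str, language: str = "en-US"):
    # The short-audio speech to text REST endpoint; the recording is posted
    # as the request body in WAV (16-kHz, 16-bit, mono PCM) format.
    url = (f"https://{region}.stt.speech.microsoft.com/speech/recognition/"
           f"conversation/cognitiveservices/v1?language={language}")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    }
    return url, headers

url, headers = build_stt_request("westus",
                                 os.environ.get("SPEECH_KEY", "<your-key>"))

# Only send the request when a real key is configured.
if "SPEECH_KEY" in os.environ:
    with open("player_clip.wav", "rb") as f:  # hypothetical recording
        req = urllib.request.Request(url, data=f.read(),
                                     headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    print(result.get("DisplayText"))  # recognized text to show in-game
```

For longer prerecorded audio, use the Batch transcription API instead.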
Language identification
With language identification, you can detect the language of the chat string submitted by the player.
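With the Speech SDK (`pip install azure-cognitiveservices-speech`), language identification is configured through an auto-detect source language list. This sketch assumes a small set of candidate languages; the SDK import is kept inside the function so the sketch parses even without the package installed:

```python
def detect_spoken_language(audio_file: str, key: str, region: str,
                           candidates=("en-US", "de-DE", "ja-JP")):
    # Requires: pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    # Candidate languages to choose among; illustrative values.
    auto_detect = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
        languages=list(candidates))
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        auto_detect_source_language_config=auto_detect,
        audio_config=audio_config)

    result = recognizer.recognize_once()
    detected = speechsdk.AutoDetectSourceLanguageResult(result).language
    return detected, result.text
```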
Speech translation
Players in the same game session often speak different languages and might appreciate receiving both the original message and its translation. You can use speech translation to translate text between languages so players across the world can communicate in their native language.
For an example, see the Speech translation quickstart.
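In the Speech SDK, speech translation pairs one recognition language with one or more target languages. The following sketch (SDK import kept local; language choices illustrative) returns both the recognized text and its translations, so a game can show players the original message alongside the translated one:

```python
def translate_speech(audio_file: str, key: str, region: str,
                     source: str = "en-US", targets=("de", "fr")):
    # Requires: pip install azure-cognitiveservices-speech
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    config.speech_recognition_language = source
    for lang in targets:
        config.add_target_language(lang)

    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config, audio_config=audio_config)

    result = recognizer.recognize_once()
    # result.translations maps each target language to its translation.
    return result.text, dict(result.translations)
```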
Note
In addition to the Speech service, you can use the Translator service. To perform real-time text translation between supported source and target languages, see Text translation.
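For the text-only case, Translator exposes a `/translate` REST operation (api-version 3.0). This sketch builds the request with the standard library and sends it only when a key is configured; the `TRANSLATOR_KEY` environment-variable name and the region are placeholders:

```python
import json
import os
import urllib.request

def build_translate_request(text: str, to_langs, key: str, region: str):
    # Translator v3 /translate: target languages go in the query string,
    # and the text goes in a JSON array body.
    query = "&".join(f"to={lang}" for lang in to_langs)
    url = ("https://api.cognitive.microsofttranslator.com/translate"
           f"?api-version=3.0&{query}")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
    }
    body = json.dumps([{"Text": text}]).encode("utf-8")
    return url, headers, body

url, headers, body = build_translate_request(
    "Nice shot!", ["de", "ja"],
    key=os.environ.get("TRANSLATOR_KEY", "<your-key>"), region="westus")

# Only send the request when a real key is configured.
if "TRANSLATOR_KEY" in os.environ:
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        for item in json.load(resp)[0]["translations"]:
            print(item["to"], item["text"])
```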