What is text-to-speech?


TLS 1.2 is now enforced for all HTTP requests to this service.
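
For clients that build their own HTTPS connections, the requirement above can be mirrored on the client side. The following is a minimal sketch using Python's standard `ssl` module; the helper name `make_tls12_context` is hypothetical and not part of the Speech service or SDK.

```python
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Return a client-side SSL context that refuses anything older than TLS 1.2."""
    # Start from the platform defaults (certificate verification stays enabled).
    context = ssl.create_default_context()
    # Reject TLS 1.0 and 1.1 handshakes; newer versions remain allowed.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

A context like this can be passed to any standard-library HTTPS client that accepts a `context` argument, such as `urllib.request.urlopen`.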

Text-to-speech from the Speech service enables your applications, tools, or devices to convert text into human-like synthesized speech. You can choose from standard and neural voices. More than 40 standard voices are available in more than 10 languages and locales, and 5 neural voices are available in a select number of languages and locales. For a full list of supported voices, languages, and locales, see supported languages.

Core features

  • Speech synthesis - Use the Speech SDK or REST API to convert text to speech using standard and neural voices.

  • Asynchronous synthesis of long audio - Use the Long Audio API to asynchronously synthesize text-to-speech files longer than 10 minutes (for example, audio books or lectures). Unlike synthesis performed using the Speech SDK or the text-to-speech REST API, responses aren't returned in real time. Instead, you send requests asynchronously, poll for the responses, and download the synthesized audio when the service makes it available. Only neural voices are supported.

  • Standard voices - Created using Statistical Parametric Synthesis and/or Concatenation Synthesis techniques. These voices are highly intelligible and sound natural. You can easily enable your applications to speak in more than 10 languages, with a wide range of voice options. These voices provide high pronunciation accuracy, including support for abbreviations, acronym expansions, date/time interpretations, polyphones, and more. For a full list of standard voices, see supported languages.

  • Neural voices - Deep neural networks are used to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis are performed simultaneously, which results in more fluid and natural-sounding output. Neural voices can be used to make interactions with chatbots and voice assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when you interact with AI systems. For a full list of neural voices, see supported languages.

  • Speech Synthesis Markup Language (SSML) - An XML-based markup language used to customize text-to-speech output. With SSML, you can adjust pitch, add pauses, improve pronunciation, speed up or slow down the speaking rate, increase or decrease volume, and attribute multiple voices to a single document. See SSML.
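
To make the SSML adjustments described above concrete, here is a minimal sketch that assembles a small SSML document with a pause and a prosody (rate and pitch) adjustment. The function names, the default values, and the voice name used in any call are illustrative assumptions; the element names (`speak`, `voice`, `break`, `prosody`) come from the W3C SSML specification, and the voice names actually accepted by the service are listed under supported languages.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice, lang="en-US", rate="medium", pitch="medium", pause_ms=0):
    """Build an SSML string: optional leading pause, then text with rate/pitch applied."""
    # Only emit a <break> element when a pause was requested.
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms else ""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"{pause}"
        # escape() protects characters such as & and < in the spoken text.
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>{escape(text)}</prosody>"
        f"</voice></speak>"
    )


def is_well_formed(ssml):
    """Return True if the string parses as XML with a <speak> root element."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return root.tag == f"{{{SSML_NS}}}speak"
```

Building the markup through a helper keeps the escaping in one place, so text containing `&` or `<` does not produce an invalid document.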

Get started

The text-to-speech service is available via the Speech SDK. Quickstarts for several common scenarios are available in various languages and platforms.

If you prefer, you can access the text-to-speech service via REST.
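
As a rough illustration of REST access, the sketch below prepares an HTTPS POST request that carries an SSML body. It is a sketch under stated assumptions, not the definitive request format: the endpoint URL is passed in by the caller because it depends on your region and cloud, and the header names and the output format value are assumptions that should be checked against the REST API reference. The function names `build_tts_request` and `send_tts_request` are hypothetical.

```python
import ssl
import urllib.request


def build_tts_request(endpoint, subscription_key, ssml,
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Prepare (but do not send) a POST request with an SSML body."""
    headers = {
        # Assumed header names; confirm them in the REST API reference.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "tts-sample",
    }
    return urllib.request.Request(
        endpoint, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


def send_tts_request(request, timeout=30):
    """Send the request over TLS 1.2 or later and return the audio bytes."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    with urllib.request.urlopen(request, context=context, timeout=timeout) as response:
        return response.read()
```

Separating request construction from sending makes it possible to inspect the headers and body before any network call is made.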

Sample code

Sample code for text-to-speech is available on GitHub. These samples cover text-to-speech conversion in the most popular programming languages.

Pricing note

When you use the text-to-speech service, you are billed for each character that is converted to speech, including punctuation. While the SSML document itself is not billable, optional elements that adjust how the text is converted to speech, like phonemes and pitch, are counted as billable characters. Here's a list of what's billable:

  • Text passed to the text-to-speech service in the SSML body of the request
  • All markup within the text field of the request body in the SSML format, except for <speak> and <voice> tags
  • Letters, punctuation, spaces, tabs, markup, and all white-space characters
  • Every code point defined in Unicode

For detailed information, see Pricing.


Each Chinese, Japanese, and Korean language character is counted as two characters for billing.
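
The billing rules in this section can be approximated with a small estimator. The sketch below removes `<speak>` and `<voice>` tags, counts every remaining code point (including other markup and white space) as one character, and counts Chinese, Japanese, and Korean characters as two. The Unicode ranges used to decide what counts as such a character are an assumption made for illustration, the tag-matching pattern does not handle a `>` inside an attribute value, and the function names are hypothetical; treat the result as an estimate, not as the service's actual metering.

```python
import re

# Opening and closing <speak> / <voice> tags, which the rules above exclude.
_UNBILLED_TAGS = re.compile(r"</?(?:speak|voice)\b[^>]*>")

# Assumed code point ranges for Chinese, Japanese, and Korean characters.
_CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the character falls in one of the assumed CJK ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in _CJK_RANGES)


def estimate_billable_characters(ssml):
    """Estimate billable characters for an SSML request body."""
    # Drop only the <speak> and <voice> tags; all other markup stays billable.
    billed = _UNBILLED_TAGS.sub("", ssml)
    # Python strings iterate by code point, so each code point is counted once,
    # except CJK characters, which count twice.
    return sum(2 if is_cjk(ch) else 1 for ch in billed)
```

Under these assumptions, a body whose only spoken text is `Hi!` is estimated at 3 characters, while adding a `<break/>` element adds the 8 characters of that tag.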

Reference docs

Next steps