What is text-to-speech?

Important

TLS 1.2 is now enforced for all HTTP requests to this service.
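
As a minimal sketch of what this requirement can mean on the client side, the following Python snippet builds an SSL context that refuses anything older than TLS 1.2. The helper name `make_tls12_context` is hypothetical; only the standard-library `ssl` module is used, and recent Python versions may already default to this minimum.

```python
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Return a client SSL context that rejects protocols older than TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Refuse to negotiate TLS 1.0 / 1.1 (requires Python 3.7+).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = make_tls12_context()
    print(ctx.minimum_version)
```

A context like this can be passed to `urllib.request.urlopen(..., context=ctx)` when calling an HTTPS endpoint.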

In this overview, you learn about the benefits and capabilities of the text-to-speech service, which enables your applications, tools, or devices to convert text into human-like synthesized speech. Choose from standard and neural voices, or create a custom voice unique to your product or brand. More than 75 standard voices are available in more than 45 languages and locales, and 5 neural voices are available in a select number of languages and locales. For a full list of supported voices, languages, and locales, see supported languages.

This documentation contains the following article types:

  • Quickstarts are getting-started instructions that guide you through making requests to the service.
  • How-to guides contain instructions for using the service in more specific or customized ways.
  • Concepts provide in-depth explanations of the service functionality and features.
  • Tutorials are longer guides that show you how to use the service as a component in broader business solutions.

Core features

  • Speech synthesis - Use the Speech SDK or REST API to convert text to speech using standard, neural, or custom voices.
  • Standard voices - Created using Statistical Parametric Synthesis and/or Concatenation Synthesis techniques. These voices are highly intelligible and sound natural. You can easily enable your applications to speak in more than 45 languages, with a wide range of voice options. These voices provide high pronunciation accuracy, including support for abbreviations, acronym expansions, date/time interpretations, polyphones, and more. For a full list of standard voices, see supported languages.

  • Neural voices - Deep neural networks are used to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis are performed simultaneously, which results in more fluid and natural-sounding output. Neural voices can be used to make interactions with chatbots and voice assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when you interact with AI systems. For a full list of neural voices, see supported languages.

  • Adjust speaking styles with SSML - Speech Synthesis Markup Language (SSML) is an XML-based markup language used to customize text-to-speech output. With SSML, you can adjust pitch, add pauses, improve pronunciation, speed up or slow down the speaking rate, increase or decrease volume, and attribute multiple voices to a single document. See the how-to for adjusting speaking styles.

  • Visemes - Visemes are the key poses in observed speech, including the position of the lips, jaw, and tongue when producing a particular phoneme. Visemes have a strong correlation with voices and phonemes. Using viseme events in the Speech SDK, you can generate facial animation data, which can be used to animate faces in lip-reading communication, education, entertainment, and customer service.

Note

Currently, visemes work only with the en-US-AriaNeural voice.
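
To make the SSML adjustments described in the list above more concrete, here is a minimal sketch that assembles a small SSML document in Python. The helper name `build_ssml` and its default values are hypothetical; the `speak`, `voice`, `break`, and `prosody` elements come from the W3C SSML vocabulary, and the service may accept additional or different attributes than the ones shown.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-AriaNeural", rate="-10%", pitch="+5%", pause_ms=500):
    """Build an SSML string: a leading pause, then text with adjusted rate and pitch."""
    pause = f"{pause_ms}ms"
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        f"<break time={quoteattr(pause)}/>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"  # escape &, <, > so the document stays well-formed
        "</prosody></voice></speak>"
    )


if __name__ == "__main__":
    ssml = build_ssml("Welcome back. Your order has shipped.")
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
    print(ssml)
```

Escaping the input text matters because characters such as `&` would otherwise make the XML invalid before it ever reaches the service.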

Get started

See the quickstart to get started with text-to-speech. The text-to-speech service is available via the Speech SDK, the REST API, and the Speech CLI.
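
For orientation, a REST call might be prepared roughly as follows. This is a sketch under stated assumptions, not the official client: the function name `build_tts_request` is hypothetical, the endpoint URL is passed in because it differs by region and cloud, and the header names (`Ocp-Apim-Subscription-Key`, `X-Microsoft-OutputFormat`) and the output format value reflect the public REST API as commonly documented and should be verified against the REST API reference.

```python
import urllib.request


def build_tts_request(endpoint, subscription_key, ssml,
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Prepare (but do not send) an HTTP POST carrying an SSML body."""
    headers = {
        # Assumed header names; check them against the REST API reference.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "tts-overview-sketch",
    }
    return urllib.request.Request(
        endpoint, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


def synthesize(request):
    """Send the prepared request and return the audio bytes from the response."""
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder endpoint and key; replace with the values for your resource.
    req = build_tts_request(
        "https://example.invalid/cognitiveservices/v1",
        "YOUR_KEY",
        "<speak version='1.0' xml:lang='en-US'>"
        "<voice name='en-US-AriaNeural'>Hello</voice></speak>",
    )
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the sketch inspectable without credentials; `synthesize` is only useful once a real endpoint and key are supplied.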

Sample code

Sample code for text-to-speech is available on GitHub. These samples cover text-to-speech conversion in the most popular programming languages.

Customization

In addition to standard and neural voices, you can create and fine-tune custom voices unique to your product or brand. All it takes to get started are a handful of audio files and the associated transcriptions.

Pricing note

When using the text-to-speech service, you are billed for each character that is converted to speech, including punctuation. While the SSML document itself is not billable, optional elements that are used to adjust how the text is converted to speech, like phonemes and pitch, are counted as billable characters. Here's a list of what's billable:

  • Text passed to the text-to-speech service in the SSML body of the request
  • All markup within the text field of the request body in the SSML format, except for <speak> and <voice> tags
  • Letters, punctuation, spaces, tabs, markup, and all white-space characters
  • Every code point defined in Unicode

For detailed information, see Pricing.

Important

Each Chinese, Japanese, and Korean language character is counted as two characters for billing.
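
The billing rules above can be approximated with a short estimator. This is only a rough sketch for budgeting, not the service's actual metering logic: the function name `estimate_billable_chars` is hypothetical, the tag-stripping regular expression is a simplification, and the Unicode ranges used to detect Chinese, Japanese, and Korean characters are an assumption, since the text above does not define them.

```python
import re

# <speak> and <voice> tags (opening or closing) are excluded from billing.
# Simplification: assumes attribute values never contain a '>' character.
_UNBILLED_TAGS = re.compile(r"</?(?:speak|voice)\b[^>]*>")


def _is_cjk(ch):
    """Assumed CJK test; these ranges are approximate, not an official definition."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def estimate_billable_chars(ssml):
    """Estimate billable characters for the SSML body of one request."""
    billed = _UNBILLED_TAGS.sub("", ssml)
    # Every remaining code point counts once (text, whitespace, other markup);
    # characters in the assumed CJK ranges count twice.
    return sum(2 if _is_cjk(ch) else 1 for ch in billed)


if __name__ == "__main__":
    sample = '<speak version="1.0"><voice name="en-US-AriaNeural">Hi!</voice></speak>'
    print(estimate_billable_chars(sample))
```

Note that markup other than `<speak>` and `<voice>`, such as a `<break>` or `<prosody>` element, is counted character by character in this estimate, in line with the list above.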
