Improve synthesis with Speech Synthesis Markup Language (SSML)

Speech Synthesis Markup Language (SSML) is an XML-based markup language that lets developers specify how input text is converted into synthesized speech using the text-to-speech service. Compared to plain text, SSML lets developers fine-tune the pitch, pronunciation, speaking rate, volume, and other attributes of the text-to-speech output. Normal punctuation behavior, such as pausing after a period or using the correct intonation when a sentence ends with a question mark, is handled automatically.

The Speech service implementation of SSML is based on the World Wide Web Consortium's Speech Synthesis Markup Language Version 1.0.

Important

Chinese, Japanese, and Korean characters count as two characters for billing. For more information, see Pricing.

Standard, neural, and custom voices

Choose from standard and neural voices, or create your own custom voice unique to your product or brand. More than 40 standard voices are available in more than 10 languages and locales, and 5 neural voices are available in 4 languages and locales. For a complete list of supported languages, locales, and voices (neural and standard), see Language support.

To learn more about standard, neural, and custom voices, see Text-to-speech overview.

Special characters

When using SSML, keep in mind that special characters, such as quotation marks, apostrophes, and brackets, must be escaped. For more information, see Extensible Markup Language (XML) 1.0: Appendix D.
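As a rough illustration of the escaping rule above, the following Python sketch replaces XML special characters with entity references before the text is placed in an SSML document. The helper name escape_for_ssml is hypothetical and not part of any Speech SDK; it relies only on the standard library's xml.sax.saxutils.escape.

```python
from xml.sax.saxutils import escape

# escape() handles &, <, and > by default; quotes and apostrophes
# are added here as extra replacements.
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_for_ssml(text):
    """Return text with XML special characters replaced by entity references."""
    return escape(text, _QUOTES)


print(escape_for_ssml('Tom & Jerry say "5 < 6"'))
# prints: Tom &amp; Jerry say &quot;5 &lt; 6&quot;
```

Escaping the text once, just before it is embedded, avoids double-escaping ampersands that are already part of an entity reference.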

Supported SSML elements

Each SSML document is created with SSML elements (or tags). These elements are used to adjust pitch, prosody, volume, and more. The following sections detail how each element is used, and whether an element is required or optional.

Important

Don't forget to use double quotes around attribute values. Well-formed XML requires attribute values to be enclosed in quotation marks. For example, <prosody volume="90"> is a well-formed, valid element, but <prosody volume=90> is not. SSML may not recognize attribute values that are not in quotes.

Create an SSML document

speak is the root element, and is required for all SSML documents. The speak element contains important information, such as the version, the language, and the markup vocabulary definition.

Syntax

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="string"></speak>

Attributes

Attribute | Description | Required / Optional
version | Indicates the version of the SSML specification used to interpret the document markup. The current version is 1.0. | Required
xml:lang | Specifies the language of the root document. The value may contain a lowercase, two-letter language code (for example, en), or the language code and an uppercase country/region code (for example, en-US). | Required
xmlns | Specifies the URI of the document that defines the markup vocabulary (the element types and attribute names) of the SSML document. The current URI is http://www.w3.org/2001/10/synthesis. | Required
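The three required attributes can be checked before a document is sent for synthesis. The sketch below is one possible client-side check using Python's standard xml.etree.ElementTree; the function name check_speak_root and its messages are hypothetical, and the check does not replace validation done by the service.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_speak_root(ssml):
    """Return a list of problems found on the root element (empty if none)."""
    problems = []
    root = ET.fromstring(ssml)  # raises ParseError if not well-formed
    if root.tag != "{%s}speak" % SSML_NS:
        problems.append("root element must be speak in the SSML namespace")
    if root.get("version") != "1.0":
        problems.append("version must be 1.0")
    if not root.get(XML_LANG):
        problems.append("xml:lang is required")
    return problems


doc = ('<speak version="1.0" '
       'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"></speak>')
print(check_speak_root(doc))  # prints: []
```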

Choose a voice for text-to-speech

The voice element is required. It specifies the voice that is used for text-to-speech.

Syntax

<voice name="string">
    This text will get converted into synthesized speech.
</voice>

Attributes

Attribute | Description | Required / Optional
name | Identifies the voice used for text-to-speech output. For a complete list of supported voices, see Language support. | Required

Example

Note

This example uses the en-US-AriaRUS voice. For a complete list of supported voices, see Language support.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        This is the text that is spoken.
    </voice>
</speak>

Use multiple voices

Within the speak element, you can specify multiple voices for text-to-speech output. These voices can be in different languages. For each voice, the text must be wrapped in a voice element.

Attributes

Attribute | Description | Required / Optional
name | Identifies the voice used for text-to-speech output. For a complete list of supported voices, see Language support. | Required

Important

Multiple voices are incompatible with the word boundary feature. The word boundary feature must be disabled in order to use multiple voices.

Disable word boundary

Depending on the Speech SDK language, you set the "SpeechServiceResponse_Synthesis_WordBoundaryEnabled" property to false on an instance of the SpeechConfig object.

For more information, see SetProperty.

speechConfig.SetProperty(
    "SpeechServiceResponse_Synthesis_WordBoundaryEnabled", "false");

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        Good morning!
    </voice>
    <voice name="en-US-Guy24kRUS">
        Good morning to you too Aria!
    </voice>
</speak>
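When the speakers and their lines come from data rather than a hand-written file, a document like the one above can be assembled programmatically. The following Python sketch is a minimal illustration under that assumption; build_dialogue is a hypothetical helper, not a Speech SDK function, and it wraps each piece of text in its own voice element as described above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_dialogue(turns, lang="en-US"):
    """Build an SSML string from (voice_name, text) pairs."""
    voices = "".join(
        "<voice name=%s>%s</voice>" % (quoteattr(name), escape(text))
        for name, text in turns
    )
    return '<speak version="1.0" xmlns="%s" xml:lang=%s>%s</speak>' % (
        SSML_NS, quoteattr(lang), voices)


doc = build_dialogue([
    ("en-US-AriaRUS", "Good morning!"),
    ("en-US-Guy24kRUS", "Good morning to you too Aria!"),
])
root = ET.fromstring(doc)  # parsing succeeds only if the result is well-formed
print([voice.get("name") for voice in root])
```

Escaping the text and quoting the attribute values in one place keeps the output well-formed even when a line contains characters such as & or <.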

Adjust speaking styles

Important

The adjustment of speaking styles only works with neural voices.

By default, the text-to-speech service synthesizes text using a neutral speaking style for both standard and neural voices. With neural voices, you can use the mstts:express-as element to adjust the speaking style to express different emotions, like cheerfulness, empathy, and calm, or to optimize the voice for different scenarios, like customer service, newscasting, and voice assistants. This is an optional element unique to the Speech service.

Currently, speaking style adjustments are supported for these neural voices:

  • en-US-AriaNeural
  • zh-CN-XiaoxiaoNeural
  • zh-CN-YunyangNeural

Changes are applied at the sentence level, and styles vary by voice. If a style isn't supported, the service returns speech in the default neutral speaking style.

Syntax

<mstts:express-as style="string"></mstts:express-as>

Attributes

Attribute | Description | Required / Optional
style | Specifies the speaking style. Currently, speaking styles are voice-specific. If an invalid value is provided, this element is ignored. | Required if adjusting the speaking style for a neural voice. If using mstts:express-as, then style must be provided.

Use this table to determine which speaking styles are supported for each neural voice.

Voice | Style | Description
en-US-AriaNeural | style="newscast" | Expresses a formal and professional tone for narrating news
en-US-AriaNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support
en-US-AriaNeural | style="chat" | Expresses a casual and relaxed tone
en-US-AriaNeural | style="cheerful" | Expresses a positive and happy tone
en-US-AriaNeural | style="empathetic" | Expresses a sense of caring and understanding
zh-CN-XiaoxiaoNeural | style="newscast" | Expresses a formal and professional tone for narrating news
zh-CN-XiaoxiaoNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support
zh-CN-XiaoxiaoNeural | style="assistant" | Expresses a warm and relaxed tone for digital assistants
zh-CN-XiaoxiaoNeural | style="lyrical" | Expresses emotions in a melodic and sentimental way
zh-CN-YunyangNeural | style="customerservice" | Expresses a friendly and helpful tone for customer support

Example

This SSML snippet illustrates how the <mstts:express-as> element is used to change the speaking style to cheerful.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <mstts:express-as style="cheerful">
            That'd be just amazing!
        </mstts:express-as>
    </voice>
</speak>
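Because styles are voice-specific, it can help to check a requested style against the supported list before emitting the element. The Python sketch below is one way to do that; the SUPPORTED_STYLES mapping is copied from the style table in this section, while the express_as helper and its fall-back-to-plain-text behavior are assumptions made for illustration.

```python
from xml.sax.saxutils import escape

# Styles per voice, copied from the style table in this section.
SUPPORTED_STYLES = {
    "en-US-AriaNeural": {"newscast", "customerservice", "chat",
                         "cheerful", "empathetic"},
    "zh-CN-XiaoxiaoNeural": {"newscast", "customerservice",
                             "assistant", "lyrical"},
    "zh-CN-YunyangNeural": {"customerservice"},
}


def express_as(voice, style, text):
    """Wrap text in mstts:express-as only if the voice lists that style."""
    body = escape(text)
    if style in SUPPORTED_STYLES.get(voice, set()):
        return '<mstts:express-as style="%s">%s</mstts:express-as>' % (style, body)
    # Unsupported combination: emit plain text (neutral style).
    return body


print(express_as("en-US-AriaNeural", "cheerful", "That'd be just amazing!"))
print(express_as("zh-CN-YunyangNeural", "cheerful", "Hello"))
```

The returned fragment still needs to be placed inside a voice element in a speak document that declares the mstts namespace, as in the example above.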

Add or remove a break/pause

Use the break element to insert pauses (or breaks) between words, or to prevent pauses that the text-to-speech service would otherwise add automatically.

Note

Use this element to override the default behavior of text-to-speech (TTS) for a word or phrase if the synthesized speech for that word or phrase sounds unnatural. Set strength to none to prevent a prosodic break that the text-to-speech service would otherwise insert automatically.

Syntax

<break strength="string" />
<break time="string" />

Attributes

Attribute | Description | Required / Optional
strength | Specifies the relative duration of a pause using one of the following values: none, x-weak, weak, medium (default), strong, or x-strong. | Optional
time | Specifies the absolute duration of a pause in seconds or milliseconds. Examples of valid values are 2s and 500ms. | Optional

Strength | Pause duration
None, or if no value provided | 0 ms
x-weak | 250 ms
weak | 500 ms
medium | 750 ms
strong | 1000 ms
x-strong | 1250 ms

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        Welcome to Microsoft Cognitive Services <break time="100ms" /> Text-to-Speech API.
    </voice>
</speak>
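The two forms of the break element can be generated from a small helper. This Python sketch is illustrative only: break_tag is a hypothetical name, and the STRENGTH_MS mapping simply restates the strength table in this section so that callers can compare a named strength with an explicit duration.

```python
# Pause durations in milliseconds, restating the strength table in this section.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 250,
    "weak": 500,
    "medium": 750,
    "strong": 1000,
    "x-strong": 1250,
}


def break_tag(strength=None, time_ms=None):
    """Return a break element using either an explicit time or a strength."""
    if time_ms is not None:
        return '<break time="%dms" />' % int(time_ms)
    if strength not in STRENGTH_MS:
        raise ValueError("unknown strength: %r" % (strength,))
    return '<break strength="%s" />' % strength


print(break_tag(time_ms=100))       # prints: <break time="100ms" />
print(break_tag(strength="weak"))   # prints: <break strength="weak" />
```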

Specify paragraphs and sentences

The p and s elements are used to denote paragraphs and sentences, respectively. In the absence of these elements, the text-to-speech service automatically determines the structure of the SSML document.

The p element may contain text and the following elements: audio, break, phoneme, prosody, say-as, sub, mstts:express-as, and s.

The s element may contain text and the following elements: audio, break, phoneme, prosody, say-as, mstts:express-as, and sub.

Syntax

<p></p>
<s></s>

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <p>
            <s>Introducing the sentence element.</s>
            <s>Used to mark individual sentences.</s>
        </p>
        <p>
            Another simple paragraph.
            Sentence structure in this paragraph is not explicitly marked.
        </p>
    </voice>
</speak>

Use phonemes to improve pronunciation

The phoneme element is used for phonetic pronunciation in SSML documents; its ph attribute carries the pronunciation. The phoneme element can only contain text, no other elements. Always provide human-readable speech as a fallback.

Phonetic alphabets are composed of phones, which are made up of letters, numbers, or characters, sometimes in combination. Each phone describes a unique sound of speech. This is in contrast to the Latin alphabet, where any letter may represent multiple spoken sounds. Consider the different pronunciations of the letter "c" in the words "candy" and "cease", or the different pronunciations of the letter combination "th" in the words "thing" and "those".

Syntax

<phoneme alphabet="string" ph="string"></phoneme>

Attributes

Attribute | Description | Required / Optional
alphabet | Specifies the phonetic alphabet to use when synthesizing the pronunciation of the string in the ph attribute. The string specifying the alphabet must be in lowercase letters. The alphabets used in the examples below are ipa, sapi, and ups. The alphabet applies only to the phoneme in the element. | Optional
ph | A string containing phones that specify the pronunciation of the word in the phoneme element. If the specified string contains unrecognized phones, the text-to-speech (TTS) service rejects the entire SSML document and produces none of the speech output specified in the document. | Required if using phonemes.

Examples

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
    </voice>
</speak>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <phoneme alphabet="sapi" ph="iy eh n y uw eh s"> en-US </phoneme>
    </voice>
</speak>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
    </voice>
</speak>

Use custom lexicon to improve pronunciation

Sometimes the text-to-speech service cannot accurately pronounce a word, such as the name of a company or a medical term. Developers can define how single entities are read in SSML using the phoneme and sub tags. However, if you need to define how multiple entities are read, you can create a custom lexicon using the lexicon tag.

Note

Custom lexicon currently supports UTF-8 encoding.

Syntax

<lexicon uri="string"/>

Attributes

Attribute | Description | Required / Optional
uri | The address of the external PLS document. | Required

Usage

To define how multiple entities are read, you can create a custom lexicon, which is stored as an .xml or .pls file. The following is a sample .xml file.

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon 
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>BTW</grapheme> 
    <alias>By the way</alias> 
  </lexeme>
  <lexeme>
    <grapheme> Benigni </grapheme> 
    <phoneme> bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>

The lexicon element contains at least one lexeme element. Each lexeme element contains at least one grapheme element and one or more grapheme, alias, and phoneme elements. The grapheme element contains text describing the orthography. The alias elements are used to indicate the pronunciation of an acronym or an abbreviated term. The phoneme element provides text describing how the lexeme is pronounced.

Note that you cannot directly set the pronunciation of a phrase using the custom lexicon. If you need to set the pronunciation for an acronym or an abbreviated term, first provide an alias, then associate the phoneme with that alias. For example:

  <lexeme>
    <grapheme>Scotland MV</grapheme> 
    <alias>ScotlandMV</alias> 
  </lexeme>
  <lexeme>
    <grapheme>ScotlandMV</grapheme> 
    <phoneme>ˈskɒtlənd.ˈmiːdiəm.weɪv</phoneme>
  </lexeme>

Important

The phoneme element cannot contain white spaces when using IPA.

For more information about custom lexicon files, see Pronunciation Lexicon Specification (PLS) Version 1.0.

Next, publish your custom lexicon file. While there are no restrictions on where this file can be stored, we recommend using Azure Blob Storage.

After you've published your custom lexicon, you can reference it from your SSML.

Note

The lexicon element must be inside the voice element.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
          xmlns:mstts="http://www.w3.org/2001/mstts" 
          xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <lexicon uri="http://www.example.com/customlexicon.xml"/>
        BTW, we will be there probably at 8:00 tomorrow morning.
        Could you help leave a message to Robert Benigni for me?
    </voice>
</speak>

When using this custom lexicon, "BTW" is read as "By the way". "Benigni" is read with the provided IPA "bɛˈniːnji".

Limitations

  • File size: the maximum custom lexicon file size is 100KB. If the file exceeds this size, the synthesis request fails.
  • Lexicon cache refresh: when a custom lexicon is first loaded, the TTS service caches it with its URI as the key. A lexicon with the same URI isn't reloaded within 15 minutes, so a change to a custom lexicon can take up to 15 minutes to take effect.
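A lexicon file can be checked locally before it is published. The Python sketch below parses a PLS document with the standard library and lists which reading each grapheme maps to, after rejecting input over the size limit. The name summarize_lexicon is hypothetical, and treating "100KB" as 100 * 1024 bytes is an assumption; the service's exact threshold is not stated here.

```python
import xml.etree.ElementTree as ET

PLS_NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"
MAX_BYTES = 100 * 1024  # assumption: "100KB" read as 100 * 1024 bytes


def summarize_lexicon(data):
    """Map each grapheme to its alias (or phoneme) text; data is raw bytes."""
    if len(data) > MAX_BYTES:
        raise ValueError("lexicon exceeds the 100KB limit")
    root = ET.fromstring(data)
    entries = {}
    for lexeme in root.iter(PLS_NS + "lexeme"):
        reading = lexeme.find(PLS_NS + "alias")
        if reading is None:
            reading = lexeme.find(PLS_NS + "phoneme")
        value = (reading.text or "").strip() if reading is not None else None
        for grapheme in lexeme.findall(PLS_NS + "grapheme"):
            entries[(grapheme.text or "").strip()] = value
    return entries


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>BTW</grapheme>
    <alias>By the way</alias>
  </lexeme>
</lexicon>"""
print(summarize_lexicon(sample))  # prints: {'BTW': 'By the way'}
```

Because of the 15-minute cache described above, one practical option is to publish an edited lexicon under a new URI when a change must take effect immediately; that is a workaround suggested by the caching rule, not a documented feature.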

Speech service phonetic sets

In the sample above, we're using the International Phonetic Alphabet, also known as the IPA phone set. We suggest developers use the IPA, because it is the international standard. Some IPA characters have both a 'precomposed' and a 'decomposed' form when represented in Unicode. Custom lexicons support only the decomposed Unicode form.
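To illustrate the difference between the two forms, the Python sketch below converts a precomposed character into its decomposed sequence with the standard unicodedata module. It assumes that "decomposed" here corresponds to Unicode canonical decomposition (NFD), which this section does not state explicitly; to_decomposed is a hypothetical helper name.

```python
import unicodedata


def to_decomposed(ipa):
    """Return the canonically decomposed (NFD) form of an IPA string."""
    return unicodedata.normalize("NFD", ipa)


precomposed = "\u00e3"  # one code point: LATIN SMALL LETTER A WITH TILDE
decomposed = to_decomposed(precomposed)
print([hex(ord(ch)) for ch in decomposed])  # prints: ['0x61', '0x303']
```

Both strings render identically, so normalizing phoneme text before writing it to the lexicon file is easier than inspecting it by eye.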

Considering that the IPA is not easy to remember, the Speech service defines a phonetic set for seven languages (en-US, fr-FR, de-DE, es-ES, ja-JP, zh-CN, and zh-TW).

You can use sapi as the value for the alphabet attribute with custom lexicons, as demonstrated below:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" 
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="sapi" xml:lang="en-US">
  <lexeme>
    <grapheme>BTW</grapheme>
    <alias> By the way </alias>
  </lexeme>
  <lexeme>
    <grapheme> Benigni </grapheme>
    <phoneme> b eh 1 - n iy - n y iy </phoneme>
  </lexeme>
</lexicon>

For more information on the detailed Speech service phonetic alphabet, see the Speech service phonetic sets.

Adjust prosody

The prosody element is used to specify changes to pitch, contour, range, rate, duration, and volume for the text-to-speech output. The prosody element may contain text and the following elements: audio, break, p, phoneme, prosody, say-as, sub, and s.

Because prosodic attribute values can vary over a wide range, the text-to-speech service interprets the assigned values as a suggestion of what the actual prosodic values of the selected voice should be. The service limits or substitutes values that are not supported. Examples of unsupported values are a pitch of 1 MHz or a volume of 120.

Syntax

<prosody pitch="value" contour="value" range="value" rate="value" duration="value" volume="value"></prosody>

Attributes

Attribute | Description | Required / Optional
pitch | Indicates the baseline pitch for the text. You may express the pitch as:
  • An absolute value, expressed as a number followed by "Hz" (Hertz). For example, 600 Hz.
  • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch. For example: +80 Hz or -2st. The "st" indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.
  • A constant value: x-low, low, medium, high, x-high, or default.
| Optional
contour | Contour now supports both neural and standard voices. Contour represents changes in pitch. These changes are represented as an array of targets at specified time positions in the speech output. Each target is defined by sets of parameter pairs. For example:

<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">

The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see pitch).
| Optional
range | A value that represents the range of pitch for the text. You may express range using the same absolute values, relative values, or enumeration values used to describe pitch. | Optional
rate | Indicates the speaking rate of the text. You may express rate as:
  • A relative value, expressed as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the rate. A value of 0.5 results in a halving of the rate. A value of 3 results in a tripling of the rate.
  • A constant value: x-slow, slow, medium, fast, x-fast, or default.
| Optional
duration | The period of time that should elapse while the speech synthesis (TTS) service reads the text, in seconds or milliseconds. For example, 2s or 1800ms. | Optional
volume | Indicates the volume level of the speaking voice. You may express the volume as:
  • An absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. For example, 75. The default is 100.0.
  • A relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume. For example, +10 or -5.5.
  • A constant value: silent, x-soft, soft, medium, loud, x-loud, or default.
| Optional

Change speaking rate

Speaking rate can be applied to neural voices and standard voices at the word or sentence level.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-GuyNeural">
        <prosody rate="+30.00%">
            Welcome to Microsoft Cognitive Services Text-to-Speech API.
        </prosody>
    </voice>
</speak>

Change volume

Volume changes can be applied to standard voices at the word or sentence level, whereas they can only be applied to neural voices at the sentence level.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <prosody volume="+20.00%">
            Welcome to Microsoft Cognitive Services Text-to-Speech API.
        </prosody>
    </voice>
</speak>

Change pitch

Pitch changes can be applied to standard voices at the word or sentence level, whereas they can only be applied to neural voices at the sentence level.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-Guy24kRUS">
        Welcome to <prosody pitch="high">Microsoft Cognitive Services Text-to-Speech API.</prosody>
    </voice>
</speak>

Change pitch contour

Important

Pitch contour changes are now supported with neural voices.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        <prosody contour="(60%,-60%) (100%,+80%)" >
            Were you the only person in the room? 
        </prosody>
    </voice>
</speak>
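A contour value is a space-separated list of (position, change) targets, so it is straightforward to split into pairs for inspection or logging. The Python sketch below shows one way to do this with a regular expression; parse_contour is a hypothetical helper, and it only splits the string without checking that the units are ones the service accepts.

```python
import re

# One target looks like "(position,change)", for example "(10%,-2st)".
_TARGET = re.compile(r"\(\s*([^,()\s]+)\s*,\s*([^,()\s]+)\s*\)")


def parse_contour(contour):
    """Split a contour attribute value into (position, change) string pairs."""
    return _TARGET.findall(contour)


print(parse_contour("(0%,+20Hz) (10%,-2st) (40%,+10Hz)"))
# prints: [('0%', '+20Hz'), ('10%', '-2st'), ('40%', '+10Hz')]
```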

say-as element

say-as is an optional element that indicates the content type (such as number or date) of the element's text. This provides guidance to the speech synthesis engine about how to pronounce the text.

Syntax

<say-as interpret-as="string" format="digit string" detail="string"></say-as>

Attributes

Attribute | Description | Required / Optional
interpret-as | Indicates the content type of the element's text. For a list of types, see the table below. | Required
format | Provides additional information about the precise formatting of the element's text for content types that may have ambiguous formats. SSML defines formats for content types that use them (see table below). | Optional
detail | Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for detail. | Optional

The following are the supported content types for the interpret-as and format attributes. Include the format attribute only for content types that use it: date and time, and optionally telephone for a country code.

interpret-as | format | Interpretation
address | | The text is spoken as an address. The speech synthesis engine pronounces:

I'm at <say-as interpret-as="address">150th CT NE, Redmond, WA</say-as>

As "I'm at 150th court north east redmond washington."
cardinal, number | | The text is spoken as a cardinal number. The speech synthesis engine pronounces:

There are <say-as interpret-as="cardinal">3</say-as> alternatives

As "There are three alternatives."
characters, spell-out | | The text is spoken as individual letters (spelled out). The speech synthesis engine pronounces:

<say-as interpret-as="characters">test</say-as>

As "T E S T."
date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year). The speech synthesis engine pronounces:

Today is <say-as interpret-as="date" format="mdy">10-19-2016</say-as>

As "Today is October nineteenth two thousand sixteen."
digits, number_digit | | The text is spoken as a sequence of individual digits. The speech synthesis engine pronounces:

<say-as interpret-as="number_digit">123456789</say-as>

As "1 2 3 4 5 6 7 8 9."
fraction | | The text is spoken as a fractional number. The speech synthesis engine pronounces:

<say-as interpret-as="fraction">3/8</say-as> of an inch

As "three eighths of an inch."
ordinal | | The text is spoken as an ordinal number. The speech synthesis engine pronounces:

Select the <say-as interpret-as="ordinal">3rd</say-as> option

As "Select the third option."
telephone | | The text is spoken as a telephone number. The format attribute may contain digits that represent a country code. For example, "1" for the United States or "39" for Italy. The speech synthesis engine may use this information to guide its pronunciation of a phone number. The phone number may also include the country code, and if so, it takes precedence over the country code in the format. The speech synthesis engine pronounces:

The number is <say-as interpret-as="telephone" format="1">(888) 555-1212</say-as>

As "The number is area code eight eight eight five five five one two one two."
time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time is specified using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds. The following are valid time examples: 12:35, 1:14:32, 08:15, and 02:50:45. The speech synthesis engine pronounces:

The train departs at <say-as interpret-as="time" format="hms12">4:00am</say-as>

As "The train departs at four A M."

Usage

The say-as element may contain only text.

Example

The speech synthesis engine speaks the following example as "Your first request was for one room on October nineteenth twenty ten with early arrival at twelve thirty five PM."

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <p>
        Your <say-as interpret-as="ordinal"> 1st </say-as> request was for <say-as interpret-as="cardinal"> 1 </say-as> room
        on <say-as interpret-as="date" format="mdy"> 10/19/2010 </say-as>, with early arrival at <say-as interpret-as="time" format="hms12"> 12:35pm </say-as>.
        </p>
    </voice>
</speak>
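
The say-as fragments in the example above can also be generated programmatically. The following is a minimal Python sketch, not part of the Speech service or its SDK: the helper name say_as and its parameters are hypothetical. It relies only on the standard library to escape special characters, since the say-as element may contain only text.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as, fmt=None):
    """Wrap plain text in a say-as element (hypothetical helper)."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    # say-as may contain only text, so markup characters are escaped.
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("10/19/2010", "date", "mdy"))
print(say_as("12:35pm", "time", "hms12"))
```

The returned strings can be placed inside a voice element like the one in the example.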

Add recorded audio

audio is an optional element that lets you insert MP3 audio into an SSML document. The body of the audio element may contain plain text or SSML markup, which is spoken if the audio file is unavailable or unplayable. The audio element can contain text and the following elements: audio, break, p, s, phoneme, prosody, say-as, and sub.

Any audio included in the SSML document must meet these requirements:

  • The MP3 must be hosted on an Internet-accessible HTTPS endpoint. HTTPS is required, and the domain hosting the MP3 file must present a valid, trusted TLS/SSL certificate.
  • The MP3 must be a valid MP3 file (MPEG v2).
  • The bit rate must be 48 kbps.
  • The sample rate must be 16,000 Hz.
  • The combined total time for all text and audio files in a single response cannot exceed 90 seconds.
  • The MP3 must not contain any customer-specific or other sensitive information.
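
Some of these requirements can be checked before a request is sent. The sketch below is an illustration under stated assumptions: check_audio_src is a hypothetical helper, it inspects only the URL, and the .mp3 extension test is a heuristic that is not itself a service rule. It does not verify the certificate, bit rate, sample rate, or duration.

```python
from urllib.parse import urlparse


def check_audio_src(url):
    """Return a list of problems found in an audio src URL (hypothetical helper)."""
    problems = []
    parts = urlparse(url)
    if parts.scheme != "https":
        problems.append("src must use HTTPS")
    if not parts.netloc:
        problems.append("src must name a host")
    # Heuristic only: a file extension does not prove the content is MP3.
    if not parts.path.lower().endswith(".mp3"):
        problems.append("path does not end in .mp3")
    return problems


print(check_audio_src("https://contoso.com/beep.mp3"))
print(check_audio_src("http://contoso.com/beep.mp3"))
```

An empty list means no problem was detected by these URL-level checks.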

Syntax

<audio src="string"></audio>

Attributes

Attribute | Description | Required / Optional
src | Specifies the location/URL of the audio file. | Required if using the audio element in your SSML document.

Example

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaRUS">
        <p>
            <audio src="https://contoso.com/opinionprompt.mp3"/>
            Thanks for offering your opinion. Please begin speaking after the beep.
            <audio src="https://contoso.com/beep.mp3">
                Could not play the beep, please voice your opinion now.
            </audio>
        </p>
    </voice>
</speak>

Add background audio

The mstts:backgroundaudio element lets you add background audio to your SSML documents (that is, mix an audio file with text-to-speech). With mstts:backgroundaudio you can loop an audio file in the background, fade it in at the beginning of the text-to-speech, and fade it out at the end.

If the background audio is shorter than the text-to-speech or the fade out, it loops. If it is longer than the text-to-speech, it stops when the fade out has finished.

Only one background audio file is allowed per SSML document. However, you can intersperse audio tags within the voice element to add more audio to your SSML document.
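
The one-file limit can be checked on a document before it is submitted. This is a minimal sketch, assuming the mstts namespace URI shown in this page's example; count_background_audio is a hypothetical helper and the sample document is illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace URI for the mstts prefix, as used in this page's example.
MSTTS_NS = "http://www.w3.org/2001/mstts"


def count_background_audio(ssml):
    """Count mstts:backgroundaudio elements in an SSML string (hypothetical helper)."""
    root = ET.fromstring(ssml)
    return len(root.findall(f".//{{{MSTTS_NS}}}backgroundaudio"))


doc = (
    '<speak version="1.0" xml:lang="en-US" '
    'xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="http://www.w3.org/2001/mstts">'
    '<mstts:backgroundaudio src="https://contoso.com/sample.wav"/>'
    '<voice name="en-US-AriaRUS">Hello.</voice>'
    '</speak>'
)
print(count_background_audio(doc))
```

A count greater than one indicates a document that breaks the limit described above.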

Syntax

<mstts:backgroundaudio src="string" volume="string" fadein="string" fadeout="string"/>

Attributes

Attribute | Description | Required / Optional
src | Specifies the location/URL of the background audio file. | Required if using background audio in your SSML document.
volume | Specifies the volume of the background audio file. Accepted values: 0 to 100 inclusive. The default value is 1. | Optional
fadein | Specifies the duration of the background audio fade in, in milliseconds. Accepted values: 0 to 10000 inclusive. The default value is 0, which is equivalent to no fade in. | Optional
fadeout | Specifies the duration of the background audio fade out, in milliseconds. Accepted values: 0 to 10000 inclusive. The default value is 0, which is equivalent to no fade out. | Optional
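
The accepted ranges can be enforced when the element is generated. The following Python sketch is an assumption-laden illustration rather than a service API: background_audio is a hypothetical helper, and its defaults simply mirror the defaults described for the attributes.

```python
from xml.sax.saxutils import quoteattr


def background_audio(src, volume=1, fadein=0, fadeout=0):
    """Build an mstts:backgroundaudio element string (hypothetical helper)."""
    if not 0 <= volume <= 100:
        raise ValueError("volume must be between 0 and 100")
    for name, value in (("fadein", fadein), ("fadeout", fadeout)):
        if not 0 <= value <= 10000:
            raise ValueError(f"{name} must be between 0 and 10000 ms")
    # quoteattr escapes special characters in the URL and adds the quotes.
    return (
        f'<mstts:backgroundaudio src={quoteattr(src)} volume="{volume}" '
        f'fadein="{fadein}" fadeout="{fadeout}"/>'
    )


print(background_audio("https://contoso.com/sample.wav", 0.7, 3000, 4000))
```

Out-of-range values raise ValueError instead of producing an element.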

Example

<speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts">
    <mstts:backgroundaudio src="https://contoso.com/sample.wav" volume="0.7" fadein="3000" fadeout="4000"/>
    <voice name="Microsoft Server Speech Text to Speech Voice (en-US, AriaRUS)">
        The text provided in this document will be spoken over the background audio.
    </voice>
</speak>

Next steps