Learn the basics of speech synthesis

In this article, you learn common design patterns for doing text-to-speech synthesis using the Speech SDK. You start by doing basic configuration and synthesis, and move on to more advanced examples for custom application development, including:

  • Getting responses as in-memory streams
  • Customizing output sample rate and bit rate
  • Submitting synthesis requests using SSML (speech synthesis markup language)
  • Using neural voices

Tip

If you haven't had a chance to complete one of our quickstarts, we encourage you to kick the tires and try text-to-speech out for yourself.

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Import dependencies

To run the examples in this article, include the following using statements at the top of your script.

using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

public class Program 
{
    static async Task Main()
    {
        await SynthesizeAudioAsync();
    }

    static async Task SynthesizeAudioAsync() 
    {
        var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    }
}

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file, using the FromWavFileOutput() function, and instantiate it with a using statement. A using declaration in this context ensures that unmanaged resources are disposed of automatically when the object goes out of scope.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var audioConfig = AudioConfig.FromWavFileOutput("path/to/write/file.wav");
}

Next, instantiate a SpeechSynthesizer with another using statement. Pass your config object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running SpeakTextAsync() with a string of text.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var audioConfig = AudioConfig.FromWavFileOutput("path/to/write/file.wav");
    using var synthesizer = new SpeechSynthesizer(config, audioConfig);
    await synthesizer.SpeakTextAsync("A simple test to write to a file.");
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, simply omit the AudioConfig param when creating the SpeechSynthesizer in the example above. This outputs to the current active output device.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var synthesizer = new SpeechSynthesizer(config);
    await synthesizer.SpeakTextAsync("Synthesizing directly to speaker output.");
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than directly writing to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seek-able stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig block, as you will manage the output behavior manually from this point onward for increased control. Then pass null for the AudioConfig in the SpeechSynthesizer constructor.

Note

Passing null for the AudioConfig, rather than omitting it like in the speaker output example above, will not play the audio by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The AudioData property contains a byte [] of the output data. You can work with this byte [] manually, or you can use the AudioDataStream class to manage the in-memory stream. In this example, you use the AudioDataStream.FromResult() static function to get a stream from the result.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var synthesizer = new SpeechSynthesizer(config, null);
    
    var result = await synthesizer.SpeakTextAsync("Getting the response as an in-memory stream.");
    using var stream = AudioDataStream.FromResult(result);
}

From here you can implement any custom behavior using the resulting stream object.
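
For example, a minimal sketch of draining the stream into a MemoryStream for further processing might look like the following. It assumes that ReadData(byte[]) returns the number of bytes read, and 0 once the stream is exhausted.

static void CopyStreamToMemory(AudioDataStream stream)
{
    // Drain the synthesized audio into a MemoryStream, 4 KB at a time.
    using var memoryStream = new MemoryStream();
    var buffer = new byte[4096];
    uint bytesRead;
    while ((bytesRead = stream.ReadData(buffer)) > 0)
    {
        memoryStream.Write(buffer, 0, (int)bytesRead);
    }
    Console.WriteLine($"Read {memoryStream.Length} bytes of audio data.");
}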

Customize audio format

The following section shows how to customize audio output attributes, including:

  • Audio file type
  • Sample rate
  • Bit depth

To change the audio format, you use the SetSpeechSynthesisOutputFormat() function on the SpeechConfig object. This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types, depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit depth, sample rate, number of channels, etc.
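
If you do choose a raw format and need to add a header yourself, a rough sketch of prepending a standard 44-byte RIFF/WAV header to the raw PCM bytes is shown below. This helper is hypothetical (it isn't part of the Speech SDK) and assumes uncompressed PCM data whose sample rate, channel count, and bit depth you already know.

static byte[] AddWavHeader(byte[] pcmData, int sampleRate, short channels, short bitsPerSample)
{
    int byteRate = sampleRate * channels * bitsPerSample / 8;
    short blockAlign = (short)(channels * bitsPerSample / 8);

    using var memoryStream = new MemoryStream();
    using var writer = new BinaryWriter(memoryStream, Encoding.ASCII);

    writer.Write(Encoding.ASCII.GetBytes("RIFF"));
    writer.Write(36 + pcmData.Length);             // RIFF chunk size
    writer.Write(Encoding.ASCII.GetBytes("WAVE"));
    writer.Write(Encoding.ASCII.GetBytes("fmt "));
    writer.Write(16);                              // fmt chunk size (PCM)
    writer.Write((short)1);                        // audio format: 1 = PCM
    writer.Write(channels);
    writer.Write(sampleRate);
    writer.Write(byteRate);
    writer.Write(blockAlign);
    writer.Write(bitsPerSample);
    writer.Write(Encoding.ASCII.GetBytes("data"));
    writer.Write(pcmData.Length);                  // data chunk size
    writer.Write(pcmData);
    writer.Flush();

    return memoryStream.ToArray();
}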

In this example, you specify a high-fidelity RIFF format, Riff24Khz16BitMonoPcm, by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    config.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

    using var synthesizer = new SpeechSynthesizer(config, null);
    var result = await synthesizer.SpeakTextAsync("Customizing audio output format.");

    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("path/to/write/file.wav");
}

Running your program again will write a .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the SpeakTextAsync() function, you use SpeakSsmlAsync(). This function expects an XML string, so you first load your SSML config as a string using File.ReadAllText(). From here, the result object is exactly the same as previous examples.

Note

If you're using Visual Studio, your build config likely will not find your XML file by default. To fix this, right-click the XML file and select Properties. Change Build Action to Content, and change Copy to Output Directory to Copy always.

public static async Task SynthesizeAudioAsync() 
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using var synthesizer = new SpeechSynthesizer(config, null);
    
    var ssml = File.ReadAllText("./ssml.xml");
    var result = await synthesizer.SpeakSsmlAsync(ssml);

    using var stream = AudioDataStream.FromResult(result);
    await stream.SaveToWaveFileAsync("path/to/write/file.wav");
}

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affected the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When using a neural voice, synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Import dependencies

To run the examples in this article, include the following import and using statements at the top of your script.

#include <iostream>
#include <fstream>
#include <string>
#include <speechapi_cxx.h>

using namespace std;
using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

int wmain()
{
    try
    {
        synthesizeSpeech();
    }
    catch (const exception& e)
    {
        cout << e.what();
    }
    return 0;
}
    
void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
}

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file, using the FromWavFileOutput() function.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
}

Next, instantiate a SpeechSynthesizer, passing your config object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running SpeakTextAsync() with a string of text.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto audioConfig = AudioConfig::FromWavFileOutput("path/to/write/file.wav");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, audioConfig);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, simply omit the AudioConfig param when creating the SpeechSynthesizer in the example above. This outputs to the current active output device.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config);
    auto result = synthesizer->SpeakTextAsync("Synthesizing directly to speaker output.").get();
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than directly writing to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seek-able stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig, as you will manage the output behavior manually from this point onward for increased control. Then pass NULL for the AudioConfig in the SpeechSynthesizer constructor.

Note

Passing NULL for the AudioConfig, rather than omitting it like in the speaker output example above, will not play the audio by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The GetAudioData getter returns a byte [] of the output data. You can work with this byte [] manually, or you can use the AudioDataStream class to manage the in-memory stream. In this example, you use the AudioDataStream.FromResult() static function to get a stream from the result.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    
    auto result = synthesizer->SpeakTextAsync("Getting the response as an in-memory stream.").get();
    auto stream = AudioDataStream::FromResult(result);
}

From here you can implement any custom behavior using the resulting stream object.
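
For example, a minimal sketch of draining the stream into a byte vector for further processing might look like the following. It assumes ReadData(uint8_t*, uint32_t) returns the number of bytes read (0 at the end of the stream), and that you add #include <vector> to the includes above.

void copyStreamToMemory(std::shared_ptr<AudioDataStream> stream)
{
    // Drain the synthesized audio into a byte vector, 4 KB at a time.
    std::vector<uint8_t> audioData;
    std::vector<uint8_t> buffer(4096);
    uint32_t bytesRead = 0;
    while ((bytesRead = stream->ReadData(buffer.data(), static_cast<uint32_t>(buffer.size()))) > 0)
    {
        audioData.insert(audioData.end(), buffer.begin(), buffer.begin() + bytesRead);
    }
    cout << "Read " << audioData.size() << " bytes of audio data." << endl;
}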

Customize audio format

The following section shows how to customize audio output attributes, including:

  • Audio file type
  • Sample rate
  • Bit depth

To change the audio format, you use the SetSpeechSynthesisOutputFormat() function on the SpeechConfig object. This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types, depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit depth, sample rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format, Riff24Khz16BitMonoPcm, by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    config->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Riff24Khz16BitMonoPcm);

    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    auto result = synthesizer->SpeakTextAsync("A simple test to write to a file.").get();
    
    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}

Running your program again will write a .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the SpeakTextAsync() function, you use SpeakSsmlAsync(). This function expects an XML string, so you first load your SSML config as a string. From here, the result object is exactly the same as previous examples.

void synthesizeSpeech() 
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    auto synthesizer = SpeechSynthesizer::FromConfig(config, NULL);
    
    std::ifstream file("./ssml.xml");
    std::string ssml, line;
    while (std::getline(file, line))
    {
        ssml += line;
        ssml.push_back('\n');
    }
    auto result = synthesizer->SpeakSsmlAsync(ssml).get();
    
    auto stream = AudioDataStream::FromResult(result);
    stream->SaveToWavFileAsync("path/to/write/file.wav").get();
}

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affected the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When using a neural voice, synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the Speech SDK. Depending on your platform, use the following instructions:

Import dependencies

To run the examples in this article, include the following import statements at the top of your script.

import com.microsoft.cognitiveservices.speech.AudioDataStream;
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechSynthesizer;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisOutputFormat;
import com.microsoft.cognitiveservices.speech.SpeechSynthesisResult;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;

import java.io.*;
import java.util.Scanner;

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

public class Program 
{
    public static void main(String[] args) {
        SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    }
}

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file using the fromWavFileOutput() static function.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    AudioConfig audioConfig = AudioConfig.fromWavFileOutput("path/to/write/file.wav");
}

Next, instantiate a SpeechSynthesizer, passing your speechConfig object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running SpeakText() with a string of text.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    AudioConfig audioConfig = AudioConfig.fromWavFileOutput("path/to/write/file.wav");

    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.SpeakText("A simple test to write to a file.");
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, instantiate the AudioConfig using the fromDefaultSpeakerOutput() static function. This outputs to the current active output device.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    AudioConfig audioConfig = AudioConfig.fromDefaultSpeakerOutput();

    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.SpeakText("Synthesizing directly to speaker output.");
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than directly writing to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seek-able stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig block, as you will manage the output behavior manually from this point onward for increased control. Then pass null for the AudioConfig in the SpeechSynthesizer constructor.

Note

Passing null for the AudioConfig, rather than omitting it like in the speaker output example above, will not play the audio by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The SpeechSynthesisResult.getAudioData() function returns a byte [] of the output data. You can work with this byte [] manually, or you can use the AudioDataStream class to manage the in-memory stream. In this example, you use the AudioDataStream.fromResult() static function to get a stream from the result.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null);
    
    SpeechSynthesisResult result = synthesizer.SpeakText("Getting the response as an in-memory stream.");
    AudioDataStream stream = AudioDataStream.fromResult(result);
    System.out.print(stream.getStatus());
}

From here you can implement any custom behavior using the resulting stream object.
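
For example, a minimal sketch of draining the stream into a ByteArrayOutputStream for further processing might look like the following. It assumes readData(byte[]) returns the number of bytes read, and 0 once the stream is exhausted.

private static byte[] copyStreamToMemory(AudioDataStream stream) {
    // Drain the synthesized audio into a byte array, 4 KB at a time.
    ByteArrayOutputStream audioData = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    long bytesRead;
    while ((bytesRead = stream.readData(buffer)) > 0) {
        audioData.write(buffer, 0, (int) bytesRead);
    }
    System.out.println("Read " + audioData.size() + " bytes of audio data.");
    return audioData.toByteArray();
}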

Customize audio format

The following section shows how to customize audio output attributes, including:

  • Audio file type
  • Sample rate
  • Bit depth

To change the audio format, you use the setSpeechSynthesisOutputFormat() function on the SpeechConfig object. This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types, depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit depth, sample rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format, Riff24Khz16BitMonoPcm, by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // set the output format
    speechConfig.setSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null);
    SpeechSynthesisResult result = synthesizer.SpeakText("Customizing audio output format.");
    AudioDataStream stream = AudioDataStream.fromResult(result);
    stream.saveToWavFile("path/to/write/file.wav");
}

Running your program again will write a .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the SpeakText() function, you use SpeakSsml(). This function expects an XML string, so first you create a function to load an XML file and return it as a string.

private static String xmlToString(String filePath) {
    File file = new File(filePath);
    StringBuilder fileContents = new StringBuilder((int)file.length());

    try (Scanner scanner = new Scanner(file)) {
        while(scanner.hasNextLine()) {
            fileContents.append(scanner.nextLine() + System.lineSeparator());
        }
        return fileContents.toString().trim();
    } catch (FileNotFoundException ex) {
        return "File not found.";
    }
}

From here, the result object is exactly the same as previous examples.

public static void main(String[] args) {
    SpeechConfig speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, null);

    String ssml = xmlToString("ssml.xml");
    SpeechSynthesisResult result = synthesizer.SpeakSsml(ssml);
    AudioDataStream stream = AudioDataStream.fromResult(result);
    stream.saveToWavFile("path/to/write/file.wav");
}

The output works, but there are a few simple additional changes you can make to help it sound more natural. The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. Re-run the synthesis to see how these customizations affected the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

Neural voices

Neural voices are speech synthesis algorithms powered by deep neural networks. When using a neural voice, synthesized speech is nearly indistinguishable from human recordings. With human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

To switch to a neural voice, change the name to one of the neural voice options. Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. Use the style param to customize the speaking style. This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

Prerequisites

This article assumes that you have an Azure account and Speech service subscription. If you don't have an account and subscription, try the Speech service for free.

Install the Speech SDK

Before you can do anything, you'll need to install the JavaScript Speech SDK. Depending on your platform, use the following instructions:

Additionally, depending on the target environment, use one of the following:

import { readFileSync } from "fs";
import {
    AudioConfig,
    SpeechConfig,
    SpeechSynthesisOutputFormat,
    SpeechSynthesizer 
} from "microsoft-cognitiveservices-speech-sdk";
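
If you're targeting Node.js with CommonJS modules rather than ES modules, a sketch of the equivalent (assuming the same package name) uses require instead:

const { readFileSync } = require("fs");
const {
    AudioConfig,
    SpeechConfig,
    SpeechSynthesisOutputFormat,
    SpeechSynthesizer
} = require("microsoft-cognitiveservices-speech-sdk");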

For more information on import, see export and import.

Create a speech configuration

To call the Speech service using the Speech SDK, you need to create a SpeechConfig. This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

Note

Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

There are a few ways that you can initialize a SpeechConfig:

  • With a subscription: pass in a key and the associated region.
  • With an endpoint: pass in a Speech service endpoint. A key or authorization token is optional.
  • With a host: pass in a host address. A key or authorization token is optional.
  • With an authorization token: pass in an authorization token and the associated region.

In this example, you create a SpeechConfig using a subscription key and region. See the region support page to find your region identifier. You also create some basic boilerplate code to use for the rest of this article, which you modify for different customizations.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
}

Synthesize speech to a file

Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioConfig object that specifies how output results should be handled.

To start, create an AudioConfig to automatically write the output to a .wav file using the fromAudioFileOutput() static function.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = AudioConfig.fromAudioFileOutput("path/to/file.wav");
}

Next, instantiate a SpeechSynthesizer, passing your speechConfig object and the audioConfig object as params. Then, executing speech synthesis and writing to a file is as simple as running speakTextAsync() with a string of text. The result callback is a great place to call synthesizer.close(); in fact, this call is needed in order for synthesis to function correctly.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = AudioConfig.fromAudioFileOutput("path-to-file.wav");

    const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync(
        "A simple test to write to a file.",
        result => {
            if (result) {
                console.log(JSON.stringify(result));
            }
            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

Run the program, and a synthesized .wav file is written to the location you specified. This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.

Synthesize to speaker output

In some cases, you may want to output synthesized speech directly to a speaker. To do this, instantiate the AudioConfig using the fromDefaultSpeakerOutput() static function. This outputs to the current active output device.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const audioConfig = AudioConfig.fromDefaultSpeakerOutput();

    const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
    synthesizer.speakTextAsync(
        "Synthesizing directly to speaker output.",
        result => {
            if (result) {
                console.log(JSON.stringify(result));
            }
            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

Get result as an in-memory stream

For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than directly writing to a file. This allows you to build custom behavior, including:

  • Abstract the resulting byte array as a seek-able stream for custom downstream services.
  • Integrate the result with other APIs or services.
  • Modify the audio data, write custom .wav headers, etc.

It's simple to make this change from the previous example. First, remove the AudioConfig block, as you will manage the output behavior manually from this point onward for increased control. Then pass undefined for the AudioConfig in the SpeechSynthesizer constructor.

Note

Passing undefined for the AudioConfig, rather than omitting it like in the speaker output example above, will not play the audio by default on the current active output device.

This time, you save the result to a SpeechSynthesisResult variable. The SpeechSynthesisResult.audioData property returns an ArrayBuffer of the output data. You can work with this ArrayBuffer manually.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new SpeechSynthesizer(speechConfig, undefined);

    synthesizer.speakTextAsync(
        "Getting the response as an in-memory stream.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
            console.log(`Audio data byte size: ${audioData.byteLength}.`)

            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

From here you can implement any custom behavior using the resulting ArrayBuffer object.
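
For example, in a Node.js environment you could persist the buffer to disk with the built-in fs module. This is only a sketch; it assumes the default output format, which already includes a RIFF/WAV header, so the bytes can be written to a .wav file as-is.

import { writeFileSync } from "fs";

// Hypothetical helper: audioData is the ArrayBuffer from result.audioData.
function saveAudioToFile(audioData, filePath) {
    writeFileSync(filePath, Buffer.from(audioData));
}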

Customize audio format

The following section shows how to customize audio output attributes, including:

  • Audio file type
  • Sample rate
  • Bit depth

To change the audio format, you use the speechSynthesisOutputFormat property on the SpeechConfig object. This property expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. See the reference docs for a list of audio formats that are available.

There are various options for different file types, depending on your requirements. Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit depth, sample rate, number of channels, etc.

In this example, you specify a high-fidelity RIFF format, Riff24Khz16BitMonoPcm, by setting speechSynthesisOutputFormat on the SpeechConfig object. Similar to the example in the previous section, get the audio ArrayBuffer data and interact with it.

function synthesizeSpeech() {
    const speechConfig = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // Set the output format
    speechConfig.speechSynthesisOutputFormat = SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm;

    const synthesizer = new SpeechSynthesizer(speechConfig, undefined);
    synthesizer.speakTextAsync(
        "Customizing audio output format.",
        result => {
            // Interact with the audio ArrayBuffer data
            const audioData = result.audioData;
            console.log(`Audio data byte size: ${audioData.byteLength}.`)

            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

Running your program again will write a .wav file to the specified path.

Use SSML to customize speech characteristics

Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

To start using SSML for customization, you make a simple change that switches the voice. First, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. This example changes the voice to a male English (UK) voice. Note that this voice is a standard voice, which has different pricing and availability than neural voices. See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

Next, you need to change the speech synthesis request to reference your XML file. The request is mostly the same, but instead of using the speakTextAsync() function, you use speakSsmlAsync(). This function expects an XML string, so first you create a function to load an XML file and return it as a string.

// readFileSync comes from the Node.js file system module
const { readFileSync } = require("fs");

function xmlToString(filePath) {
    const xml = readFileSync(filePath, "utf8");
    return xml;
}

有关 readFileSync 的详细信息,请参阅 Node.js 文件系统For more information on readFileSync, see Node.js file system. 在此处,结果对象与前面的示例完全相同。From here, the result object is exactly the same as previous examples.

function synthesizeSpeech() {
    const speechConfig = sdk.SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
    const synthesizer = new sdk.SpeechSynthesizer(speechConfig, undefined);

    const ssml = xmlToString("ssml.xml");
    synthesizer.speakSsmlAsync(
        ssml,
        result => {
            if (result.errorDetails) {
                console.error(result.errorDetails);
            } else {
                console.log(JSON.stringify(result));
            }

            synthesizer.close();
        },
        error => {
            console.log(error);
            synthesizer.close();
        });
}

输出正常,但可以做出几项简单的附加更改,使语音输出听起来更自然。The output works, but there are a few simple additional changes you can make to help it sound more natural. 整体语速稍有点快,因此,我们将添加一个 <prosody> 标记,并将语速降至默认语速的 90%。The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. 此外,句子中逗号后面的停顿稍有点短,听起来不太自然。Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. 若要解决此问题,可添加一个 <break> 标记来延迟语音,然后将时间参数设置为 200ms。To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. 重新运行合成,以查看这些自定义操作对输出的影响。Re-run the synthesis to see how these customizations affect the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

神经语音Neural voices

神经语音是立足于深度神经网络的语音合成算法。Neural voices are speech synthesis algorithms powered by deep neural networks. 使用神经语音时,几乎无法将合成的语音与人类录音区分开来。When using a neural voice, synthesized speech is nearly indistinguishable from the human recordings. 随着类人的自然韵律和字词的清晰发音,用户在与 AI 系统交互时,神经语音显著减轻了听力疲劳。With the human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

若要切换到某种神经语音,请将 name 更改为神经语音选项之一。To switch to a neural voice, change the name to one of the neural voice options. 然后,为 mstts 添加 XML 命名空间,并在 <mstts:express-as> 标记中包装文本。Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. 使用 style 参数自定义讲话风格。Use the style param to customize the speaking style. 此示例使用 cheerful,但请尝试将其设置为 customerservicechat,以了解讲话风格的差别。This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>

先决条件Prerequisites

本文假定你有 Azure 帐户和语音服务订阅。This article assumes that you have an Azure account and Speech service subscription. 如果你没有帐户和订阅,可以免费试用语音服务If you don't have an account and subscription, try the Speech service for free.

安装语音 SDKInstall the Speech SDK

你需要先安装语音 SDK,然后才能执行任何操作。Before you can do anything, you'll need to install the Speech SDK.

pip install azure-cognitiveservices-speech

如果使用的是 macOS 且你遇到安装问题,则可能需要先运行此命令。If you're on macOS and run into install issues, you may need to run this command first.

python3 -m pip install --upgrade pip

安装语音 SDK 后,在脚本顶部包含以下 import 语句。After the Speech SDK is installed, include the following import statements at the top of your script.

from azure.cognitiveservices.speech import AudioDataStream, SpeechConfig, SpeechSynthesizer, SpeechSynthesisOutputFormat
from azure.cognitiveservices.speech.audio import AudioOutputConfig

创建语音配置Create a speech configuration

若要使用语音 SDK 调用语音服务,需要创建 SpeechConfigTo call the Speech service using the Speech SDK, you need to create a SpeechConfig. 此类包含有关你的订阅的信息,例如你的密钥和关联的区域、终结点、主机或授权令牌。This class includes information about your subscription, like your key and associated region, endpoint, host, or authorization token.

备注

无论你是要执行语音识别、语音合成、翻译,还是意向识别,都需要创建一个配置。Regardless of whether you're performing speech recognition, speech synthesis, translation, or intent recognition, you'll always create a configuration.

可以通过以下几种方法初始化 SpeechConfigThere are a few ways that you can initialize a SpeechConfig:

  • 使用订阅:传入密钥和关联的区域。With a subscription: pass in a key and the associated region.
  • 使用终结点:传入语音服务终结点。With an endpoint: pass in a Speech service endpoint. 密钥或授权令牌是可选的。A key or authorization token is optional.
  • 使用主机:传入主机地址。With a host: pass in a host address. 密钥或授权令牌是可选的。A key or authorization token is optional.
  • 使用授权令牌:传入授权令牌和关联的区域。With an authorization token: pass in an authorization token and the associated region.

在此示例中,你将使用订阅密钥和区域创建一个 SpeechConfigIn this example, you create a SpeechConfig using a subscription key and region. 请查看区域支持页,找到你的区域标识符。See the region support page to find your region identifier.

speech_config = SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
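
上面列出的其他初始化方式对应于同一构造函数的关键字参数。The alternative initializations listed above map onto keyword arguments of the same constructor. The sketch below is illustrative only; placeholder values such as "YourServiceEndpoint", "YourHostAddress", and "YourAuthorizationToken" are assumptions you replace with your own values.

from azure.cognitiveservices.speech import SpeechConfig

# With an endpoint: a key (or authorization token) is optional.
config_from_endpoint = SpeechConfig(endpoint="YourServiceEndpoint", subscription="YourSubscriptionKey")

# With a host address: a key (or authorization token) is optional.
config_from_host = SpeechConfig(host="YourHostAddress", subscription="YourSubscriptionKey")

# With an authorization token and the associated region.
config_from_token = SpeechConfig(auth_token="YourAuthorizationToken", region="YourServiceRegion")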

将语音合成到文件中Synthesize speech to a file

接下来,创建一个 SpeechSynthesizer 对象,用于执行文本到语音的转换,并将转换结果输出到扬声器、文件或其他输出流。Next, you create a SpeechSynthesizer object, which executes text-to-speech conversions and outputs to speakers, files, or other output streams. SpeechSynthesizer 接受的参数包括上一步骤创建的 SpeechConfig 对象,以及用于指定如何处理输出结果的 AudioOutputConfig 对象。The SpeechSynthesizer accepts as params the SpeechConfig object created in the previous step, and an AudioOutputConfig object that specifies how output results should be handled.

若要开始,请创建一个 AudioOutputConfig,以使用 filename 构造函数参数自动将输出写入到 .wav 文件。To start, create an AudioOutputConfig to automatically write the output to a .wav file, using the filename constructor param.

audio_config = AudioOutputConfig(filename="path/to/write/file.wav")

接下来,通过将 speech_config 对象和 audio_config 对象作为参数传递来实例化 SpeechSynthesizerNext, instantiate a SpeechSynthesizer by passing your speech_config object and the audio_config object as params. 然后,只需结合一个文本字符串运行 speak_text_async(),就能执行语音合成和写入文件的操作。Then, executing speech synthesis and writing to a file is as simple as running speak_text_async() with a string of text.

synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
synthesizer.speak_text_async("A simple test to write to a file.")

运行程序,合成的 .wav 文件会写入指定的位置。Run the program, and a synthesized .wav file is written to the location you specified. 这只是最基本用法的一个很好示例。接下来,你将了解如何自定义输出,并将输出响应作为适用于自定义方案的内存中流进行处理。This is a good example of the most basic usage, but next you look at customizing output and handling the output response as an in-memory stream for working with custom scenarios.
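
如果想在使用文件之前确认合成是否成功,可以检查调用返回的结果对象。If you want to verify that synthesis succeeded before relying on the file, you can inspect the result object returned by the call. The following is a minimal sketch; it uses the SDK's ResultReason values and the result's cancellation_details property, and assumes the same speech_config and audio_config created above.

from azure.cognitiveservices.speech import ResultReason

result = synthesizer.speak_text_async("A simple test to write to a file.").get()
if result.reason == ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed and the audio was written to the file.")
elif result.reason == ResultReason.Canceled:
    # cancellation_details explains why the request was canceled (for example, an invalid key).
    details = result.cancellation_details
    print("Synthesis canceled: {}".format(details.reason))
    if details.error_details:
        print("Error details: {}".format(details.error_details))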

合成到扬声器输出Synthesize to speaker output

在某些情况下,你可能希望将合成的语音直接输出到扬声器。In some cases, you may want to output synthesized speech directly to a speaker. 为此,可使用上一部分中的示例,但需要通过删除 filename 参数来更改 AudioOutputConfig,并设置 use_default_speaker=True。To do this, use the example in the previous section, but change the AudioOutputConfig by removing the filename param, and set use_default_speaker=True. 这会将语音输出到当前处于活动状态的输出设备。This outputs to the current active output device.

audio_config = AudioOutputConfig(use_default_speaker=True)

获取内存中流形式的结果Get result as an in-memory stream

在许多语音应用程序开发方案中,你可能需要内存中流形式的最终音频数据,而不是直接写入文件。For many scenarios in speech application development, you likely need the resulting audio data as an in-memory stream rather than directly writing to a file. 这样便可以构建自定义行为,包括:This will allow you to build custom behavior including:

  • 将生成的字节数组抽象为可搜寻的流,供自定义下游服务使用。Abstract the resulting byte array as a seek-able stream for custom downstream services.
  • 将结果与其他 API 或服务相集成。Integrate the result with other APIs or services.
  • 修改音频数据、写入自定义 .wav 标头,等等。Modify the audio data, write custom .wav headers, etc.

可以轻松地在前一个示例的基础上进行此项更改。It's simple to make this change from the previous example. 首先删除 AudioOutputConfig,因为从现在起,你将手动管理输出行为,以提高控制度。First, remove the AudioOutputConfig, as you will manage the output behavior manually from this point onward for increased control. 然后在 SpeechSynthesizer 构造函数中为 audio_config 参数传递 None。Then pass None for the audio_config parameter in the SpeechSynthesizer constructor.

备注

如果为 audio_config 传递 None,而不是像在前面的扬声器输出示例中那样提供 AudioOutputConfig,则默认不会在当前处于活动状态的输出设备上播放音频。Passing None for audio_config, rather than providing an AudioOutputConfig as in the speaker output example above, will not play the audio by default on the currently active output device.

这一次,请将结果保存到 SpeechSynthesisResult 变量。This time, you save the result to a SpeechSynthesisResult variable. audio_data 属性包含输出数据的 bytes 对象。The audio_data property contains a bytes object of the output data. 可以手动使用此对象,也可以使用 AudioDataStream 类来管理内存中流。You can work with this object manually, or you can use the AudioDataStream class to manage the in-memory stream. 此示例使用 AudioDataStream 构造函数获取结果中的流。In this example you use the AudioDataStream constructor to get a stream from the result.

synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = synthesizer.speak_text_async("Getting the response as an in-memory stream.").get()
stream = AudioDataStream(result)

在此处,可以使用生成的 stream 对象来实现任何自定义行为。From here you can implement any custom behavior using the resulting stream object.
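
例如,可以将流分块读入自己的缓冲区,也可以直接处理完整的 bytes 对象。For example, you can read the stream incrementally into a buffer of your own, or work with the complete bytes object directly. The sketch below continues from the snippet above; the 16000-byte buffer size is an arbitrary choice for illustration.

# result.audio_data holds the complete output as a bytes object.
print("Total audio size: {} bytes".format(len(result.audio_data)))

# Read the stream chunk by chunk into a fixed-size buffer.
audio_buffer = bytes(16000)
total_read = 0
filled_size = stream.read_data(audio_buffer)
while filled_size > 0:
    total_read += filled_size
    # Forward audio_buffer[:filled_size] to a downstream consumer here.
    filled_size = stream.read_data(audio_buffer)
print("Read {} bytes from the stream.".format(total_read))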

自定义音频格式Customize audio format

以下部分介绍如何自定义音频输出属性,包括:The following section shows how to customize audio output attributes including:

  • 音频文件类型Audio file type
  • 采样率Sample-rate
  • 位深度Bit-depth

若要更改音频格式,请对 SpeechConfig 对象使用 set_speech_synthesis_output_format() 函数。To change the audio format, you use the set_speech_synthesis_output_format() function on the SpeechConfig object. 此函数需要 SpeechSynthesisOutputFormat 类型的 enum,用于选择输出格式。This function expects an enum of type SpeechSynthesisOutputFormat, which you use to select the output format. 请参阅参考文档,获取可用的音频格式列表See the reference docs for a list of audio formats that are available.

可根据要求对不同的文件类型使用不同的选项。There are various options for different file types depending on your requirements. 请注意,根据定义,Raw24Khz16BitMonoPcm 等原始格式不包括音频标头。Note that by definition, raw formats like Raw24Khz16BitMonoPcm do not include audio headers. 仅当你知道下游实现可以解码原始位流,或者你打算基于位深度、采样率、通道数等属性手动生成标头时,才使用原始格式。Use raw formats only when you know your downstream implementation can decode a raw bitstream, or if you plan on manually building headers based on bit-depth, sample-rate, number of channels, etc.

此示例通过对 SpeechConfig 对象设置 SpeechSynthesisOutputFormat 来指定高保真 RIFF 格式 Riff24Khz16BitMonoPcmIn this example, you specify a high-fidelity RIFF format Riff24Khz16BitMonoPcm by setting the SpeechSynthesisOutputFormat on the SpeechConfig object. 类似于上一部分中的示例,可以使用 AudioDataStream 获取结果的内存中流,然后将其写入文件。Similar to the example in the previous section, you use AudioDataStream to get an in-memory stream of the result, and then write it to a file.

speech_config.set_speech_synthesis_output_format(SpeechSynthesisOutputFormat["Riff24Khz16BitMonoPcm"])
synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)

result = synthesizer.speak_text_async("Customizing audio output format.").get()
stream = AudioDataStream(result)
stream.save_to_wav_file("path/to/write/file.wav")

再次运行程序会将自定义的 .wav 文件写入指定的路径。Running your program again will write a customized .wav file to the specified path.
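
如果想确认写入文件的格式,可以使用标准库的 wave 模块读取 RIFF 标头。If you want to confirm the format of the file that was written, the standard library's wave module can read the RIFF header. This is a small sketch under the assumption that the file path matches the one used above; the expected values correspond to Riff24Khz16BitMonoPcm.

import wave

with wave.open("path/to/write/file.wav", "rb") as wav_file:
    print("Channels: {}".format(wav_file.getnchannels()))            # expected: 1 (mono)
    print("Sample width: {} bytes".format(wav_file.getsampwidth()))  # expected: 2 (16-bit)
    print("Sample rate: {} Hz".format(wav_file.getframerate()))      # expected: 24000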

使用 SSML 自定义语音特征Use SSML to customize speech characteristics

借助语音合成标记语言 (SSML),可以通过从 XML 架构中提交请求,来微调文本转语音输出的音调、发音、语速、音量等特征。Speech Synthesis Markup Language (SSML) allows you to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output by submitting your requests from an XML schema. 本部分将演示一些实际用法示例,但如果你需要更详细的指导,请参阅 SSML 操作指南文章。This section shows a few practical usage examples, but for a more detailed guide, see the SSML how-to article.

若要开始使用 SSML 进行自定义,请做出一项切换语音的简单更改。To start using SSML for customization, you make a simple change that switches the voice. 首先,在根项目目录中为 SSML 配置创建一个新的 XML 文件,在本示例中为 ssml.xmlFirst, create a new XML file for the SSML config in your root project directory, in this example ssml.xml. 根元素始终是 <speak>。将文本包装在 <voice> 元素中可以使用 name 参数来更改语音。The root element is always <speak>, and wrapping the text in a <voice> element allows you to change the voice using the name param. 本示例将语音更改为英式英语男声语音。This example changes the voice to a male English (UK) voice. 请注意,此语音是标准语音,其定价和可用性与神经语音不同。 Note that this voice is a standard voice, which has different pricing and availability than neural voices. 查看受支持标准语音的完整列表See the full list of supported standard voices.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    When you're on the motorway, it's a good idea to use a sat-nav.
  </voice>
</speak>

接下来,需要更改语音合成请求以引用 XML 文件。Next, you need to change the speech synthesis request to reference your XML file. 该请求基本上保持不变,只不过需要使用 speak_ssml_async() 而不是 speak_text_async() 函数。The request is mostly the same, but instead of using the speak_text_async() function, you use speak_ssml_async(). 此函数需要 XML 字符串,因此,请先读取字符串形式的 SSML 配置。This function expects an XML string, so you first read your SSML config as a string. 在此处,结果对象与前面的示例完全相同。From here, the result object is exactly the same as previous examples.

备注

如果 ssml_string 的开头包含字节顺序标记 (BOM),则需要将其去除,否则服务会返回错误。If your ssml_string contains a byte order mark (BOM) at the beginning of the string, you need to strip it off or the service will return an error. 为此,可按如下所示设置 encoding 参数:open("ssml.xml", "r", encoding="utf-8-sig")。You do this by setting the encoding parameter as follows: open("ssml.xml", "r", encoding="utf-8-sig").

synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)

ssml_string = open("ssml.xml", "r").read()
result = synthesizer.speak_ssml_async(ssml_string).get()

stream = AudioDataStream(result)
stream.save_to_wav_file("path/to/write/file.wav")
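
如果编辑器在保存 ssml.xml 时带有 UTF-8 字节顺序标记,可使用下面的变体。If your editor saves ssml.xml with a UTF-8 byte order mark, the variant below applies the encoding="utf-8-sig" approach from the note above so the marker is stripped before the request is sent.

# encoding="utf-8-sig" removes a leading BOM, if present, while reading the file.
with open("ssml.xml", "r", encoding="utf-8-sig") as ssml_file:
    ssml_string = ssml_file.read()

result = synthesizer.speak_ssml_async(ssml_string).get()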

输出正常,但可以做出几项简单的附加更改,使语音输出听起来更自然。The output works, but there are a few simple additional changes you can make to help it sound more natural. 整体语速稍有点快,因此,我们将添加一个 <prosody> 标记,并将语速降至默认语速的 90%。The overall speaking speed is a little too fast, so we'll add a <prosody> tag and reduce the speed to 90% of the default rate. 此外,句子中逗号后面的停顿稍有点短,听起来不太自然。Additionally, the pause after the comma in the sentence is a little too short and unnatural sounding. 若要解决此问题,可添加一个 <break> 标记来延迟语音,然后将时间参数设置为 200ms。To fix this issue, add a <break> tag to delay the speech, and set the time param to 200ms. 重新运行合成,以查看这些自定义操作对输出的影响。Re-run the synthesis to see how these customizations affect the output.

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-GB-George-Apollo">
    <prosody rate="0.9">
      When you're on the motorway,<break time="200ms"/> it's a good idea to use a sat-nav.
    </prosody>
  </voice>
</speak>

神经语音Neural voices

神经语音是立足于深度神经网络的语音合成算法。Neural voices are speech synthesis algorithms powered by deep neural networks. 使用神经语音时,几乎无法将合成的语音与人类录音区分开来。When using a neural voice, synthesized speech is nearly indistinguishable from the human recordings. 随着类人的自然韵律和字词的清晰发音,用户在与 AI 系统交互时,神经语音显著减轻了听力疲劳。With the human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when users interact with AI systems.

若要切换到某种神经语音,请将 name 更改为神经语音选项之一。To switch to a neural voice, change the name to one of the neural voice options. 然后,为 mstts 添加 XML 命名空间,并在 <mstts:express-as> 标记中包装文本。Then, add an XML namespace for mstts, and wrap your text in the <mstts:express-as> tag. 使用 style 参数自定义讲话风格。Use the style param to customize the speaking style. 此示例使用 cheerful,但请尝试将其设置为 customerservicechat,以了解讲话风格的差别。This example uses cheerful, but try setting it to customerservice or chat to see the difference in speaking style.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      This is awesome!
    </mstts:express-as>
  </voice>
</speak>
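
如果只需选择某种神经语音而不需要 <mstts:express-as> 风格设置,也可以直接在配置上设置语音,而无需发送 SSML。If you only need to pick a neural voice and don't need the <mstts:express-as> styling, recent versions of the SDK also let you set the voice directly on the configuration instead of sending SSML. A minimal sketch, assuming the speech_synthesis_voice_name property is available in your SDK version:

from azure.cognitiveservices.speech import SpeechConfig, SpeechSynthesizer

speech_config = SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
# All subsequent speak_text_async calls on this synthesizer use the neural voice below.
speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"

synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = synthesizer.speak_text_async("This is awesome!").get()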

其他语言和平台支持Additional language and platform support

如果上面没有介绍你偏好的编程语言,也不必担心,我们在 GitHub 上提供了其他代码示例。If your preferred programming language isn't covered above, don't worry, we have additional code samples available on GitHub. 使用下表查找适用于编程语言和平台/OS 组合的相应示例。Use the following table to find the right sample for your programming language and platform/OS combination.

语言Language        代码示例Code samples
C#                  .NET Framework, .NET Core, UWP, Unity, Xamarin
C++                 Quickstarts, Samples
Java                Android, JRE
JavaScript          Browser
Node.js             Windows, Linux, macOS
Objective-C         iOS, macOS
Python              Windows, Linux, macOS
Swift               iOS, macOS

后续步骤Next steps