Create a Custom Voice

In Prepare data for Custom Voice, we described the data types you can use to train a custom voice and their format requirements. Once your data is prepared, you can upload it through the Custom Voice portal or through the Custom Voice training API. This article describes the steps to train a custom voice through the portal.

Note

This page assumes you have read Get started with Custom Voice and Prepare data for Custom Voice, and have created a Custom Voice project.

Check the languages supported for custom voice: language for customization.

Upload your datasets

When you're ready to upload your data, go to the Custom Voice portal. Create or select a Custom Voice project. The project must have the same language/locale and gender properties as the data you intend to use for your voice training. For example, select en-GB if your audio recordings are in English with a UK accent.

Go to the Data tab and click Upload data. In the wizard, select the data type that matches what you have prepared.

Each dataset you upload must meet the requirements for the data type that you choose. It is important to correctly format your data before it's uploaded, so that the Custom Voice service can process it accurately. Go to Prepare data for Custom Voice and make sure your data is correctly formatted.

Note

Free subscription (F0) users can upload two datasets simultaneously. Standard subscription (S0) users can upload five datasets simultaneously. If you reach the limit, wait until at least one of your datasets finishes importing, and then try again.

Note

The maximum number of datasets allowed to be imported per subscription is 10 .zip files for free subscription (F0) users and 500 for standard subscription (S0) users.

Datasets are automatically validated once you hit the upload button. Data validation includes a series of checks on the audio files to verify their file format, size, and sampling rate. If there are any errors, fix them and submit again. When the data-importing request is successfully initiated, you should see an entry in the data table that corresponds to the dataset you've just uploaded.
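Because failed validations have to be fixed and resubmitted, it can save time to screen your recordings locally before uploading. The following is a minimal Python sketch under assumed thresholds (mono, 16-bit PCM, at least 16 kHz, recordings in a local recordings/ folder); these values and the folder name are for illustration only, not the service's official requirements, which are listed in Prepare data for Custom Voice.

```python
# Local pre-check of .wav files before upload. The service validates file
# format, size, and sampling rate on its side; the thresholds below are
# assumptions for illustration only.
import wave
from pathlib import Path

MIN_SAMPLE_RATE = 16000          # assumed minimum sampling rate in Hz
MAX_FILE_BYTES = 1 << 30         # assumed per-file size ceiling, adjust as needed

def check_wav(path: Path) -> list:
    """Return a list of human-readable problems found for one .wav file."""
    problems = []
    if path.stat().st_size > MAX_FILE_BYTES:
        problems.append(f"file larger than {MAX_FILE_BYTES} bytes")
    try:
        with wave.open(str(path), "rb") as wav:
            if wav.getnchannels() != 1:
                problems.append(f"{wav.getnchannels()} channels (expected mono)")
            if wav.getsampwidth() != 2:
                problems.append(f"{8 * wav.getsampwidth()}-bit samples (expected 16-bit PCM)")
            if wav.getframerate() < MIN_SAMPLE_RATE:
                problems.append(f"sample rate {wav.getframerate()} Hz (below {MIN_SAMPLE_RATE} Hz)")
    except wave.Error as err:
        problems.append(f"not a readable PCM .wav file: {err}")
    return problems

for wav_path in sorted(Path("recordings").glob("*.wav")):
    issues = check_wav(wav_path)
    if issues:
        print(f"{wav_path.name}: " + "; ".join(issues))
```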

The following table shows the processing states for imported datasets:

State       Meaning
Processing  Your dataset has been received and is being processed.
Succeeded   Your dataset has been validated and can now be used to build a voice model.
Failed      Your dataset failed during processing, for example because of file errors, data problems, or network issues.

After validation is complete, you can see the total number of matched utterances for each of your datasets in the Utterances column. If the data type you selected requires long-audio segmentation, this column only reflects the utterances we have segmented for you, either based on your transcripts or through the speech transcription service. You can also download the validated dataset to view the detailed results of the successfully imported utterances and their mapped transcripts. Note that long-audio segmentation can take more than an hour to complete.

For en-US and zh-CN datasets, you can also download a report to check the pronunciation score and the noise level for each of your recordings. The pronunciation score ranges from 0 to 100. A score below 70 normally indicates a speech error or a script mismatch. A heavy accent can reduce your pronunciation score and affect the generated digital voice.

A higher signal-to-noise ratio (SNR) indicates lower noise in your audio. You can typically reach an SNR of 50+ by recording in a professional studio. Audio with an SNR below 20 can result in obvious noise in your generated voice.

Consider re-recording any utterances with low pronunciation scores or poor signal-to-noise ratios. If you can't re-record, you might exclude those utterances from your dataset.
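If you want a rough local sense of recording noise before you upload, or for locales where the report is not available, you can estimate the SNR of a file yourself. This is a coarse sketch, not the metric the service reports: it frames the signal, treats the quietest frames as the noise floor and the loudest frames as speech, and takes 20·log10 of their ratio. The file name is a placeholder, and the code assumes 16-bit mono PCM .wav input.

```python
# Rough per-file SNR estimate: frame the signal, take low-percentile frame
# energy as the noise floor and high-percentile energy as the speech level.
# This is a screening heuristic only, not the service's SNR measurement.
import wave
import numpy as np

def estimate_snr_db(path: str, frame_ms: int = 30) -> float:
    # Assumes 16-bit mono PCM .wav input.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float64)
    frame_len = max(1, int(rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        raise ValueError("file shorter than one analysis frame")
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_rms = np.percentile(rms, 10)    # quiet frames ~ noise floor
    speech_rms = np.percentile(rms, 90)   # loud frames ~ speech level
    return 20.0 * np.log10(speech_rms / noise_rms)

snr = estimate_snr_db("recordings/utterance_0001.wav")  # placeholder path
print(f"estimated SNR: {snr:.1f} dB")  # the article suggests staying well above 20 dB
```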

Build your custom voice model

After your dataset has been validated, you can use it to build your custom voice model.

  1. Navigate to Text-to-Speech > Custom Voice > [name of project] > Training.

  2. Click Train model.

  3. Next, enter a Name and Description to help you identify this model.

    Choose a name carefully. The name you enter here is the name you use to specify the voice in your request for speech synthesis, as part of the SSML input (a minimal sketch follows this list). Only letters, numbers, and a few punctuation characters such as hyphens, underscores, and commas are allowed. Use different names for different voice models.

    A common use of the Description field is to record the names of the datasets that were used to create the model.

  4. From the Select training data page, choose one or more datasets that you would like to use for training. Check the number of utterances before you submit them. You can start with any number of utterances for en-US and zh-CN voice models. For other locales, you must select more than 2,000 utterances to be able to train a voice.

    Note

    Duplicate audio names will be removed from the training. Make sure the datasets you select do not contain the same audio names across multiple .zip files.

    Tip

    Using datasets from the same speaker is required for quality results. When the datasets you submit for training contain fewer than 6,000 distinct utterances in total, your voice model is trained with the Statistical Parametric Synthesis technique. When your training data exceeds 6,000 distinct utterances, training uses the Concatenation Synthesis technique, which normally produces more natural, higher-fidelity results. Contact the Custom Voice team if you want to train a model with the latest Neural TTS technology, which can produce a digital voice equivalent to the publicly available neural voices.

  5. Click Train to begin creating your voice model.
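The name chosen in step 3 is the name you later reference when you request synthesis. As a minimal sketch of building such an SSML body in Python, with MyCustomVoice standing in for whatever name you gave your model:

```python
# Build an SSML body that references the custom voice by the model name
# chosen in step 3. "MyCustomVoice" is a placeholder, not a real model name.
voice_name = "MyCustomVoice"
ssml = (
    "<speak version='1.0' xml:lang='en-US'>"
    f"<voice name='{voice_name}'>"
    "This is my custom voice speaking."
    "</voice>"
    "</speak>"
)
print(ssml)
```

A request sketch that sends a body like this to a deployed endpoint appears at the end of the endpoint section below.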

The Training table displays a new entry that corresponds to this newly created model. The table also displays the status: Processing, Succeeded, Failed.

The status that's shown reflects the process of converting your dataset to a voice model, as shown here.

State       Meaning
Processing  Your voice model is being created.
Succeeded   Your voice model has been created and can be deployed.
Failed      Your voice model failed during training, for example because of unseen data problems or network issues.

Training time varies depending on the volume of audio data processed. Typical times range from about 30 minutes for hundreds of utterances to 40 hours for 20,000 utterances. Once your model training succeeds, you can start to test it.

Note

Free subscription (F0) users can train one voice font at a time. Standard subscription (S0) users can train three voices simultaneously. If you reach the limit, wait until at least one of your voice fonts finishes training, and then try again.

Note

The maximum number of voice models allowed to be trained per subscription is 10 models for free subscription (F0) users and 100 for standard subscription (S0) users.

If you are using the neural voice training capability, you can choose to train a model optimized for real-time streaming scenarios, or an HD neural model optimized for asynchronous long-audio synthesis.

Test your voice model

After your voice font is successfully built, you can test it before deploying it for use.

  1. Navigate to Text-to-Speech > Custom Voice > [name of project] > Testing.

  2. Click Add test.

  3. Select one or more models that you would like to test.

  4. Provide the text you want the voice(s) to speak. If you select multiple models to test at one time, the same text is used for testing each model.

    Note

    The language of your text must be the same as the language of your voice font. Only successfully trained models can be tested. Only plain text is supported in this step.

  5. Click Create.

Once you have submitted your test request, you are returned to the test page. The table now includes an entry that corresponds to your new request, along with a status column. It can take a few minutes to synthesize speech. When the status column says Succeeded, you can play the audio, or download the text input (a .txt file) and the audio output (a .wav file) and audition the audio for quality.

You can also find the test results on the detail page of each model you selected for testing. Go to the Training tab, and click the model name to open the model detail page.

Create and use a custom voice endpoint

After you've successfully created and tested your voice model, you deploy it in a custom Text-to-Speech endpoint. You then use this endpoint in place of the usual endpoint when making Text-to-Speech requests through the REST API. Your custom endpoint can be called only by the subscription that you have used to deploy the font.

To create a new custom voice endpoint, go to Text-to-Speech > Custom Voice > Deployment. Select Add endpoint and enter a Name and Description for your custom endpoint. Then select the custom voice model you would like to associate with this endpoint.

After you click the Add button, you will see an entry for your new endpoint in the endpoint table. It may take a few minutes to instantiate a new endpoint. When the deployment status is Succeeded, the endpoint is ready for use.

Note

Free subscription (F0) users can have only one model deployed. Standard subscription (S0) users can create up to 50 endpoints, each with its own custom voice.

Note

To use your custom voice, you must specify the voice model name, use the custom URI directly in an HTTP request, and use the same subscription to authenticate with the TTS service.

After your endpoint is deployed, the endpoint name appears as a link. Click the link to display information specific to your endpoint, such as the endpoint key, endpoint URL, and sample code.

Online testing of the endpoint is also available via the Custom Voice portal. To test your endpoint, choose Check endpoint on the Endpoint detail page. The endpoint testing page appears. Enter the text to be spoken (in either plain text or SSML format) in the text box. To hear the text spoken in your custom voice font, select Play. This testing feature is charged against your custom speech synthesis usage.

The custom endpoint is functionally identical to the standard endpoint that's used for text-to-speech requests. See REST API for more information.
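As a rough sketch of what such a request can look like from Python, assuming the token-based authentication flow of the Speech service: the region, subscription key, endpoint URL, and voice name below are placeholders that you replace with the values shown on your endpoint detail page and the model name you trained. Treat this as a sketch of the general shape of the call, and defer to the sample code shown in the portal for the authoritative version.

```python
# Sketch of calling a deployed custom voice endpoint over REST.
# REGION, SUBSCRIPTION_KEY, ENDPOINT_URL, and VOICE_NAME are placeholders;
# copy the real values from the endpoint detail page in the portal.
import requests

REGION = "westus"                                  # region of your Speech subscription
SUBSCRIPTION_KEY = "<your-subscription-key>"
ENDPOINT_URL = "<endpoint URL from the portal>"    # custom text-to-speech endpoint
VOICE_NAME = "MyCustomVoice"                       # the model name you trained

# 1) Exchange the subscription key for a short-lived access token.
token_resp = requests.post(
    f"https://{REGION}.api.cognitive.microsoft.com/sts/v1.0/issueToken",
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
)
token_resp.raise_for_status()
token = token_resp.text

# 2) Send SSML that names your custom voice; the response body is audio.
ssml = (
    "<speak version='1.0' xml:lang='en-US'>"
    f"<voice name='{VOICE_NAME}'>Hello from my custom voice.</voice>"
    "</speak>"
)
audio_resp = requests.post(
    ENDPOINT_URL,
    data=ssml.encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "custom-voice-sample",
    },
)
audio_resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(audio_resp.content)
```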

Next steps