Prepare data for Custom Speech

When you test the accuracy of Microsoft speech recognition or train your custom models, you need audio and text data. This page covers the types of data, how to use them, and how to manage them.

Data types

This table lists the accepted data types, when each data type should be used, and the recommended quantity. Not every data type is required to create a model. Data requirements vary depending on whether you're creating a test or training a model.

| Data type | Used for testing | Recommended quantity | Used for training | Recommended quantity |
|---|---|---|---|---|
| Audio | Yes (used for visual inspection) | 5+ audio files | No | N/A |
| Audio + human-labeled transcripts | Yes (used to evaluate accuracy) | 0.5-5 hours of audio | Yes | 1-1,000 hours of audio |
| Related text | No | N/A | Yes | 1-200 MB of related text |

Files should be grouped by type into a dataset and uploaded as a .zip file. Each dataset can contain only a single data type.

To get started quickly, consider using sample data. See this GitHub repository for sample Custom Speech data.

Upload data

To upload your data, navigate to the Custom Speech portal. From the portal, click Upload data to launch the wizard and create your first dataset. Before you can upload your data, you'll be asked to select a speech data type for your dataset.


Each dataset you upload must meet the requirements for the data type that you choose. Your data must be correctly formatted before it's uploaded; correctly formatted data ensures that the Custom Speech service processes it accurately. Requirements are listed in the following sections.

After your dataset is uploaded, you have a few options:

  • You can navigate to the Testing tab and visually inspect audio-only data or audio + human-labeled transcription data.
  • You can navigate to the Training tab and use audio + human-labeled transcription data or related text data to train a custom model.

Audio data for testing

Audio data is optimal for testing the accuracy of Microsoft's baseline speech-to-text model or a custom model. Keep in mind that audio data is used to inspect the accuracy of speech with regard to a specific model's performance. If you want to quantify the accuracy of a model, use audio + human-labeled transcription data.

Use this table to ensure that your audio files are formatted correctly for use with Custom Speech:

| Property | Value |
|---|---|
| File format | RIFF (WAV) |
| Sample rate | 8,000 Hz or 16,000 Hz |
| Channels | 1 (mono) |
| Maximum length per audio | 2 hours |
| Sample format | PCM, 16-bit |
| Archive format | .zip |
| Maximum archive size | 2 GB |


When uploading training and testing data, the .zip file size cannot exceed 2 GB. If you require more data for training, divide it into several .zip files and upload them separately. Later, you can choose to train from multiple datasets. However, you can test from only a single dataset.
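Dividing training data into several .zip files can be planned before any archive is built. The following is a minimal sketch, not part of the Custom Speech tooling: the function name `plan_batches` is hypothetical, 2 GB is assumed to mean 2 × 1024³ bytes, and uncompressed file sizes are used as an estimate of the archive size.

```python
from typing import Iterable, List, Tuple

# Assumption: "2 GB" is interpreted as 2 * 1024**3 bytes.
MAX_ARCHIVE_BYTES = 2 * 1024**3


def plan_batches(files: Iterable[Tuple[str, int]],
                 limit: int = MAX_ARCHIVE_BYTES) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs so that the summed
    size of each group stays within `limit`. Input order is preserved."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the archive limit")
        # Start a new batch when adding this file would cross the limit.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned group can then be written to its own .zip file and uploaded as a separate dataset. Because the .zip format adds a small per-file overhead, passing a `limit` somewhat below the real maximum leaves a safety margin.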

Use SoX to verify audio properties or convert existing audio to the appropriate formats. The following examples show how to do each of these activities through the SoX command line:

| Activity | Description | SoX command |
|---|---|---|
| Check audio format | Use this command to check the audio file format. | sox --i <filename> |
| Convert audio format | Use this command to convert the audio file to single channel, 16-bit, 16 kHz. | sox <input> -b 16 -e signed-integer -c 1 -r 16k -t wav <output>.wav |
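When many files need checking, the same properties can also be verified in a script. The sketch below uses Python's standard `wave` module rather than SoX; the function name `check_wav` is hypothetical, and the limits are taken from the audio format table above. Note that `wave.open` raises `wave.Error` for files that are not uncompressed PCM WAV, which is itself a sign that the file needs converting.

```python
import wave

ALLOWED_SAMPLE_RATES = {8000, 16000}   # Hz, from the format table
MAX_SECONDS = 2 * 60 * 60              # 2 hours per audio file


def check_wav(path: str, max_seconds: float = MAX_SECONDS) -> list:
    """Return a list of problems found in the WAV file at `path`.
    An empty list means the file matches the format table."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        if rate not in ALLOWED_SAMPLE_RATES:
            problems.append(f"sample rate is {rate} Hz")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
        duration = wav.getnframes() / rate
        if duration > max_seconds:
            problems.append(f"duration is {duration:.0f} s, limit is {max_seconds:.0f} s")
    return problems
```

For training data with human-labeled transcripts, where the limit per audio file is 60 seconds, the same function could be called with `max_seconds=60`.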

Audio + human-labeled transcript data for testing/training

To measure the accuracy of Microsoft's speech-to-text when processing your audio files, you must provide human-labeled transcriptions (word-by-word) for comparison. While human-labeled transcription is often time consuming, it's necessary for evaluating accuracy and for training the model for your use cases. Keep in mind that the improvements in recognition will only be as good as the data provided. For that reason, it's important to upload only high-quality transcripts.

| Property | Value |
|---|---|
| File format | RIFF (WAV) |
| Sample rate | 8,000 Hz or 16,000 Hz |
| Channels | 1 (mono) |
| Maximum length per audio | 2 hours (testing) / 60 s (training) |
| Sample format | PCM, 16-bit |
| Archive format | .zip |
| Maximum zip size | 2 GB |


When uploading training and testing data, the .zip file size cannot exceed 2 GB. Because you can test from only a single dataset, be sure to keep it within the appropriate file size.

To address issues like word deletion or substitution, a significant amount of data is required to improve recognition. Generally, it's recommended to provide word-by-word transcriptions for roughly 10 to 1,000 hours of audio. The transcriptions for all WAV files should be contained in a single plain-text file. Each line of the transcription file should contain the name of one of the audio files, followed by the corresponding transcription. The file name and transcription should be separated by a tab (\t).

For example:

  speech01.wav  speech recognition is awesome
  speech02.wav  the quick brown fox jumped all over the place
  speech03.wav  the lazy dog was not amused


Transcriptions should be encoded as UTF-8 with a byte order mark (BOM).
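The tab-separated layout and the BOM requirement can be produced together in a few lines. The following is a minimal sketch: the function name `write_transcript_file` and the output file name `transcripts.txt` are hypothetical, and Python's `utf-8-sig` codec is used because it writes the UTF-8 byte order mark at the start of the file.

```python
from typing import Iterable, Tuple


def write_transcript_file(path: str, entries: Iterable[Tuple[str, str]]) -> None:
    """Write one '<audio file name><TAB><transcription>' line per entry,
    encoded as UTF-8 with a leading byte order mark."""
    with open(path, "w", encoding="utf-8-sig", newline="\n") as out:
        for wav_name, text in entries:
            # A stray tab or newline would break the one-line-per-file layout.
            if "\t" in wav_name or "\t" in text or "\n" in text:
                raise ValueError(f"tab or newline found in entry for {wav_name!r}")
            out.write(f"{wav_name}\t{text}\n")


if __name__ == "__main__":
    write_transcript_file("transcripts.txt", [
        ("speech01.wav", "speech recognition is awesome"),
        ("speech02.wav", "the quick brown fox jumped all over the place"),
        ("speech03.wav", "the lazy dog was not amused"),
    ])
```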

The transcriptions are text-normalized so they can be processed by the system. However, some important normalizations must be done before you upload the data to Speech Studio. For the appropriate language to use when you prepare your transcriptions, see How to create a human-labeled transcription.

After you've gathered your audio files and corresponding transcriptions, package them as a single .zip file before uploading to the Custom Speech portal. Below is an example dataset with three audio files and a human-labeled transcription file:

[Image: Select audio from the Speech Portal]

If your product names or features are unique, include related text data for training. Related text helps ensure correct recognition. Two types of related text data can be provided to improve recognition:

| Data type | How this data improves recognition |
|---|---|
| Sentences (utterances) | Improve accuracy when recognizing product names or industry-specific vocabulary within the context of a sentence. |
| Pronunciations | Improve pronunciation of uncommon terms, acronyms, or other words with undefined pronunciations. |

Sentences can be provided as a single text file or as multiple text files. To improve accuracy, use text data that is close to the expected spoken utterances. Pronunciations should be provided as a single text file. Everything can be packaged as a single .zip file and uploaded to the Custom Speech portal.

Guidelines to create a sentences file

To create a custom model using sentences, you'll need to provide a list of sample utterances. Utterances do not need to be complete or grammatically correct, but they must accurately reflect the spoken input you expect in production. If you want certain terms to have increased weight, add several sentences that include these specific terms.

As general guidance, model adaptation is most effective when the training text is as close as possible to the real text expected in production. Include in the training text the domain-specific jargon and phrases that you want to enhance. When possible, keep each sentence or keyword on its own line. For keywords and phrases that are important to you (for example, product names), you can copy them a few times. But don't copy them too many times, because that could affect the overall recognition rate.

Use this table to ensure that your related data file for utterances is formatted correctly:

| Property | Value |
|---|---|
| Text encoding | UTF-8 BOM |
| # of utterances per line | 1 |
| Maximum file size | 200 MB |

Additionally, you'll want to account for the following restrictions:

  • Avoid repeating characters more than four times. For example: "aaaa" or "uuuu".
  • Don't use special characters or UTF-8 characters above U+00A1.
  • URIs will be rejected.
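The restrictions above can be checked line by line before a sentences file is uploaded. The sketch below is an illustration only, not the service's actual validation: the function name `check_utterance` is hypothetical, a run of four or more identical characters is flagged (an assumption based on the "aaaa" example), only the code point limit is checked for the character rule, and the URI pattern is a simple heuristic that looks for `scheme://`.

```python
import re

# Assumption: following the "aaaa" example, flag four or more identical characters in a row.
REPEATED_RUN = re.compile(r"(.)\1{3,}")
# Heuristic only: matches text of the form scheme://something.
URI_LIKE = re.compile(r"\b[a-zA-Z][a-zA-Z0-9+.-]*://\S+")
MAX_CODE_POINT = 0x00A1


def check_utterance(line: str) -> list:
    """Return the names of the restrictions that `line` appears to violate."""
    problems = []
    if REPEATED_RUN.search(line):
        problems.append("repeated characters")
    if any(ord(ch) > MAX_CODE_POINT for ch in line):
        problems.append("character above U+00A1")
    if URI_LIKE.search(line):
        problems.append("URI")
    return problems
```

Running this over every line of a sentences file and reporting the line numbers with a non-empty result gives a quick pre-upload review list.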

Guidelines to create a pronunciation file

If there are uncommon terms without standard pronunciations that your users will encounter or use, you can provide a custom pronunciation file to improve recognition.

Using custom pronunciation files to alter the pronunciation of common words is not recommended.

The following table shows examples of a recognized/displayed form and the custom pronunciation (spoken form) for each:

| Recognized/displayed form | Spoken form |
|---|---|
| 3CPO | three c p o |
| CNTK | c n t k |
| IEEE | i triple e |

The spoken form is the phonetic sequence spelled out. It can be composed of letters, words, syllables, or a combination of all three.

Customized pronunciation is available in English (en-US) and German (de-DE). This table shows supported characters by language:

| Language | Locale | Characters |
|---|---|---|
| English | en-US | a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z |
| German | de-DE | ä, ö, ü, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z |

Use the following table to ensure that your related data file for pronunciations is correctly formatted. Pronunciation files are small and should be only a few kilobytes in size.

| Property | Value |
|---|---|
| Text encoding | UTF-8 BOM (ANSI is also supported for English) |
| # of pronunciations per line | 1 |
| Maximum file size | 1 MB (1 KB for free tier) |
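The supported-character table above lends itself to a quick check of each spoken form before the pronunciation file is assembled. This is a minimal sketch with a hypothetical function name, `invalid_spoken_chars`; it assumes that spaces between letters, words, or syllables are acceptable, as in the examples "three c p o" and "i triple e".

```python
import string

# Character sets copied from the supported-characters table.
ALLOWED_SPOKEN_CHARS = {
    "en-US": set(string.ascii_lowercase),
    "de-DE": set(string.ascii_lowercase) | set("äöü"),
}


def invalid_spoken_chars(spoken_form: str, locale: str) -> set:
    """Return the characters in `spoken_form` that are not listed for `locale`.
    Assumption: a space is allowed as a separator."""
    allowed = ALLOWED_SPOKEN_CHARS[locale] | {" "}
    return {ch for ch in spoken_form if ch not in allowed}
```

An empty set means the spoken form uses only listed characters; for example, a spoken form that still contains a digit, such as "3 c p o", would return the digit so it can be spelled out.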

Next steps