Prepare data for Custom Speech

When testing the accuracy of Microsoft speech recognition or training your custom models, you'll need audio and text data. This page covers the types of data a custom speech model needs.

Data diversity

Text and audio used to test and train a custom model need to include samples from the diverse set of speakers and scenarios you need your model to recognize. Consider these factors when gathering data for custom model testing and training:

  • Your text and speech audio data need to cover the kinds of verbal statements your users will make when interacting with your model. For example, a model that raises and lowers the temperature needs training on statements people might make to request such changes.
  • Your data need to include all speech variations your model will need to recognize. Many factors can vary speech, including accents, dialects, language-mixing, age, gender, voice pitch, stress level, and time of day.
  • You must include samples from the different environments (indoor, outdoor, road noise) where your model will be used.
  • Audio must be gathered using the hardware devices the production system will use. If your model needs to identify speech recorded on devices of varying quality, the audio data you provide to train your model must also represent these diverse scenarios.
  • You can add more data to your model later, but take care to keep the dataset diverse and representative of your project needs.
  • Including data that falls outside your custom model's recognition needs can harm recognition quality overall, so do not include data that your model does not need to transcribe.

A model trained on a subset of scenarios can only perform well in those scenarios. Carefully choose data that represents the full scope of scenarios you need your custom model to recognize.

Start with small sets of sample data that match the language and acoustics your model will encounter. For example, record a small but representative sample of audio on the same hardware and in the same acoustic environment your model will find in production scenarios. Small datasets of representative data can expose problems before you have invested in gathering a much larger dataset for training.

Data types

This table lists accepted data types, when each data type should be used, and the recommended quantity. Not every data type is required to create a model. Data requirements vary depending on whether you're creating a test or training a model.

Data type | Used for testing | Recommended quantity | Used for training | Recommended quantity
Audio | Yes (used for visual inspection) | 5+ audio files | No | N/A
Audio + human-labeled transcripts | Yes (used to evaluate accuracy) | 0.5-5 hours of audio | Yes | 1-1,000 hours of audio
Related text | No | N/A | Yes | 1-200 MB of related text

Files should be grouped by type into a dataset and uploaded as a .zip file. Each dataset can only contain a single data type.

To get started quickly, consider using sample data. See this GitHub repository for sample Custom Speech data.

Upload data

To upload your data, navigate to the Custom Speech portal. From the portal, click Upload data to launch the wizard and create your first dataset. You'll be asked to select a speech data type for your dataset before you can upload your data.

Each dataset you upload must meet the requirements for the data type that you choose. Your data must be correctly formatted before it's uploaded. Correctly formatted data ensures it will be accurately processed by the Custom Speech service. Requirements are listed in the following sections.

After your dataset is uploaded, you have a few options:

  • You can navigate to the Testing tab and visually inspect audio-only or audio + human-labeled transcription data.
  • You can navigate to the Training tab and use audio + human transcription data or related text data to train a custom model.

Audio data for testing

Audio data is optimal for testing the accuracy of Microsoft's baseline speech-to-text model or a custom model. Keep in mind that audio data is used to inspect the accuracy of speech with regard to a specific model's performance. If you're looking to quantify the accuracy of a model, use audio + human-labeled transcription data.

Use this table to ensure that your audio files are formatted correctly for use with Custom Speech:

Property | Value
File format | RIFF (WAV)
Sample rate | 8,000 Hz or 16,000 Hz
Channels | 1 (mono)
Maximum length per audio | 2 hours
Sample format | PCM, 16-bit
Archive format | .zip
Maximum archive size | 2 GB
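
The per-file properties in this table can be checked before upload. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` and the wording of the messages are illustrative assumptions, not part of the Custom Speech service.

```python
import wave

ALLOWED_SAMPLE_RATES = {8000, 16000}   # Hz, from the table above
MAX_SECONDS = 2 * 60 * 60              # 2 hours per audio file


def check_wav(path):
    """Return a list of problems found in one WAV file (empty if none)."""
    try:
        with wave.open(path, "rb") as wav:
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()   # bytes per sample
            rate = wav.getframerate()
            seconds = wav.getnframes() / rate if rate else 0.0
    except (wave.Error, EOFError) as exc:
        # The wave module only reads uncompressed PCM RIFF files.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sample rate {rate} Hz is not 8,000 or 16,000 Hz")
    if channels != 1:
        problems.append(f"{channels} channels found, expected 1 (mono)")
    if sample_width != 2:
        problems.append(f"{8 * sample_width}-bit samples found, expected 16-bit")
    if seconds > MAX_SECONDS:
        problems.append(f"{seconds:.0f} s is longer than 2 hours")
    return problems
```

Calling `check_wav("speech01.wav")` on a conforming file should return an empty list; anything else lists what to fix before packaging.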

The default audio streaming format is WAV (16 kHz or 8 kHz, 16-bit, mono PCM). In addition to WAV/PCM, the compressed input formats listed below are also supported. Additional configuration is needed to enable these formats.

  • MP3
  • ALAW in WAV container
  • MULAW in WAV container


When uploading training and testing data, the .zip file size cannot exceed 2 GB. If you require more data for training, divide it into several .zip files and upload them separately. Later, you can choose to train from multiple datasets. However, you can only test from a single dataset.
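
One way to divide a large training set is to group files greedily until the next file would push a group past the limit. This is a sketch only: `plan_archives` is a hypothetical helper, it treats 2 GB as 2 × 1024³ bytes (an assumption), and it uses the uncompressed file sizes as an approximation of the final archive size, so it is worth leaving some headroom.

```python
import os

MAX_ARCHIVE_BYTES = 2 * 1024**3   # 2 GB limit; the exact byte count is an assumption


def plan_archives(paths, limit=MAX_ARCHIVE_BYTES, size_of=os.path.getsize):
    """Greedily group files so each group's total size stays within limit."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = size_of(path)
        if size > limit:
            raise ValueError(f"{path} alone exceeds the archive limit")
        if current and current_size + size > limit:
            # Close the current group and start a new one.
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned group can then be written to its own .zip file and uploaded as a separate training dataset.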

Use SoX to verify audio properties or convert existing audio to the appropriate formats. The following examples show how each of these activities can be done through the SoX command line:

Activity | Description | SoX command
Check audio format | Use this command to check the audio file format. | sox --i <filename>
Convert audio format | Use this command to convert the audio file to single channel, 16-bit, 16 kHz. | sox <input> -b 16 -e signed-integer -c 1 -r 16k -t wav <output>.wav
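
When many files need converting, the conversion command above can be driven from a script. The sketch below only wraps the exact SoX arguments shown in the table; the helper names `sox_convert_command` and `convert` are hypothetical, and it assumes the `sox` executable is installed and on the PATH.

```python
import shutil
import subprocess


def sox_convert_command(src, dst, rate="16k"):
    """Build the SoX argument list from the 'Convert audio format' row."""
    return ["sox", src, "-b", "16", "-e", "signed-integer",
            "-c", "1", "-r", rate, "-t", "wav", dst]


def convert(src, dst):
    """Run SoX if it is installed; raise if it is missing or the conversion fails."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox was not found on PATH")
    subprocess.run(sox_convert_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.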

Audio + human-labeled transcript data for testing/training

To measure the accuracy of Microsoft's speech-to-text when processing your audio files, you must provide human-labeled transcriptions (word-by-word) for comparison. While human-labeled transcription is often time-consuming, it's necessary for evaluating accuracy and for training the model for your use cases. Keep in mind that the improvements in recognition will only be as good as the data provided. For that reason, it's important to upload only high-quality transcripts.

Audio files can have silence at the beginning and end of the recording. If possible, include at least a half-second of silence before and after speech in each sample file. While audio with low recording volume or disruptive background noise is not helpful, it should not hurt your custom model. Always consider upgrading your microphones and signal processing hardware before gathering audio samples.
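
To see whether a recording has roughly half a second of silence at each end, the samples can be scanned for the first and last one above a quiet threshold. This is a rough sketch for 16-bit mono PCM files; the function name and the `threshold` value are assumptions chosen for illustration and would need tuning for real recordings.

```python
import array
import sys
import wave


def edge_silence_seconds(path, threshold=500):
    """Return (leading, trailing) silence in seconds for a 16-bit mono WAV.

    `threshold` is a hypothetical amplitude cutoff (out of 32767) below
    which a sample counts as silence.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()   # WAV sample data is little-endian
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        # The whole file is below the threshold.
        total = len(samples) / rate
        return total, total
    leading = loud[0] / rate
    trailing = (len(samples) - 1 - loud[-1]) / rate
    return leading, trailing
```

Files where either value is well under 0.5 seconds are candidates for re-recording or padding.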

Property | Value
File format | RIFF (WAV)
Sample rate | 8,000 Hz or 16,000 Hz
Channels | 1 (mono)
Maximum length per audio | 2 hours (testing) / 60 s (training)
Sample format | PCM, 16-bit
Archive format | .zip
Maximum zip size | 2 GB

The default audio streaming format is WAV (16 kHz or 8 kHz, 16-bit, mono PCM). In addition to WAV/PCM, the compressed input formats listed below are also supported. Additional configuration is needed to enable these formats.

  • MP3
  • ALAW in WAV container
  • MULAW in WAV container


When uploading training and testing data, the .zip file size cannot exceed 2 GB. You can only test from a single dataset, so be sure to keep it within the appropriate file size. Additionally, each training file cannot exceed 60 seconds; otherwise, the upload will fail with an error.
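
Because a single over-length file causes an error, it can help to measure durations before building a training archive. A minimal sketch with the standard `wave` module follows; `wav_seconds` and `too_long_for_training` are hypothetical helper names.

```python
import wave

MAX_TRAINING_SECONDS = 60   # per-file limit for training audio


def wav_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def too_long_for_training(paths, limit=MAX_TRAINING_SECONDS):
    """Return {path: seconds} for every file longer than `limit`."""
    durations = {path: wav_seconds(path) for path in paths}
    return {path: secs for path, secs in durations.items() if secs > limit}
```

Any file reported here would need to be split into shorter segments, each with its own transcript line, before it is used for training.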

To address issues like word deletion or substitution, a significant amount of data is required to improve recognition. Generally, it's recommended to provide word-by-word transcriptions for roughly 10 to 1,000 hours of audio. The transcriptions for all WAV files should be contained in a single plain-text file. Each line of the transcription file should contain the name of one of the audio files, followed by the corresponding transcription. The file name and transcription should be separated by a tab (\t).

For example:

  speech01.wav  speech recognition is awesome
  speech02.wav  the quick brown fox jumped all over the place
  speech03.wav  the lazy dog was not amused


Transcriptions should be encoded as UTF-8 with a byte order mark (BOM).
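
The transcript format described above (one audio file name, a tab, then the transcription, saved as UTF-8 with a BOM) can be produced with Python's `utf-8-sig` codec, which writes the BOM automatically. The function name `write_transcript` is a hypothetical choice for this sketch.

```python
def write_transcript(path, entries):
    """Write (audio_file_name, transcription) pairs as tab-separated lines.

    The "utf-8-sig" codec prepends the UTF-8 byte order mark.
    """
    with open(path, "w", encoding="utf-8-sig") as out:
        for name, text in entries:
            # A stray tab or newline would corrupt the one-line-per-file layout.
            if "\t" in name or "\t" in text or "\n" in name or "\n" in text:
                raise ValueError(f"tab or newline in entry for {name!r}")
            out.write(f"{name}\t{text}\n")
```

For example, `write_transcript("trans.txt", [("speech01.wav", "speech recognition is awesome")])` writes the first line of the sample above.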

The transcriptions are text-normalized so they can be processed by the system. However, some important normalizations must be done before uploading the data to Speech Studio. For the appropriate language to use when you prepare your transcriptions, see How to create a human-labeled transcription.

After you've gathered your audio files and corresponding transcriptions, package them as a single .zip file before uploading to the Custom Speech portal. Below is an example dataset with three audio files and a human-labeled transcription file:

[Image: Select audio from the Speech Portal]
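
Packaging the audio files with their transcript file could be scripted as follows. This sketch uses the standard `zipfile` module and first checks that every audio file has a line in the transcript. The function name `package_dataset` and the flat layout inside the archive (no subfolders) are assumptions.

```python
import os
import zipfile


def package_dataset(zip_path, audio_paths, transcript_path):
    """Zip the audio files together with their tab-separated transcript file."""
    with open(transcript_path, encoding="utf-8-sig") as fh:
        # The first tab-separated field on each line is the audio file name.
        transcribed = {line.split("\t", 1)[0].strip() for line in fh if line.strip()}
    missing = [p for p in audio_paths if os.path.basename(p) not in transcribed]
    if missing:
        raise ValueError(f"no transcript line for: {', '.join(missing)}")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in audio_paths:
            archive.write(path, arcname=os.path.basename(path))
        archive.write(transcript_path, arcname=os.path.basename(transcript_path))
    return zip_path
```

The early check is a design choice: it is cheaper to catch a missing transcript line locally than after an upload.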

Related text data for training

Unique product names or features should be covered by related text data for training. Related text helps ensure correct recognition. Two types of related text data can be provided to improve recognition:

Data type | How this data improves recognition
Sentences (utterances) | Improve accuracy when recognizing product names or industry-specific vocabulary within the context of a sentence.
Pronunciations | Improve pronunciation of uncommon terms, acronyms, or other words with undefined pronunciations.

Sentences can be provided as a single text file or multiple text files. To improve accuracy, use text data that is closer to the expected spoken utterances. Pronunciations should be provided as a single text file. Everything can be packaged as a single .zip file and uploaded to the Custom Speech portal.

Guidelines to create a sentences file

To create a custom model using sentences, you'll need to provide a list of sample utterances. Utterances do not need to be complete or grammatically correct, but they must accurately reflect the spoken input you expect in production. If you want certain terms to have increased weight, add several sentences that include these specific terms.

As general guidance, model adaptation is most effective when the training text is as close as possible to the real text expected in production. Include the domain-specific jargon and phrases you want to enhance in the training text. When possible, keep each sentence or keyword on its own line. For keywords and phrases that are important to you (for example, product names), you can copy them a few times. Keep in mind, though, that copying them too many times could affect the overall recognition rate.

Use this table to ensure that your related data file for utterances is formatted correctly:

Property | Value
Text encoding | UTF-8 BOM
# of utterances per line | 1
Maximum file size | 200 MB

Additionally, you'll want to account for the following restrictions:

  • Avoid repeating characters more than four times. For example: "aaaa" or "uuuu".
  • Don't use special characters or UTF-8 characters above U+00A1.
  • URIs will be rejected.
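
These restrictions can be checked line by line before upload. The sketch below is a heuristic, not the service's own validation: the URI pattern is an assumption, the vaguer "special characters" rule is not covered, and `max_repeat` is a parameter because the text says "more than four times" while its examples show exactly four characters.

```python
import re

# Heuristic URI detector (an assumption): "scheme://..." or "www....".
URI_PATTERN = re.compile(r"\b[a-z][a-z0-9+.-]*://\S+|\bwww\.\S+", re.IGNORECASE)


def check_utterance(line, max_repeat=4):
    """Return the restrictions that one utterance line appears to violate."""
    problems = []
    # A character followed by max_repeat or more copies of itself,
    # that is, repeated more than max_repeat times in a row.
    if re.search(r"(.)\1{%d,}" % max_repeat, line):
        problems.append("repeated character")
    if any(ord(ch) > 0x00A1 for ch in line):
        problems.append("character above U+00A1")
    if URI_PATTERN.search(line):
        problems.append("URI")
    return problems
```

Passing `max_repeat=3` applies the stricter reading in which "aaaa" itself is flagged.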

Guidelines to create a pronunciation file

If there are uncommon terms without standard pronunciations that your users will encounter or use, you can provide a custom pronunciation file to improve recognition.

Using custom pronunciation files to alter the pronunciation of common words is not recommended.

The following table shows examples of spoken utterances and a custom pronunciation for each:

Recognized/displayed form | Spoken form
3CPO | three c p o
CNTK | c n t k
IEEE | i triple e

The spoken form is the phonetic sequence spelled out. It can be composed of letters, words, syllables, or a combination of all three.

Customized pronunciation is available in English (en-US) and German (de-DE). This table shows supported characters by language:

Language | Locale | Characters
English | en-US | a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
German | de-DE | ä, ö, ü, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z

Use the following table to ensure that your related data file for pronunciations is correctly formatted. Pronunciation files are small and should only be a few kilobytes in size.

Property | Value
Text encoding | UTF-8 BOM (ANSI is also supported for English)
# of pronunciations per line | 1
Maximum file size | 1 MB (1 KB for free tier)
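
The supported-character table lends itself to a small check of each pronunciation entry. This sketch rests on assumptions that this page does not state: that each line holds the displayed form and the spoken form separated by a tab, that spaces are allowed between tokens of the spoken form (as in "three c p o"), and that only the lowercase letters listed in the table are accepted. The function name is hypothetical.

```python
SUPPORTED = {
    "en-US": set("abcdefghijklmnopqrstuvwxyz"),
    "de-DE": set("abcdefghijklmnopqrstuvwxyzäöü"),
}


def check_pronunciation_line(line, locale="en-US"):
    """Return problems found in one 'displayed<TAB>spoken' line (assumed layout)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2 or not all(part.strip() for part in parts):
        return ["expected 'displayed form<TAB>spoken form'"]
    spoken = parts[1]
    # Collect every character outside the locale's table, ignoring spaces.
    bad = sorted({ch for ch in spoken if ch != " " and ch not in SUPPORTED[locale]})
    if bad:
        return [f"unsupported characters in spoken form: {''.join(bad)}"]
    return []
```

Under these assumptions, `check_pronunciation_line("IEEE\ti triple e")` returns an empty list, while an uppercase spoken form is reported.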

Next steps