Evaluate and improve Custom Speech accuracy

In this article, you learn how to quantitatively measure and improve the accuracy of Microsoft's speech-to-text models or your own custom models. Testing accuracy requires audio + human-labeled transcription data; provide 30 minutes to 5 hours of representative audio.

Evaluate Custom Speech accuracy

The industry standard for measuring model accuracy is word error rate (WER). WER counts the number of incorrect words identified during recognition, then divides by the total number of words in the human-labeled transcript (shown below as N). That number is then multiplied by 100% to give the WER.

WER formula: WER = (I + D + S) / N × 100%

Incorrectly identified words fall into three categories:

  • Insertion (I): Words that are incorrectly added in the hypothesis transcript
  • Deletion (D): Words that are undetected in the hypothesis transcript
  • Substitution (S): Words that were substituted between reference and hypothesis

Here's an example:
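
As one worked instance, the following minimal Python sketch aligns a reference and a hypothesis transcript with a standard word-level edit distance and reports I, D, S, N, and the resulting WER. The function name `wer_counts`, the whitespace tokenization, and the sample sentences are illustrative assumptions and are not part of the Custom Speech service; dedicated scoring tools apply their own text normalization, so their numbers may differ.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Count insertions, deletions, and substitutions between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # cost[i][j] = (edits, insertions, deletions, substitutions)
    # for aligning ref[:i] with hyp[:j].
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # hypothesis is empty: all deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, j, 0, 0)  # reference is empty: all insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match: no new error
                continue
            e, ins, dels, subs = cost[i - 1][j - 1]
            substitute = (e + 1, ins, dels, subs + 1)
            e, ins, dels, subs = cost[i][j - 1]
            insert = (e + 1, ins + 1, dels, subs)
            e, ins, dels, subs = cost[i - 1][j]
            delete = (e + 1, ins, dels + 1, subs)
            # Keep the cheapest alignment; ties are broken in the order listed.
            cost[i][j] = min(substitute, insert, delete, key=lambda t: t[0])

    _, ins, dels, subs = cost[len(ref)][len(hyp)]
    n = len(ref)
    return {"I": ins, "D": dels, "S": subs, "N": n,
            "WER": 100.0 * (ins + dels + subs) / n}


result = wer_counts("the quick brown fox jumps",
                    "the quick brown box jumps over")
print(result)
```

In this sample the hypothesis substitutes "box" for "fox" and inserts "over", so S = 1, I = 1, D = 0, and WER = (1 + 0 + 1) / 5 × 100% = 40%.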


If you want to replicate WER measurements locally, you can use sclite from SCTK.

Resolve errors and improve WER

You can use the WER from the machine recognition results to evaluate the quality of the model you are using with your app, tool, or product. A WER of 5%-10% is considered good quality and ready to use. A WER of 20% is acceptable, but you may want to consider additional training. A WER of 30% or more signals poor quality and requires customization and training.
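
The quality bands above can be sketched as a small helper. The text only names 5%-10%, 20%, and 30% or more, so the exact boundaries used here (at most 10% is "good", anything below 30% is "acceptable") are an assumption made for illustration, as is the function name `wer_quality`.

```python
def wer_quality(wer_percent: float) -> str:
    """Map a WER percentage to a rough quality band (boundaries are assumed)."""
    if wer_percent < 0:
        raise ValueError("WER cannot be negative")
    if wer_percent <= 10:
        return "good: ready to use"
    if wer_percent < 30:
        return "acceptable: consider additional training"
    return "poor: requires customization and training"


for wer in (7.5, 20.0, 34.2):
    print(f"{wer:5.1f}% -> {wer_quality(wer)}")
```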

How the errors are distributed is important. Many deletion errors usually indicate weak audio signal strength; to resolve this, collect audio data closer to the source. Insertion errors suggest that the audio was recorded in a noisy environment, possibly with crosstalk, which causes recognition issues. Substitution errors are often encountered when too few samples of domain-specific terms have been provided as either human-labeled transcriptions or related text.

By analyzing individual files, you can determine what types of errors exist and which errors are unique to a specific file. Understanding issues at the file level helps you target improvements.
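
A file-level review of this kind might look like the sketch below. It assumes you already have per-file error counts (for example, exported from a scoring tool); the file names, the counts, and the helper `summarize_files` are hypothetical, and the hints simply restate the guidance above.

```python
# Hints restating the guidance in this section, keyed by error type.
HINTS = {
    "D": "deletions dominate: check for weak audio signal",
    "I": "insertions dominate: check for background noise or crosstalk",
    "S": "substitutions dominate: add domain-specific text or transcripts",
}


def summarize_files(per_file: dict) -> list:
    """Return (file, WER %, dominant error type) rows, worst file first."""
    rows = []
    for name, c in per_file.items():
        errors = c["I"] + c["D"] + c["S"]
        wer = 100.0 * errors / c["N"]
        # Dominant type is the error with the highest count (None if no errors).
        dominant = max("IDS", key=lambda k: c[k]) if errors else None
        rows.append((name, round(wer, 1), dominant))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical counts for two test files.
counts = {
    "call_01.wav": {"I": 1, "D": 6, "S": 2, "N": 60},
    "call_02.wav": {"I": 0, "D": 1, "S": 5, "N": 120},
}

for name, wer, dominant in summarize_files(counts):
    hint = HINTS.get(dominant, "no errors")
    print(f"{name}: WER {wer}% ({hint})")
```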

Create a test

If you'd like to test the quality of Microsoft's speech-to-text baseline model or a custom model that you've trained, you can compare two models side by side to evaluate accuracy. The comparison includes WER and recognition results. Typically, a custom model is compared with Microsoft's baseline model.

To evaluate models side by side:

  1. Sign in to the Custom Speech portal.
  2. Navigate to Speech-to-text > Custom Speech > [name of project] > Testing.
  3. Click Add Test.
  4. Select Evaluate accuracy. Give the test a name and description, then select your audio + human-labeled transcription dataset.
  5. Select up to two models that you'd like to test.
  6. Click Create.

After your test has been successfully created, you can compare the results side by side.

Side-by-side comparison

Once the test is complete, indicated by the status change to Succeeded, you'll find a WER number for both models included in your test. Click the test name to view the testing detail page. This detail page lists all the utterances in your dataset, showing the recognition results of the two models alongside the transcription from the submitted dataset. To help inspect the side-by-side comparison, you can toggle various error types, including insertion, deletion, and substitution. By listening to the audio and comparing the recognition results in each column (the human-labeled transcription and the results for the two speech-to-text models), you can decide which model meets your needs and where additional training and improvements are required.

Improve Custom Speech accuracy

Speech recognition scenarios vary by audio quality and language (vocabulary and speaking style). The following table examines four common scenarios:

Scenario | Audio quality | Vocabulary | Speaking style
Call center | Low, 8 kHz, could be 2 humans on 1 audio channel, could be compressed | Narrow, unique to domain and products | Conversational, loosely structured
Voice assistant (such as Cortana, or a drive-through window) | High, 16 kHz | Entity heavy (song titles, products, locations) | Clearly stated words and phrases
Dictation (instant message, notes, search) | High, 16 kHz | Varied | Note-taking
Video closed captioning | Varied, including varied microphone use, added music | Varied, from meetings, recited speech, musical lyrics | Read, prepared, or loosely structured

Different scenarios produce different quality outcomes. The following table shows how content from these four scenarios rates in word error rate (WER) and which error types are most common in each scenario.

Scenario | Speech recognition quality | Insertion errors | Deletion errors | Substitution errors
Call center | Medium (< 30% WER) | Low, except when other people talk in the background | Can be high. Call centers can be noisy, and overlapping speakers can confuse the model | Medium. Products and people's names can cause these errors
Voice assistant | High (can be < 10% WER) | Low | Low | Medium, due to song titles, product names, or locations
Dictation | High (can be < 10% WER) | Low | Low | High
Video closed captioning | Depends on video type (can be < 50% WER) | Low | Can be high due to music, noises, microphone quality | Jargon may cause these errors

Determining the components of the WER (the number of insertion, deletion, and substitution errors) helps determine what kind of data to add to improve the model. Use the Custom Speech portal to view the quality of a baseline model. The portal reports the insertion, substitution, and deletion error rates that are combined in the WER quality rate.

Improve model recognition

You can reduce recognition errors by adding training data in the Custom Speech portal.

Plan to maintain your custom model by adding source materials periodically. Your custom model needs additional training to stay aware of changes to your entities. For example, you may need to update product names, song names, or new service locations.

The following sections describe how each kind of additional training data can reduce errors.

When you train a new custom model, start by adding related text to improve the recognition of domain-specific words and phrases. Related text sentences primarily reduce substitution errors caused by misrecognition of common words and domain-specific words, because they show those words in context. Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward for them to be recognized.


Avoid related text sentences that include noise such as unrecognizable characters or words.
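
One way to screen related text before uploading it is sketched below. What counts as "noise" is an assumption here: the filter drops empty lines and lines containing the Unicode replacement character or control, format, private-use, or unassigned characters. The helper names and sample lines are hypothetical, and the check does not detect unrecognizable words.

```python
import unicodedata

# Unicode general categories treated as noise in this sketch (an assumption):
# control, format, private-use, and unassigned characters.
NOISE_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def is_clean(sentence: str) -> bool:
    """Return True if the sentence is non-empty and has no noisy characters."""
    if not sentence.strip():
        return False
    for ch in sentence:
        if ch == "\ufffd":  # replacement character left by a failed decode
            return False
        if unicodedata.category(ch) in NOISE_CATEGORIES:
            return False
    return True


def clean_related_text(lines):
    """Strip surrounding whitespace and keep only clean sentences."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if is_clean(line)]


sample = [
    "Contoso Widget Pro supports fast charging.\n",
    "Order st\ufffdtus lookup",
    "   ",
    "Reset the router\x00 now",
]
print(clean_related_text(sample))
```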

Add audio with human-labeled transcripts

Audio with human-labeled transcripts offers the greatest accuracy improvements if the audio comes from the target use case. Samples must cover the full scope of speech. For example, a call center for a retail store would get most calls about swimwear and sunglasses during summer months. Ensure that your sample includes the full scope of speech you want to detect.

Consider these details:

  • Training with audio brings the most benefit if the audio is also hard for humans to understand. In most cases, you should start training with related text only.
  • If you use one of the most heavily used languages, such as US English, there's a good chance that you don't need to train with audio data. For such languages, the base models already offer very good recognition results in most scenarios; it's probably enough to train with related text.
  • Custom Speech can only capture word context to reduce substitution errors, not insertion or deletion errors.
  • Avoid samples that include transcription errors, but do include a diversity of audio quality.
  • Avoid sentences that are not related to your problem domain. Unrelated sentences can harm your model.
  • When the quality of transcripts varies, you can duplicate exceptionally good sentences (like excellent transcriptions that include key phrases) to increase their weight.
  • The Speech service will automatically use the transcripts to improve the recognition of domain-specific words and phrases, as if they were added as related text.
  • It can take several days for a training operation to complete. To improve the speed of training, make sure to create your Speech service subscription in a region with dedicated hardware for training.
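
The duplication tip in the list above can be sketched as follows. Deciding which transcripts are "exceptionally good" is a human judgment; this sketch approximates it by checking for key phrases, which is an assumption, as are the function name `weight_transcripts`, the number of copies, and the sample data.

```python
def weight_transcripts(transcripts, key_phrases, copies=2):
    """Repeat transcripts that contain a key phrase so they carry more weight."""
    lowered = [phrase.lower() for phrase in key_phrases]
    weighted = []
    for text in transcripts:
        has_key_phrase = any(phrase in text.lower() for phrase in lowered)
        # Transcripts with a key phrase appear `copies` times, others once.
        weighted.extend([text] * (copies if has_key_phrase else 1))
    return weighted


transcripts = ["please reset my contoso router", "um hello"]
weighted = weight_transcripts(transcripts, ["Contoso Router"], copies=3)
print(weighted)
```

When the training data pairs each transcript with an audio file, the same idea applies to the pair rather than to the text alone.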


Not all base models support training with audio. If a base model does not support it, the Speech service will use only the text from the transcripts and ignore the audio. See Language support for a list of base models that support training with audio data. Even if a base model supports training with audio data, the service might use only part of the audio. It will still use all the transcripts.

Add new words with pronunciation

Words that are made-up or highly specialized may have unique pronunciations. These words can be recognized if they can be broken down into smaller words to pronounce them. For example, to recognize Xbox, pronounce it as X box. This approach will not increase overall accuracy, but it can increase recognition of these keywords.


This technique is only available for some languages at this time. See customization for pronunciation in the Speech-to-text table for details.

Sources by scenario

The following table shows voice recognition scenarios and lists source materials to consider within the three training content categories listed above.

Scenario | Related text sentences | Audio + human-labeled transcripts | New words with pronunciation
Call center | Marketing documents, website, product reviews related to call center activity | Call center calls transcribed by humans | Terms that have ambiguous pronunciations (see Xbox above)
Voice assistant | List sentences using all combinations of commands and entities | Record voices speaking commands into device, and transcribe into text | Names (movies, songs, products) that have unique pronunciations
Dictation | Written input, like instant messages or emails | Similar to above | Similar to above
Video closed captioning | TV show scripts, movies, marketing content, video summaries | Exact transcripts of videos | Similar to above
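
For the voice assistant row, listing sentences from combinations of commands and entities can be sketched with `itertools.product`. The commands, entity names, and the helper `related_sentences` are made up for illustration; real lists would come from your own product.

```python
from itertools import product


def related_sentences(commands, entities):
    """Build one sentence for every (command, entity) combination."""
    return [f"{command} {entity}" for command, entity in product(commands, entities)]


commands = ["play", "add to queue"]
entities = ["Bohemian Rhapsody", "Clair de Lune"]

for sentence in related_sentences(commands, entities):
    print(sentence)
```

With two commands and two entities this produces four sentences; the list grows multiplicatively, so large catalogs may need sampling rather than a full enumeration.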
