Evaluate and improve Custom Speech accuracy

In this article, you learn how to quantitatively measure and improve the accuracy of Microsoft's speech-to-text models or your own custom models. Testing accuracy requires audio + human-labeled transcription data; provide 30 minutes to 5 hours of representative audio.

Evaluate Custom Speech accuracy

The industry standard for measuring model accuracy is Word Error Rate (WER). WER counts the number of incorrect words identified during recognition, divides that count by the total number of words in the human-labeled transcript (shown below as N), and multiplies the result by 100% to express it as a percentage.

WER formula: WER = (I + D + S) / N × 100%

Incorrectly identified words fall into three categories:

  • Insertion (I): Words that are incorrectly added in the hypothesis transcript
  • Deletion (D): Words that are present in the reference but undetected in the hypothesis transcript
  • Substitution (S): Words in the reference that were replaced by different words in the hypothesis

Here's an example:
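A minimal sketch of the WER calculation in Python, assuming whitespace tokenization and a standard word-level Levenshtein alignment. The function name `wer_counts` and the sample sentences are illustrative assumptions, not part of the Custom Speech service.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Align hypothesis to reference and return S, D, I, N and WER (percent)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Each cell is (total_errors, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j]; only two rows are kept in memory.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # empty reference: all insertions
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]  # empty hypothesis: all deletions
        for j in range(1, len(hyp) + 1):
            t, s, d, ins = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, ins)          # words match, no new error
            else:
                diagonal = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = prev[j]
            deletion = (t + 1, s, d + 1, ins)      # reference word missing
            t, s, d, ins = curr[j - 1]
            insertion = (t + 1, s, d, ins + 1)     # extra hypothesis word
            curr.append(min(diagonal, deletion, insertion, key=lambda c: c[0]))
        prev = curr

    total, s, d, ins = prev[len(hyp)]
    n = len(ref)
    return {"S": s, "D": d, "I": ins, "N": n, "WER": 100.0 * total / n}


# Hypothetical transcripts, chosen only to show one error of each type.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog please"
result = wer_counts(reference, hypothesis)
print(result["S"], result["D"], result["I"], result["N"], f"{result['WER']:.1f}%")
```

For this pair the sketch finds one substitution ("jumps" recognized as "jumped"), one deletion (the second "the"), and one insertion ("please") against nine reference words, which gives a WER of about 33.3%. When several alignments have the same minimum cost, the split between S, D, and I can differ between implementations, although the WER itself does not.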


Resolve errors and improve WER

You can use the WER from the machine recognition results to evaluate the quality of the model you are using with your app, tool, or product. A WER of 5%-10% is considered good quality, and the model is ready to use. A WER of 20% is acceptable, but you may want to consider additional training. A WER of 30% or more signals poor quality and requires customization and training.
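These guideline values can be turned into a simple check, sketched below. The text only names 5%-10%, 20%, and 30% or more, so the exact cut-offs between bands (and treating anything at or below 10% as good) are assumptions made for illustration; `wer_quality_band` is a hypothetical helper.

```python
def wer_quality_band(wer_percent: float) -> str:
    """Map a WER percentage to a rough quality band.

    Assumed cut-offs: <= 10 is "good", 30 or more is "poor",
    everything in between is "acceptable".
    """
    if wer_percent < 0:
        raise ValueError("WER cannot be negative")
    if wer_percent <= 10:
        return "good: ready to use"
    if wer_percent < 30:
        return "acceptable: consider additional training"
    return "poor: requires customization and training"


for value in (7.5, 20.0, 34.0):
    print(value, "->", wer_quality_band(value))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, so the sketch does not cap the input.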

How the errors are distributed is important. Many deletion errors usually point to weak audio signal strength; to resolve this, collect audio data closer to the source. Insertion errors suggest that the audio was recorded in a noisy environment, possibly with crosstalk, which causes recognition issues. Substitution errors are often encountered when too few samples of domain-specific terms have been provided, either as human-labeled transcriptions or as related text.

By analyzing individual files, you can determine what types of errors exist and which errors are unique to a specific file. Understanding issues at the file level helps you target improvements.
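A file-level triage along these lines might look like the sketch below. It assumes you already have per-file counts of substitutions, deletions, and insertions (for example, exported by hand from a test); the file names, counts, and helper names are hypothetical, and the hints simply restate the guidance above.

```python
# Likely cause and remedy for each dominant error type, as described in the text.
HINTS = {
    "D": "many deletions: audio signal may be weak; collect audio closer to the source",
    "I": "many insertions: recording may be noisy or contain crosstalk",
    "S": "many substitutions: provide more domain-specific terms as "
         "human-labeled transcriptions or related text",
}


def dominant_error(counts: dict):
    """Return 'S', 'D', or 'I' for the most frequent error type, or None if there are no errors.

    Ties are broken in the order S, D, I.
    """
    errors = {kind: counts.get(kind, 0) for kind in ("S", "D", "I")}
    if not any(errors.values()):
        return None
    return max(errors, key=errors.get)


def triage(per_file: dict) -> dict:
    """Map each file name to a hint based on its dominant error type."""
    report = {}
    for name, counts in per_file.items():
        kind = dominant_error(counts)
        report[name] = HINTS[kind] if kind else "no errors"
    return report


# Hypothetical per-file error counts.
per_file = {
    "call_01.wav": {"S": 2, "D": 11, "I": 1},
    "call_02.wav": {"S": 9, "D": 0, "I": 3},
    "call_03.wav": {"S": 0, "D": 0, "I": 0},
}
report = triage(per_file)
for name, hint in report.items():
    print(f"{name}: {hint}")
```

The hints are starting points for investigation, not diagnoses; listening to the audio for a flagged file is still the way to confirm the cause.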

Create a test

To test the quality of Microsoft's speech-to-text baseline model or a custom model that you've trained, you can compare two models side by side to evaluate accuracy. The comparison includes WER and recognition results. Typically, a custom model is compared with Microsoft's baseline model.

To evaluate models side by side:

  1. Sign in to the Custom Speech portal.
  2. Navigate to Speech-to-text > Custom Speech > [name of project] > Testing.
  3. Click Add Test.
  4. Select Evaluate accuracy. Give the test a name and description, then select your audio + human-labeled transcription dataset.
  5. Select up to two models that you'd like to test.
  6. Click Create.

After your test has been successfully created, you can compare the results side by side.

Side-by-side comparison

Once the test is complete, indicated by the status changing to Succeeded, you'll find a WER number for each model included in your test. Click the test name to view the testing detail page. This page lists all the utterances in your dataset and shows the recognition results of the two models alongside the transcription from the submitted dataset. To help inspect the side-by-side comparison, you can toggle the display of each error type: insertion, deletion, and substitution. By listening to the audio and comparing the columns, which show the human-labeled transcription and the results for the two speech-to-text models, you can decide which model meets your needs and where additional training and improvements are required.
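If you copy the utterances and recognition results out of the portal, the same comparison can be reproduced offline. The sketch below assumes each row holds a reference transcript plus one hypothesis per model, and it pools errors over all utterances (total errors divided by total reference words). The column names, sample rows, and the pooling choice are assumptions for illustration, not a description of how the portal computes its numbers.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def corpus_wer(rows: list, model_key: str) -> float:
    """Pooled WER in percent for one model across all rows."""
    errors = words = 0
    for row in rows:
        ref = row["reference"].split()
        errors += edit_distance(ref, row[model_key].split())
        words += len(ref)
    if words == 0:
        raise ValueError("no reference words to score against")
    return 100.0 * errors / words


# Hypothetical utterances with results from two models.
rows = [
    {"reference": "turn on the kitchen lights",
     "baseline": "turn on the kitchen light",
     "custom": "turn on the kitchen lights"},
    {"reference": "set a timer for ten minutes",
     "baseline": "set timer for ten minutes",
     "custom": "set a timer for ten minute"},
]

for model in ("baseline", "custom"):
    print(f"{model}: {corpus_wer(rows, model):.1f}% WER")
```

With these sample rows the baseline makes two errors and the custom model one, over eleven reference words, so the custom model scores lower. Pooling weights long utterances more heavily than averaging per-utterance WER would; either choice is defensible as long as both models are scored the same way.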
