Speech to Text frequently asked questions

If you can't find answers to your questions in this FAQ, check out other support options.

General

Q: What is the difference between a baseline model and a custom Speech to Text model?

A: A baseline model has been trained by using Microsoft-owned data and is already deployed in the cloud. You can use a custom model to adapt a model to better fit a specific environment that has specific ambient noise or language. Factory floors, cars, or noisy streets would require an adapted acoustic model. Topics like biology, physics, radiology, product names, and custom acronyms would require an adapted language model. If you train a custom model, you should start with related text to improve the recognition of special terms and phrases.

Q: Where do I start if I want to use a baseline model?

A: First, get a subscription key. If you want to make REST calls to the predeployed baseline models, see the REST APIs. If you want to use WebSockets, download the SDK.
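As a rough sketch of the REST path, the helper below only assembles the URL and headers for a recognition call and sends nothing. It assumes the short-audio REST endpoint and the Ocp-Apim-Subscription-Key header; the function name, region, and key are placeholders, and the audio format in Content-Type is an assumption. Check the REST API reference for the values that apply to your resource.

```python
from urllib.parse import urlencode


def build_recognition_request(region, subscription_key, language="en-US", detailed=False):
    """Return (url, headers) for a short-audio recognition request.

    Assumption: the endpoint path and header names follow the Speech to Text
    REST API for short audio. Verify them against the REST API reference.
    """
    query = urlencode({
        "language": language,
        "format": "detailed" if detailed else "simple",
    })
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        # Assumed audio format: 16 kHz PCM WAV.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return url, headers
```

The returned pair can be handed to any HTTP client together with the audio bytes as the request body.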

Q: Do I always need to build a custom speech model?

A: No. If your application uses generic, day-to-day language, you don't need to customize a model. If your application is used in an environment where there's little or no background noise, you don't need to customize a model.

You can deploy baseline and customized models in the portal and then run accuracy tests against them. You can use this feature to measure the accuracy of a baseline model versus a custom model.

Q: How will I know when processing for my dataset or model is complete?

A: Currently, the status of the model or dataset in the table is the only way to know. When the processing is complete, the status is Succeeded.

Q: Can I create more than one model?

A: There's no limit on the number of models you can have in your collection.

Q: I realized I made a mistake. How do I cancel a data import or model creation that's in progress?

A: Currently, you can't roll back an acoustic or language adaptation process. You can delete imported data and models when they're in a terminal state.

Q: I get several results for each phrase with the detailed output format. Which one should I use?

A: Always take the first result, even if another result ("N-Best") might have a higher confidence value. The Speech service considers the first result to be the best. It can also be an empty string if no speech was recognized.

The other results are likely worse and might not have full capitalization and punctuation applied. These results are most useful in special scenarios such as giving users the option to pick corrections from a list or handling incorrectly recognized commands.
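The rule above can be sketched as a small parser. It assumes the detailed response is JSON with an "NBest" list whose entries carry "Display" and "Confidence" fields; the function names are illustrative only.

```python
import json


def best_display_text(response_json):
    """Return the display text of the first N-Best entry, or "" if there is none.

    Assumption: the detailed response has an "NBest" list whose entries
    contain "Display" and "Confidence" fields.
    """
    n_best = json.loads(response_json).get("NBest") or []
    if not n_best:
        return ""
    # Take the first entry as-is; do not re-sort by confidence.
    return n_best[0].get("Display", "")


def alternatives(response_json):
    """Return the display text of the remaining entries, e.g. for a correction list."""
    n_best = json.loads(response_json).get("NBest") or []
    return [entry.get("Display", "") for entry in n_best[1:]]
```

Keeping the alternatives in a separate helper makes it harder to promote a lower-ranked entry by accident just because its confidence value is higher.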

Q: Why are there different base models?

A: You can choose from more than one base model in the Speech service. Each model name contains the date when it was added. When you start training a custom model, use the latest model to get the best accuracy. Older base models remain available for some time after a new model is released. You can continue using the model that you have worked with until it is retired (see Model lifecycle). We still recommend switching to the latest base model for better accuracy.

Q: Can I update my existing model (model stacking)?

A: You can't update an existing model. As a solution, combine the old dataset with the new dataset and readapt.

The old dataset and the new dataset must be combined in a single .zip file (for acoustic data) or in a .txt file (for language data). When adaptation is finished, the new, updated model needs to be redeployed to obtain a new endpoint.
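For language data, combining the old and new datasets can be as simple as concatenating the text files. The sketch below assumes one utterance per line and UTF-8 encoding; the function name and file paths are hypothetical. It covers only the .txt case and does not attempt to merge acoustic .zip datasets.

```python
from pathlib import Path


def merge_text_datasets(sources, destination):
    """Concatenate language-data .txt files into one file.

    Assumption: each source holds one utterance per line, UTF-8 encoded.
    Blank lines are dropped; repeated lines are kept on purpose.
    """
    with open(destination, "w", encoding="utf-8") as out:
        for source in sources:
            for line in Path(source).read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line.strip() + "\n")
```

Duplicates are kept because repeated utterances may have been added deliberately to make common queries more prominent.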

Q: When a new version of a base model is available, is my deployment automatically updated?

A: Deployments are NOT automatically updated.

If you have adapted and deployed a model, that deployment remains as is. You can decommission the deployed model, readapt using the newer version of the base model, and redeploy for better accuracy.

Both base models and custom models are retired after some time (see Model lifecycle).

Q: Can I copy or move my datasets, models, and deployments to another region or subscription?

A: You can use the REST API to copy a custom model to another region or subscription. Datasets and deployments cannot be copied. You can import a dataset again in another subscription and create endpoints there using the model copies.

Q: Are my requests logged?

A: By default, requests are not logged (neither audio nor transcription). If necessary, you can select the Log content from this endpoint option when you create a custom endpoint. You can also enable audio logging in the Speech SDK on a per-request basis without creating a custom endpoint. In both cases, the audio and recognition results of requests are stored in secure storage. For subscriptions that use Microsoft-owned storage, they are available for 30 days.

You can export the logged files on the deployment page in Speech Studio if you use a custom endpoint with Log content from this endpoint enabled. If audio logging is enabled via the SDK, call the API to access the files.

Q: Are my requests throttled?

A: See Speech Services Quotas and Limits.

Q: How am I charged for dual channel audio?

A: If you submit each channel separately (each channel in its own file), you are charged for the duration of each file. If you submit a single file with the channels multiplexed together, you are charged for the duration of that single file. For details on pricing, see the Azure Cognitive Services pricing page.
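The arithmetic behind this answer is small enough to write down. The function names and the durations in the usage note are illustrative; actual rates come from the pricing page.

```python
def billed_seconds_separate(channel_durations):
    """Each channel is its own file, so every file's duration is billed."""
    return sum(channel_durations)


def billed_seconds_multiplexed(file_duration):
    """One file carries all channels, so only that file's duration is billed."""
    return file_duration
```

For a five-minute two-channel call, billed_seconds_separate([300, 300]) gives 600 seconds, while billed_seconds_multiplexed(300) gives 300 seconds.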

Important

If you have further privacy concerns that prohibit you from using the custom Speech service, contact one of the support channels.

Increasing concurrency

See Speech Services Quotas and Limits.

Importing data

Q: What is the limit on the size of a dataset, and why is it the limit?

A: The limit is due to the restriction on the size of a file for HTTP upload. See Speech Services Quotas and Limits for the actual limit. You can split your data into multiple datasets and select all of them to train the model.
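One way to split text data is to group lines into chunks that each stay under a byte budget. This is a sketch only: max_bytes is a placeholder that you would set from the quotas page, and the function name is hypothetical.

```python
def split_lines_by_size(lines, max_bytes):
    """Group lines into chunks whose UTF-8 size stays at or under max_bytes.

    max_bytes is a placeholder; take the real limit from the quotas page.
    Each line is counted with a trailing newline.
    """
    chunks, current, current_size = [], [], 0
    for line in lines:
        size = len((line + "\n").encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single line exceeds max_bytes")
        if current and current_size + size > max_bytes:
            # The current chunk is full; start a new one.
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be written to its own file and uploaded as a separate dataset.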

Q: Can I zip my text files so I can upload a larger text file?

A: No. Currently, only uncompressed text files are allowed.

Q: The data report says there were failed utterances. What is the issue?

A: Failing to upload 100 percent of the utterances in a file is not a problem. If the vast majority of the utterances in an acoustic or language dataset (for example, more than 95 percent) are successfully imported, the dataset is usable. However, we recommend that you try to understand why the utterances failed and fix the problems. Most common problems, such as formatting errors, are easy to fix.
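A quick check of the import ratio might look like the following. The 95 percent threshold is only the example figure from the answer above, not a documented cutoff, and the function names are hypothetical.

```python
def import_success_rate(imported, total):
    """Fraction of utterances that were imported successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    return imported / total


def dataset_is_usable(imported, total, threshold=0.95):
    """True when more than `threshold` of the utterances were imported.

    The default threshold mirrors the example in the text; adjust as needed.
    """
    return import_success_rate(imported, total) > threshold
```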

Creating an acoustic model

Q: How much acoustic data do I need?

A: We recommend starting with between 30 minutes and one hour of acoustic data.
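To see how much audio you have collected so far, you can total the duration of your recordings. This sketch assumes the recordings are uncompressed WAV files readable by Python's standard wave module; the function names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Duration of one WAV file, computed from frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_minutes(paths):
    """Total duration of all files, in minutes."""
    return sum(wav_duration_seconds(p) for p in paths) / 60
```

A total between 30 and 60 minutes falls inside the recommended starting range.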

Q: What data should I collect?

A: Collect data that's as close to the application scenario and use case as possible. The data collection should match the target application and users in terms of device or devices, environments, and types of speakers. In general, you should collect data from as broad a range of speakers as possible.

Q: How should I collect acoustic data?

A: You can create a standalone data collection application or use off-the-shelf audio recording software. You can also create a version of your application that logs the audio data and then uses the data.

Q: Do I need to transcribe adaptation data myself?

A: Yes. You can transcribe it yourself or use a professional transcription service. Some users prefer professional transcribers and others use crowdsourcing or do the transcriptions themselves.

Q: How long will it take to train a custom model with audio data?

A: Training a model with audio data can be a lengthy process. Depending on the amount of data, it can take several days to create a custom model. If training cannot be finished within one week, the service might abort the training operation and report the model as failed.

Some base models cannot be customized with audio data. For these models, the service uses only the text of the transcription for training and ignores the audio data. Training then finishes much faster, and the results are the same as training with text only. See Language support for a list of base models that support training with audio data.

Accuracy testing

Q: What is word error rate (WER) and how is it computed?

A: WER is the evaluation metric for speech recognition. WER is computed as the total number of errors, which includes insertions, deletions, and substitutions, divided by the total number of words in the reference transcription. For more information, see Evaluate Custom Speech accuracy.
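The definition above, WER = (insertions + deletions + substitutions) / words in the reference, can be sketched with a standard word-level edit distance. This is an illustration, not the service's own scoring code: it splits on whitespace and applies no text normalization, which real evaluations usually do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")
    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For the reference "a b c d" and the hypothesis "a x c", there is one substitution and one deletion, so the result is 2 / 4 = 0.5. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.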

Q: How do I determine whether the results of an accuracy test are good?

A: The results show a comparison between the baseline model and the model you customized. You should aim to beat the baseline model to make customization worthwhile.

Q: How do I determine the WER of a base model so I can see if there was an improvement?

A: The offline test results show the baseline accuracy of the custom model and the improvement over baseline.

Creating a language model

Q: How much text data do I need to upload?

A: It depends on how different the vocabulary and phrases used in your application are from the starting language models. For all new words, it's useful to provide as many examples as possible of the usage of those words. For common phrases that are used in your application, including those phrases in the language data is also useful because it tells the system to listen for these terms. It's common to have at least 100, and typically several hundred or more, utterances in the language dataset. Also, if some types of queries are expected to be more common than others, you can insert multiple copies of the common queries in the dataset.
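Repeating common queries can be done mechanically when you assemble the dataset. In this sketch the function name, the phrases, and the copy counts are made up for illustration; the output is a list of lines ready to be written to a .txt file.

```python
def build_language_dataset(phrase_copies):
    """Expand a {phrase: copies} mapping into dataset lines.

    Phrases expected to be more common get more copies, so they appear
    more often in the resulting dataset.
    """
    lines = []
    for phrase, copies in phrase_copies.items():
        if copies < 1:
            raise ValueError(f"copies must be at least 1 for {phrase!r}")
        lines.extend([phrase.strip()] * copies)
    return lines
```

For example, build_language_dataset({"check order status": 3, "cancel my subscription": 1}) yields four lines, three of which are the more common query.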

Q: Can I just upload a list of words?

A: Uploading a list of words adds the words to the vocabulary, but it doesn't teach the system how the words are typically used. By providing full or partial utterances (sentences or phrases that users are likely to say), the language model can learn the new words and how they are used. The custom language model is good not only for adding new words to the system, but also for adjusting the likelihood of known words for your application. Providing full utterances helps the system learn better.

Next steps