What is batch transcription?

Batch transcription is a set of REST API operations that enables you to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results.

Asynchronous speech-to-text transcription is just one of the features. You can use the batch transcription REST APIs to call the following methods:

Batch Transcription Operation | Method | REST API Call
Creates a new transcription. | POST | api/speechtotext/v2.0/transcriptions
Retrieves a list of transcriptions for the authenticated subscription. | GET | api/speechtotext/v2.0/transcriptions
Gets a list of supported locales for offline transcriptions. | GET | api/speechtotext/v2.0/transcriptions/locales
Updates the mutable details of the transcription identified by its ID. | PATCH | api/speechtotext/v2.0/transcriptions/{id}
Deletes the specified transcription task. | DELETE | api/speechtotext/v2.0/transcriptions/{id}
Gets the transcription identified by the given ID. | GET | api/speechtotext/v2.0/transcriptions/{id}
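
As an illustration of the first operation, the following minimal C# sketch posts a new transcription with a plain HttpClient. The regional host name (here westus.cris.ai), the request body values, and the response handling are assumptions to verify against the Swagger document; substitute your own region, subscription key, and SAS URI.

// Minimal sketch: create a transcription with a raw HTTP POST.
// Host name and body values are placeholders; adjust for your region.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class CreateTranscription
{
    static async Task Main()
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "YourSubscriptionKey");

        var body = @"{
          ""recordingsUrl"": ""<SAS URI of the audio blob>"",
          ""locale"": ""en-US"",
          ""name"": ""My transcription"",
          ""properties"": { ""PunctuationMode"": ""DictatedAndAutomatic"" }
        }";

        // Assumed regional endpoint; for Azure China use https://chinaeast2.cris.azure.cn.
        var response = await client.PostAsync(
            "https://westus.cris.ai/api/speechtotext/v2.0/transcriptions",
            new StringContent(body, Encoding.UTF8, "application/json"));

        // The Location header (if present) points at the new transcription resource.
        Console.WriteLine(response.StatusCode);
        Console.WriteLine(response.Headers.Location);
    }
}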

You can review and test the detailed API, which is available as a Swagger document, under the heading Custom Speech transcriptions.

Batch transcription jobs are scheduled on a best-effort basis. There is currently no estimate for when a job changes into the running state; under normal system load, this should happen within minutes. Once in the running state, the transcription itself is processed faster than real time.

Beyond the easy-to-use API, there is no need to deploy custom endpoints, and there are no concurrency requirements to observe.

Prerequisites

Subscription key

As with all features of the Speech service, you create a subscription key from the Azure portal by following our Get started guide.

Note

A standard subscription (S0) for the Speech service is required to use batch transcription. Free subscription keys (F0) don't work. For more information, see pricing and limits.

Custom models

If you plan to customize acoustic or language models, follow the steps in Customize acoustic models and Design customized language models. To use the created models in batch transcription, you need their model IDs. You can retrieve the model ID when you inspect the details of a model. A deployed custom endpoint is not needed for the batch transcription service.
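
If you prefer to look up model IDs programmatically, a sketch along these lines lists the models for a subscription. The /models path follows the same v2.0 surface as the transcription calls, but the response field names here are assumptions; check the Swagger document.

// Sketch: list models and print their IDs (response field names assumed;
// verify against the Swagger document).
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class ListModels
{
    static async Task Main()
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "YourSubscriptionKey");

        // Assumed regional endpoint.
        var json = await client.GetStringAsync(
            "https://westus.cris.ai/api/speechtotext/v2.0/models");

        foreach (var model in JArray.Parse(json))
        {
            Console.WriteLine($"{model["id"]}  {model["name"]}");
        }
    }
}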

The Batch Transcription API

Supported formats

The Batch Transcription API supports the following formats:

Format | Codec | Bitrate | Sample Rate
WAV | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo
MP3 | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo
OGG | OPUS | 16-bit | 8 kHz or 16 kHz, mono or stereo

For stereo audio streams, the left and right channels are split during transcription, and a JSON result file is created for each channel. The timestamps generated per utterance enable a developer to create an ordered final transcript.
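
Assuming two downloaded channel result files named channel0.json and channel1.json (names are placeholders), a minimal merge sketch using Json.NET could order the segments of both channels by their Offset:

// Minimal sketch: merge the two per-channel result files into one transcript
// ordered by time. Property names follow the result schema shown later in
// this article; file names are placeholders.
using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

class MergeChannels
{
    static void Main()
    {
        var segments =
            from file in new[] { "channel0.json", "channel1.json" }
            let root = JObject.Parse(File.ReadAllText(file))
            from segment in root["AudioFileResults"][0]["SegmentResults"]
            orderby (long)segment["Offset"]
            select new
            {
                Offset = (long)segment["Offset"],
                Channel = file,
                Text = (string)segment["NBest"][0]["Display"]
            };

        foreach (var s in segments)
        {
            Console.WriteLine($"{s.Offset}\t{s.Channel}\t{s.Text}");
        }
    }
}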

Configuration

Configuration parameters are provided as JSON:

{
  "recordingsUrl": "<URL to the Azure blob to transcribe>",
  "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
  "locale": "<locale to use, for example en-US>",
  "name": "<user defined name of the transcription batch>",
  "description": "<optional description of the transcription>",
  "properties": {
    "ProfanityFilterMode": "None | Removed | Tags | Masked",
    "PunctuationMode": "None | Dictated | Automatic | DictatedAndAutomatic",
    "AddWordLevelTimestamps" : "True | False",
    "AddSentiment" : "True | False",
    "AddDiarization" : "True | False",
    "TranscriptionResultsContainerUrl" : "<service SAS URI to Azure container to store results into (write permission required)>"
  }
}

Configuration properties

Use these optional properties to configure transcription:

ProfanityFilterMode
Specifies how to handle profanity in recognition results. Accepted values are None to disable profanity filtering, Masked to replace profanity with asterisks, Removed to remove all profanity from the result, or Tags to add "profanity" tags. The default setting is Masked.

PunctuationMode
Specifies how to handle punctuation in recognition results. Accepted values are None to disable punctuation, Dictated to imply explicit (spoken) punctuation, Automatic to let the decoder deal with punctuation, or DictatedAndAutomatic to use both dictated and automatic punctuation. The default setting is DictatedAndAutomatic.

AddWordLevelTimestamps
Specifies if word-level timestamps should be added to the output. Accepted values are true to enable word-level timestamps and false (the default value) to disable them.

AddSentiment
Specifies if sentiment analysis should be applied to each utterance. Accepted values are true to enable and false (the default value) to disable it. See Sentiment analysis for more detail.

AddDiarization
Specifies that diarization analysis should be carried out on the input, which is expected to be a mono channel that contains two voices. Accepted values are true to enable diarization and false (the default value) to disable it. It also requires AddWordLevelTimestamps to be set to true.

TranscriptionResultsContainerUrl
Optional URL with a service SAS to a writeable container in Azure. The result is stored in this container.
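
TranscriptionResultsContainerUrl requires a service SAS with write permission. A minimal sketch with the WindowsAzure.Storage client library (the connection string and container name are placeholders) could generate one as follows:

// Sketch: create a write-enabled service SAS for the results container.
// Uses the WindowsAzure.Storage client library; names are placeholders.
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class ResultsContainerSas
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage connection string>");
        var container = account.CreateCloudBlobClient()
                               .GetContainerReference("transcription-results");

        var policy = new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Write,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddDays(1)
        };

        // Append the SAS token to the container URI to form the value for
        // TranscriptionResultsContainerUrl.
        var sasUrl = container.Uri + container.GetSharedAccessSignature(policy);
        Console.WriteLine(sasUrl);
    }
}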

Storage

Batch transcription supports Azure Blob storage for reading audio and writing transcriptions to storage.

The batch transcription result

For mono input audio, one transcription result file is created. For stereo input audio, two transcription result files are created. Each has this structure:

{
  "AudioFileResults":[
    {
      "AudioFileName": "Channel.0.wav | Channel.1.wav"      'maximum of 2 channels supported'
      "AudioFileUrl": null                                  'always null'
      "AudioLengthInSeconds": number                        'Real number. Two decimal places'
      "CombinedResults": [
        {
          "ChannelNumber": null                             'always null'
          "Lexical": string
          "ITN": string
          "MaskedITN": string
          "Display": string
        }
      ]
      "SegmentResults": [                                   'for each individual segment'
        {
          "RecognitionStatus": "Success | Failure"
          "ChannelNumber": null
          "SpeakerId": null | "1 | 2"                       'null if no diarization
                                                             or stereo input file, the
                                                             speakerId as a string if
                                                             diarization requested for
                                                             mono audio file'
          "Offset": number                                  'time in ticks (1 tick is 100 nanosec)'
          "Duration": number                                'time in ticks (1 tick is 100 nanosec)'
          "OffsetInSeconds" : number                        'Real number. Two decimal places'
          "DurationInSeconds" : number                      'Real number. Two decimal places'
          "NBest": [
            {
              "Confidence": number                          'between 0 and 1'
              "Lexical": string
              "ITN": string
              "MaskedITN": string
              "Display": string
              "Sentiment":
                {                                           'this is omitted if sentiment is
                                                             not requested'
                  "Negative": number                        'between 0 and 1'
                  "Neutral": number                         'between 0 and 1'
                  "Positive": number                        'between 0 and 1'
                }
              "Words": [
                {
                  "Word": string
                  "Offset": number                          'time in ticks (1 tick is 100 nanosec)'
                  "Duration": number                        'time in ticks (1 tick is 100 nanosec)'
                  "OffsetInSeconds": number                 'Real number. Two decimal places'
                  "DurationInSeconds": number               'Real number. Two decimal places'
                  "Confidence": number                      'between 0 and 1'
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
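
To consume this structure from C#, a minimal sketch could deserialize it into hypothetical POCO classes that mirror just the fields needed (the sample code later in this article uses a RootObject type in the same spirit):

// Sketch: minimal POCOs for the parts of the result used below.
// Class and property names mirror the schema; anything else is assumed.
using System;
using System.IO;
using Newtonsoft.Json;

public class NBestEntry { public double Confidence; public string Display; }
public class SegmentResult { public string RecognitionStatus; public long Offset; public NBestEntry[] NBest; }
public class AudioFileResult { public string AudioFileName; public SegmentResult[] SegmentResults; }
public class RootObject { public AudioFileResult[] AudioFileResults; }

class ReadResult
{
    static void Main()
    {
        var root = JsonConvert.DeserializeObject<RootObject>(File.ReadAllText("channel0.json"));
        foreach (var segment in root.AudioFileResults[0].SegmentResults)
        {
            // The first NBest entry is the highest-confidence hypothesis.
            Console.WriteLine($"{segment.Offset}: {segment.NBest[0].Display}");
        }
    }
}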

The result contains these forms:

Lexical
The actual words recognized.

ITN
The inverse-text-normalized form of the recognized text. Abbreviations ("doctor smith" to "dr smith"), phone numbers, and other transformations are applied.

MaskedITN
The ITN form with profanity masking applied.

Display
The display form of the recognized text. Added punctuation and capitalization are included.
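
As an invented illustration, a single utterance could produce the four forms like this:

{
  "Lexical": "call doctor smith at four two five five five five one two one two",
  "ITN": "call dr smith at 425-555-1212",
  "MaskedITN": "call dr smith at 425-555-1212",
  "Display": "Call Dr. Smith at 425-555-1212."
}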

Speaker separation (diarization)

Diarization is the process of separating speakers in a piece of audio. The batch pipeline supports diarization and is capable of recognizing two speakers on mono-channel recordings. The feature is not available on stereo recordings.

All transcription output contains a SpeakerId. If diarization is not used, the JSON output shows "SpeakerId": null. Diarization supports two voices, so the speakers are identified as "1" or "2".

To request diarization, add the relevant parameters to the HTTP request, as shown below.

{
 "recordingsUrl": "<URL to the Azure blob to transcribe>",
 "models": [{"Id":"<optional acoustic model ID>"},{"Id":"<optional language model ID>"}],
 "locale": "<locale to us, for example en-US>",
 "name": "<user defined name of the transcription batch>",
 "description": "<optional description of the transcription>",
 "properties": {
   "AddWordLevelTimestamps" : "True",
   "AddDiarization" : "True"
 }
}

As the parameters in the preceding request indicate, word-level timestamps must also be turned on.
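
Once the diarized result is downloaded, each segment carries a SpeakerId to group on. A minimal sketch, reusing the hypothetical POCOs from the result-schema section, with SpeakerId assumed as an added string property on SegmentResult:

// Sketch: print a speaker-attributed transcript from a diarized mono result.
// Assumes the POCOs shown earlier, extended with a SpeakerId string property.
static void PrintSpeakers(RootObject root)
{
    foreach (var segment in root.AudioFileResults[0].SegmentResults)
    {
        Console.WriteLine($"Speaker {segment.SpeakerId}: {segment.NBest[0].Display}");
    }
}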

Sentiment analysis

The sentiment feature estimates the sentiment expressed in the audio. The sentiment is expressed by a value between 0 and 1 for each of Negative, Neutral, and Positive. For example, sentiment analysis can be used in call center scenarios:

  • Get insight on customer satisfaction
  • Get insight on the performance of the agents (the team taking the calls)
  • Find the exact point in time when a call took a turn in a negative direction
  • Identify what went well when turning a negative call in a positive direction
  • Identify what customers like and what they dislike about a product or a service

Sentiment is scored per audio segment, based on the lexical form. The entire text within that audio segment is used to calculate sentiment. No aggregate sentiment is calculated for the entire transcription. Sentiment analysis is currently available only for the English language.
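
For example, to find where a call took a negative turn (one of the scenarios above), a sketch could scan the segments for a negative-sentiment spike. The Sentiment object on each NBest entry follows the result schema; the 0.7 threshold is an arbitrary assumption:

// Sketch: flag segments whose negative sentiment exceeds a chosen threshold.
// Assumes a Sentiment { Negative, Neutral, Positive } object added to the
// NBestEntry POCO, matching the schema above; the threshold is arbitrary.
static void FlagNegativeTurns(RootObject root, double threshold = 0.7)
{
    foreach (var segment in root.AudioFileResults[0].SegmentResults)
    {
        var best = segment.NBest[0];
        if (best.Sentiment != null && best.Sentiment.Negative > threshold)
        {
            Console.WriteLine($"Negative turn at offset {segment.Offset}: {best.Display}");
        }
    }
}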

Note

We recommend using the Microsoft Text Analytics API instead. It offers more advanced features beyond sentiment analysis, such as key phrase extraction, automatic language detection, and more. You can find information and samples in the Text Analytics documentation.

A JSON output sample looks like the following:

{
  "AudioFileResults": [
    {
      "AudioFileName": "Channel.0.wav",
      "AudioFileUrl": null,
      "SegmentResults": [
        {
          "RecognitionStatus": "Success",
          "ChannelNumber": null,
          "Offset": 400000,
          "Duration": 13300000,
          "NBest": [
            {
              "Confidence": 0.976174,
              "Lexical": "what's the weather like",
              "ITN": "what's the weather like",
              "MaskedITN": "what's the weather like",
              "Display": "What's the weather like?",
              "Words": null,
              "Sentiment": {
                "Negative": 0.206194,
                "Neutral": 0.793785,
                "Positive": 0.0
              }
            }
          ]
        }
      ]
    }
  ]
}

Best practices

The transcription service can handle a large number of submitted transcriptions. You can query the status of your transcriptions with a GET on the transcriptions method. Keep the amount of information returned to a reasonable size by specifying the take parameter (a few hundred). Delete transcriptions regularly from the service after you have retrieved the results. This guarantees quick replies from the transcription management calls.
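
A housekeeping pass along these lines could remove finished transcriptions. GetTranscriptionsAsync appears in the sample below; DeleteTranscriptionAsync is an assumption to verify against the sample client in the GitHub repository:

// Sketch: delete completed transcriptions after their results are retrieved.
// DeleteTranscriptionAsync is assumed; check the sample client for the
// actual method name and signature.
static async Task CleanUpAsync(BatchClient client)
{
    var transcriptions = await client.GetTranscriptionsAsync().ConfigureAwait(false);

    foreach (var transcription in transcriptions)
    {
        if (transcription.Status == "Succeeded" || transcription.Status == "Failed")
        {
            // Results were already downloaded elsewhere; remove the entry to
            // keep the management calls fast.
            await client.DeleteTranscriptionAsync(transcription.Id).ConfigureAwait(false);
        }
    }
}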

Sample code

Complete samples are available in the GitHub sample repository inside the samples/batch subdirectory.

You have to customize the sample code with your subscription information, the service region, the SAS URI pointing to the audio file to transcribe, and model IDs in case you want to use a custom acoustic or language model.

Note

When creating the batch client for Azure China, change the endpoint to https://chinaeast2.cris.azure.cn.

// Replace with your subscription key
private const string SubscriptionKey = "YourSubscriptionKey";

// Update with your service region
private const string Region = "YourServiceRegion";
private const int Port = 443;

// recordings and locale
private const string Locale = "en-US";
private const string RecordingsBlobUri = "<SAS URI pointing to an audio file stored in Azure Blob Storage>";

// For usage of baseline models, no acoustic and language model needs to be specified.
private static Guid[] modelList = new Guid[0];

// For use of specific acoustic and language models:
// - comment the previous line
// - uncomment the next lines to create an array containing the guids of your required model(s)
// private static Guid AdaptedAcousticId = new Guid("<id of the custom acoustic model>");
// private static Guid AdaptedLanguageId = new Guid("<id of the custom language model>");
// private static Guid[] modelList = new[] { AdaptedAcousticId, AdaptedLanguageId };

//name and description
private const string Name = "Simple transcription";
private const string Description = "Simple transcription description";

The sample code sets up the client and submits the transcription request. It then polls for the status information and prints details about the transcription progress.
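
The setup and submission step, which precedes the polling loop below, looks roughly like this. The helper names CreateApiV2Client and PostTranscriptionAsync follow the GitHub sample but should be verified against the current repository:

// Create the client against the regional endpoint and submit the request.
// Helper names follow the GitHub sample; verify against the repository.
var client = BatchClient.CreateApiV2Client(SubscriptionKey, $"{Region}.cris.ai", Port);

// PostTranscriptionAsync sends the audio file details; the returned location
// identifies the new transcription.
var transcriptionLocation = await client.PostTranscriptionAsync(
    Name, Description, Locale, new Uri(RecordingsBlobUri), modelList).ConfigureAwait(false);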

// get all transcriptions for the user
transcriptions = await client.GetTranscriptionsAsync().ConfigureAwait(false);

completed = 0; running = 0; notStarted = 0;
// for each transcription in the list we check the status
foreach (var transcription in transcriptions)
{
    switch (transcription.Status)
    {
        case "Failed":
        case "Succeeded":
            // we check to see if it was one of the transcriptions we created from this client.
            if (!createdTranscriptions.Contains(transcription.Id))
            {
                // not created from here, continue
                continue;
            }
            completed++;

            // if the transcription was successful, check the results
            if (transcription.Status == "Succeeded")
            {
                var resultsUri0 = transcription.ResultsUrls["channel_0"];

                WebClient webClient = new WebClient();

                var filename = Path.GetTempFileName();
                webClient.DownloadFile(resultsUri0, filename);
                var results0 = File.ReadAllText(filename);
                var resultObject0 = JsonConvert.DeserializeObject<RootObject>(results0);

                Console.WriteLine("Transcription succeeded. Results: ");
                Console.WriteLine(results0);
            }
            else
            {
                Console.WriteLine("Transcription failed. Status: {0}", transcription.StatusMessage);
            }
            break;

        case "Running":
            running++;
            break;

        case "NotStarted":
            notStarted++;
            break;
    }
}

For full details about the preceding calls, see our Swagger document. For the full sample shown here, go to GitHub in the samples/batch subdirectory.

Take note of the asynchronous setup for posting audio and receiving transcription status. The client that you create is a .NET HTTP client. There's a PostTranscriptions method for sending the audio file details and a GetTranscriptions method for receiving the results. PostTranscriptions returns a handle that GetTranscriptions then uses to retrieve the transcription status.

The current sample code doesn't specify a custom model, so the service uses the baseline models for transcribing the file or files. To specify custom models, pass the model IDs for the acoustic and the language model to the same method.

Note

For baseline transcriptions, you don't need to declare the ID for the baseline models. If you only specify a language model ID (and no acoustic model ID), a matching acoustic model is automatically selected. If you only specify an acoustic model ID, a matching language model is automatically selected.

Download the sample

You can find the sample in the samples/batch directory in the GitHub sample repository.

Next steps