How to use batch transcription

Batch transcription is a set of REST API operations that enables you to transcribe a large amount of audio in storage. You can point to audio files using a typical URI or a shared access signature (SAS) URI and asynchronously receive transcription results. With the v3.0 API, you can transcribe one or more audio files, or process a whole storage container.

You can use the batch transcription REST APIs to call the following methods:

| Batch transcription operation | Method | REST API call |
| --- | --- | --- |
| Creates a new transcription. | POST | speechtotext/v3.0/transcriptions |
| Retrieves a list of transcriptions for the authenticated subscription. | GET | speechtotext/v3.0/transcriptions |
| Gets a list of supported locales for offline transcriptions. | GET | speechtotext/v3.0/transcriptions/locales |
| Updates the mutable details of the transcription identified by its ID. | PATCH | speechtotext/v3.0/transcriptions/{id} |
| Deletes the specified transcription task. | DELETE | speechtotext/v3.0/transcriptions/{id} |
| Gets the transcription identified by the given ID. | GET | speechtotext/v3.0/transcriptions/{id} |
| Gets the result files of the transcription identified by the given ID. | GET | speechtotext/v3.0/transcriptions/{id}/files |
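As an illustration of the first operation in the table, the following Python sketch builds the POST request that creates a new transcription. It only constructs the request rather than sending it, and the region, subscription key, and audio URL passed in are placeholder values you supply yourself:

```python
import json
import urllib.request


def build_create_transcription_request(region, subscription_key, content_urls,
                                       locale, display_name):
    """Build (but do not send) the POST that creates a batch transcription.

    region and subscription_key are placeholders for your own values; the
    endpoint host pattern matches the Azure China endpoint used elsewhere
    in this article.
    """
    body = {
        "contentUrls": content_urls,
        "locale": locale,
        "displayName": display_name,
        "properties": {"wordLevelTimestampsEnabled": True},
    }
    return urllib.request.Request(
        url=f"https://{region}.api.cognitive.azure.cn/speechtotext/v3.0/transcriptions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the returned request with `urllib.request.urlopen` (or any HTTP client) yields a 201 response whose body describes the created transcription, including its self URL.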

You can review and test the detailed API, which is available as a Swagger document.

This API doesn't require custom endpoints, and it has no concurrency requirements.

Batch transcription jobs are scheduled on a best-effort basis. You can't estimate when a job will change into the running state, but under normal system load it should happen within minutes. Once in the running state, transcription proceeds faster than real-time audio playback.

Prerequisites

As with all features of the Speech service, you create a subscription key from the Azure portal by following our Get started guide.

Note

A standard subscription (S0) for the Speech service is required to use batch transcription. Free subscription keys (F0) don't work. For more information, see pricing and limits.

If you plan to customize models, follow the steps in Acoustic customization and Language customization. To use the created models in batch transcription, you need their model location. You can retrieve the model location when you inspect the details of the model (the self property). A deployed custom endpoint isn't needed for the batch transcription service.

Batch transcription API

The batch transcription API supports the following formats:

| Format | Codec | Bits per sample | Sample rate |
| --- | --- | --- | --- |
| WAV | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo |
| MP3 | PCM | 16-bit | 8 kHz or 16 kHz, mono or stereo |
| OGG | OPUS | 16-bit | 8 kHz or 16 kHz, mono or stereo |

For stereo audio streams, the left and right channels are split during transcription. A JSON result file is created for each channel. To create an ordered final transcript, use the timestamps generated per utterance.
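The per-utterance timestamps make that ordering straightforward. A minimal Python sketch, assuming the entries of the recognizedPhrases lists from both channel result files have been collected into one list:

```python
def merge_channels(recognized_phrases):
    """Order phrases from all channels by start time into one transcript.

    Each entry is a recognizedPhrases item carrying "channel",
    "offsetInTicks", and an "nBest" list whose first alternative holds the
    display text. Returns (channel, display) pairs in playback order.
    """
    ordered = sorted(recognized_phrases, key=lambda p: p["offsetInTicks"])
    return [(p["channel"], p["nBest"][0]["display"]) for p in ordered]
```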

Configuration

Configuration parameters are provided as JSON (one or more individual files):

```json
{
  "contentUrls": [
    "<URL to an audio file to transcribe>"
  ],
  "properties": {
    "wordLevelTimestampsEnabled": true
  },
  "locale": "en-US",
  "displayName": "Transcription of file using default model for en-US"
}
```

Configuration parameters are provided as JSON (processing a whole storage container):

```json
{
  "contentContainerUrl": "<SAS URL to the Azure blob container to transcribe>",
  "properties": {
    "wordLevelTimestampsEnabled": true
  },
  "locale": "en-US",
  "displayName": "Transcription of container using default model for en-US"
}
```

The following JSON specifies a custom trained model to use in a batch transcription:

```json
{
  "contentUrls": [
    "<URL to an audio file to transcribe>"
  ],
  "properties": {
    "wordLevelTimestampsEnabled": true
  },
  "locale": "en-US",
  "model": {
    "self": "https://chinaeast2.api.cognitive.azure.cn/speechtotext/v3.0/models/{id}"
  },
  "displayName": "Transcription of file using custom model for en-US"
}
```

Configuration properties

Use these optional properties to configure transcription:

| Parameter | Description |
| --- | --- |
| profanityFilterMode | Optional, defaults to Masked. Specifies how to handle profanity in recognition results. Accepted values are None to disable profanity filtering, Masked to replace profanity with asterisks, Removed to remove all profanity from the result, or Tags to add "profanity" tags. |
| punctuationMode | Optional, defaults to DictatedAndAutomatic. Specifies how to handle punctuation in recognition results. Accepted values are None to disable punctuation, Dictated to imply explicit (spoken) punctuation, Automatic to let the decoder deal with punctuation, or DictatedAndAutomatic to use dictated and automatic punctuation. |
| wordLevelTimestampsEnabled | Optional, false by default. Specifies whether word-level timestamps should be added to the output. |
| diarizationEnabled | Optional, false by default. Specifies that diarization analysis should be carried out on the input, which is expected to be a mono channel containing two voices. Note: requires wordLevelTimestampsEnabled to be set to true. |
| channels | Optional, 0 and 1 transcribed by default. An array of channel numbers to process. A subset of the available channels in the audio file can be specified here (for example, 0 only). |
| timeToLive | Optional, no deletion by default. A duration after which a completed transcription is automatically deleted. timeToLive is useful in mass processing to ensure transcriptions are eventually deleted (for example, PT12H for 12 hours). |
| destinationContainerUrl | Optional URL with service ad hoc SAS to a writeable container in Azure. The result is stored in this container. SAS with a stored access policy isn't supported. When not specified, Microsoft stores the results in a storage container managed by Microsoft. When the transcription is deleted by calling Delete transcription, the result data is also deleted. |
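The timeToLive property takes an ISO 8601 duration string such as PT12H. As a sketch, a small Python helper that renders a timedelta in that form (hours, minutes, and seconds only, which is sufficient for typical retention windows):

```python
from datetime import timedelta


def iso8601_duration(td):
    """Render a timedelta as an ISO 8601 duration string, e.g. PT12H.

    Only hour/minute/second components are emitted, which covers the
    values the timeToLive property typically needs.
    """
    total = int(td.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    out = "PT"
    if hours:
        out += f"{hours}H"
    if minutes:
        out += f"{minutes}M"
    if seconds or out == "PT":
        out += f"{seconds}S"
    return out
```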

Storage

Batch transcription can read audio from a publicly visible internet URI, and can read audio or write transcriptions using a SAS URI with Azure Blob storage.

Batch transcription result

For each audio input, one transcription result file is created. The Get transcription files operation returns a list of result files for this transcription. To find the transcription file for a specific input file, filter all returned files with kind == Transcription and name == {originalInputName.suffix}.json.
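That filter can be expressed directly. A minimal Python sketch, assuming the file list has already been parsed from the JSON response and each entry carries "kind" and "name" fields:

```python
def find_result_file(files, original_input_name):
    """Pick the transcription result for one input out of the files
    returned by Get transcription files.

    original_input_name includes its suffix (e.g. "audio.wav"); the
    matching result file is named "<input name>.json".
    """
    target = f"{original_input_name}.json"
    for f in files:
        if f["kind"] == "Transcription" and f["name"] == target:
            return f
    return None
```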

Each transcription result file has this format:

```json
{
  "source": "...",                      // sas url of a given contentUrl or the path relative to the root of a given container
  "timestamp": "2020-06-16T09:30:21Z",  // creation time of the transcription, ISO 8601 encoded timestamp, combined date and time
  "durationInTicks": 41200000,          // total audio duration in ticks (1 tick is 100 nanoseconds)
  "duration": "PT4.12S",                // total audio duration, ISO 8601 encoded duration
  "combinedRecognizedPhrases": [        // concatenated results for simple access in single string for each channel
    {
      "channel": 0,                     // channel number of the concatenated results
      "lexical": "hello world",
      "itn": "hello world",
      "maskedITN": "hello world",
      "display": "Hello world."
    }
  ],
  "recognizedPhrases": [                // results for each phrase and each channel individually
    {
      "recognitionStatus": "Success",   // recognition state, e.g. "Success", "Failure"
      "channel": 0,                     // channel number of the result
      "offset": "PT0.07S",              // offset in audio of this phrase, ISO 8601 encoded duration
      "duration": "PT1.59S",            // audio duration of this phrase, ISO 8601 encoded duration
      "offsetInTicks": 700000.0,        // offset in audio of this phrase in ticks (1 tick is 100 nanoseconds)
      "durationInTicks": 15900000.0,    // audio duration of this phrase in ticks (1 tick is 100 nanoseconds)

      // possible transcriptions of the current phrase with confidences
      "nBest": [
        {
          "confidence": 0.898652852,    // confidence value for the recognition of the whole phrase
          "speaker": 1,                 // if `diarizationEnabled` is `true`, this is the identified speaker (1 or 2), otherwise this property is not present
          "lexical": "hello world",
          "itn": "hello world",
          "maskedITN": "hello world",
          "display": "Hello world.",

          // if wordLevelTimestampsEnabled is `true`, there will be a result for each word of the phrase, otherwise this property is not present
          "words": [
            {
              "word": "hello",
              "offset": "PT0.09S",
              "duration": "PT0.48S",
              "offsetInTicks": 900000.0,
              "durationInTicks": 4800000.0,
              "confidence": 0.987572
            },
            {
              "word": "world",
              "offset": "PT0.59S",
              "duration": "PT0.16S",
              "offsetInTicks": 5900000.0,
              "durationInTicks": 1600000.0,
              "confidence": 0.906032
            }
          ]
        }
      ]
    }
  ]
}
```

The result contains the following fields:

| Field | Content |
| --- | --- |
| lexical | The actual words recognized. |
| itn | The inverse-text-normalized form of the recognized text. Abbreviations ("doctor smith" to "dr smith"), phone numbers, and other transformations are applied. |
| maskedITN | The ITN form with profanity masking applied. |
| display | The display form of the recognized text. Added punctuation and capitalization are included. |
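Reading those fields out of a result file is simple. A minimal Python sketch that maps each channel to its concatenated display transcript, assuming the raw JSON text of one result file:

```python
import json


def combined_display(result_json):
    """Map each channel number to its concatenated display transcript,
    taken from the combinedRecognizedPhrases list of a result file."""
    result = json.loads(result_json)
    return {c["channel"]: c["display"]
            for c in result["combinedRecognizedPhrases"]}
```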

Speaker separation (diarization)

Diarization is the process of separating speakers in a piece of audio. The batch pipeline supports diarization and can recognize two speakers on mono-channel recordings. The feature isn't available on stereo recordings.

The output of a transcription with diarization enabled contains a speaker entry for each transcribed phrase. If diarization isn't used, the speaker property isn't present in the JSON output. For diarization we support two voices, so the speakers are identified as 1 or 2.

To request diarization, set the diarizationEnabled property to true, as the HTTP request below shows.

```json
{
  "contentUrls": [
    "<URL to an audio file to transcribe>"
  ],
  "properties": {
    "diarizationEnabled": true,
    "wordLevelTimestampsEnabled": true,
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  },
  "locale": "en-US",
  "displayName": "Transcription of file using default model for en-US"
}
```

Word-level timestamps must be enabled, as the parameters in the above request indicate.
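A common post-processing step is regrouping phrases per speaker. A minimal Python sketch, assuming the recognizedPhrases list from a diarization-enabled result file:

```python
def transcript_by_speaker(recognized_phrases):
    """Group the best transcription of each phrase under its speaker.

    Phrases are walked in timestamp order; the speaker label (1 or 2)
    lives on each nBest alternative when diarization is enabled.
    """
    grouped = {}
    for phrase in sorted(recognized_phrases, key=lambda p: p["offsetInTicks"]):
        best = phrase["nBest"][0]
        grouped.setdefault(best.get("speaker"), []).append(best["display"])
    return grouped
```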

Best practices

The batch transcription service can handle a large number of submitted transcriptions. You can query the status of your transcriptions with Get transcriptions. Once you have retrieved the results, call Delete transcription regularly. Alternatively, set the timeToLive property to ensure the eventual deletion of the results.
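A sketch of the corresponding DELETE request in Python, assuming the transcription's self URL has been saved from an earlier response; the request is built but not sent:

```python
import urllib.request


def build_delete_request(transcription_self_url, subscription_key):
    """Build (but do not send) the DELETE that removes a finished
    transcription. The self URL comes from the transcription's own
    metadata; subscription_key is a placeholder for your own key."""
    return urllib.request.Request(
        url=transcription_self_url,
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
        method="DELETE",
    )
```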

Sample code

Complete samples are available in the GitHub sample repository, inside the samples/batch subdirectory.

Update the sample code with your subscription information, service region, the URI pointing to the audio file to transcribe, and the model location if you're using a custom model.

```csharp
var newTranscription = new Transcription
{
    DisplayName = DisplayName,
    Locale = Locale,
    ContentUrls = new[] { RecordingsBlobUri },
    //ContentContainerUrl = ContentAzureBlobContainer,
    Model = CustomModel,
    Properties = new TranscriptionProperties
    {
        IsWordLevelTimestampsEnabled = true,
        TimeToLive = TimeSpan.FromDays(1)
    }
};

newTranscription = await client.PostTranscriptionAsync(newTranscription).ConfigureAwait(false);
Console.WriteLine($"Created transcription {newTranscription.Self}");
```

The sample code sets up the client and submits the transcription request. It then polls for status information and prints details about the transcription progress.

```csharp
    if (paginatedTranscriptions == null)
    {
        paginatedTranscriptions = await client.GetTranscriptionsAsync().ConfigureAwait(false);
    }
    else
    {
        paginatedTranscriptions = await client.GetTranscriptionsAsync(paginatedTranscriptions.NextLink).ConfigureAwait(false);
    }

    // delete all pre-existing completed transcriptions. If transcriptions are still running or not started, they will not be deleted
    foreach (var transcription in paginatedTranscriptions.Values)
    {
        switch (transcription.Status)
        {
            case "Failed":
            case "Succeeded":
                // we check to see if it was one of the transcriptions we created from this client.
                if (!createdTranscriptions.Contains(transcription.Self))
                {
                    // not created from here, continue
                    continue;
                }

                completed++;

                // if the transcription was successful, check the results
                if (transcription.Status == "Succeeded")
                {
                    var paginatedfiles = await client.GetTranscriptionFilesAsync(transcription.Links.Files).ConfigureAwait(false);

                    var resultFile = paginatedfiles.Values.FirstOrDefault(f => f.Kind == ArtifactKind.Transcription);
                    var result = await client.GetTranscriptionResultAsync(new Uri(resultFile.Links.ContentUrl)).ConfigureAwait(false);
                    Console.WriteLine("Transcription succeeded. Results: ");
                    Console.WriteLine(JsonConvert.SerializeObject(result, SpeechJsonContractResolver.WriterSettings));
                }
                else
                {
                    Console.WriteLine("Transcription failed. Status: {0}", transcription.Properties.Error.Message);
                }

                break;

            case "Running":
                running++;
                break;

            case "NotStarted":
                notStarted++;
                break;
        }
    }

    // report the overall status counts after checking each transcription in the list
    Console.WriteLine(string.Format("Transcriptions status: {0} completed, {1} running, {2} not started yet", completed, running, notStarted));
}
while (paginatedTranscriptions.NextLink != null);
```

This sample code doesn't specify a custom model. The service uses the baseline model for transcribing the file or files. To specify a model, pass the model reference for the custom model to the same method.

Note

For baseline transcriptions, you don't need to declare the ID for the baseline model.

Next steps