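Use the fast transcription API with Azure Speech

The fast transcription API transcribes audio files synchronously and returns results faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, for example:

Quick audio or video transcription, subtitles, and editing
Meeting notes
Voicemail

Unlike the batch transcription API, the fast transcription API produces transcriptions only in display form, not in lexical form. Display form is a more human-readable form of the transcription that includes punctuation and capitalization.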

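Prerequisites

An Azure Speech resource in one of the regions where the fast transcription API is available. For the current list of supported regions, see the Speech service regions table.
An audio file (less than 5 hours long and less than 500 MB in size) in one of the formats and codecs supported by the batch transcription API: WAV, MP3, OPUS/OGG, FLAC, WMA, AAC, ALAW in a WAV container, MULAW in a WAV container, AMR, WebM, and SPEEX. For more information, see the supported audio formats.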
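
The size and format prerequisites above can be checked locally before a file is sent. The following is a minimal sketch, not part of the service: the helper name precheck_audio is hypothetical, "500 MB" is assumed to mean 500 × 1024 × 1024 bytes, and the mapping from formats to file extensions is an assumption. The 5-hour duration limit is not checked, because that requires decoding the audio.

```python
import os

# Limit taken from the prerequisites; the byte interpretation is an assumption.
MAX_BYTES = 500 * 1024 * 1024
# Assumed file extensions for the listed formats (not an official list).
KNOWN_EXTENSIONS = {".wav", ".mp3", ".opus", ".ogg", ".flac", ".wma",
                    ".aac", ".amr", ".webm", ".spx"}


def precheck_audio(path: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    size = os.path.getsize(path)
    if size >= MAX_BYTES:
        problems.append(f"file is {size} bytes; it must be smaller than {MAX_BYTES}")
    ext = os.path.splitext(path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        problems.append(f"unrecognized extension {ext!r}")
    return problems
```

An empty result only means that no obvious problem was found; the service remains the authority on whether a file is accepted.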

Upload audio

You can provide audio data for fast transcription in the following ways:

Inline audio upload

--form 'audio=@"YourAudioFile"'

Audio from a public URL

--form 'definition="{\"audioUrl\": \"https://crbn.us/hello.wav\"}"'

Tip

For long audio files, we recommend providing the audio from a public URL.

The following sections use inline audio upload as the example.
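
The two ways of supplying audio differ only in which form part carries it. As a rough sketch (the function name audio_source_fields and its return shape are hypothetical, chosen for illustration), the choice can be made explicit like this:

```python
import json


def audio_source_fields(audio_path=None, audio_url=None):
    """Return (definition, audio_path) for exactly one audio source.

    Inline upload: the file goes in the `audio` form part.
    Public URL: there is no `audio` part; the URL goes in `audioUrl`.
    """
    if (audio_path is None) == (audio_url is None):
        raise ValueError("provide exactly one of audio_path or audio_url")
    definition = {}
    if audio_url is not None:
        definition["audioUrl"] = audio_url
    return definition, audio_path


definition, upload = audio_source_fields(audio_url="https://crbn.us/hello.wav")
print(json.dumps(definition))  # {"audioUrl": "https://crbn.us/hello.wav"}
```

Returning the definition as a dictionary lets the caller add further properties before serializing it into the definition form field.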

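Use the fast transcription API

The following scenarios show how to use the fast transcription API (through the Transcriptions - Transcribe operation):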

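Known locale specified: Transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency.
Language identification on: Transcribe an audio file with language identification enabled. If you're not sure about the locale of the audio file, you can turn on language identification and let the Speech service identify the locale (one locale per audio file).
Multilingual transcription: Transcribe an audio file with the latest multilingual speech transcription model. If your audio contains multilingual content that you want to transcribe continuously and accurately, you can use the latest multilingual speech transcription model without specifying locale codes.
Diarization on: Transcribe an audio file with diarization enabled. Diarization distinguishes between the different speakers in a conversation. The Speech service provides information about which speaker was speaking during a particular part of the transcribed speech.
Multi-channel on: Transcribe an audio file that has one or two channels. Multi-channel transcription is useful for audio files with multiple channels, such as recordings with several speakers or with background noise. By default, the fast transcription API merges all input channels into a single channel and then transcribes it. If that isn't what you want, the channels can be transcribed independently without merging.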

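Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with a specified locale. If you know the locale of the audio file, you can specify it to improve transcription accuracy and minimize latency.

Replace YourSpeechResourceKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.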

curl --location 'https://YourServiceRegion.api.cognitive.azure.cn/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    \"locales\":[\"en-US\"]}"'

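Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. For more information about supported locales, see the speech to text supported languages.

For more information about locales and other properties of the fast transcription API, see the "Request configuration options" section later in this guide.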
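
For readers who call the endpoint from code rather than from curl, the same request can be assembled with the Python standard library. This is a sketch under stated assumptions, not an official client: the function name build_transcribe_request is hypothetical, the audio part is labeled application/octet-stream as an assumption, and the request is only built, not sent. The URL, header names, form part names, and API version are the ones used in the curl command above.

```python
import json
import uuid
import urllib.request

API_VERSION = "2025-10-15"


def build_transcribe_request(region, audio_bytes, filename, locales,
                             key=None, token=None):
    """Build (but do not send) a multipart/form-data POST request."""
    if (key is None) == (token is None):
        raise ValueError("pass exactly one of key or token")
    boundary = uuid.uuid4().hex  # random separator between form parts
    definition = json.dumps({"locales": locales})
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",  # assumed part type
        audio_bytes,
        b"\r\n",
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="definition"\r\n\r\n',
        definition.encode("utf-8"),
        b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if key is not None:
        headers["Ocp-Apim-Subscription-Key"] = key   # key-based authentication
    else:
        headers["Authorization"] = f"Bearer {token}"  # keyless authentication
    url = (f"https://{region}.api.cognitive.azure.cn/speechtotext/"
           f"transcriptions:transcribe?api-version={API_VERSION}")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_transcribe_request(
    "YourServiceRegion", b"RIFF", "YourAudioFile.wav", ["en-US"],
    key="YourSpeechResourceKey")
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and parse the response body with json.load. Serializing the definition with json.dumps avoids the manual quote escaping that the curl form requires.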

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcription for all speakers.

{
    "durationMilliseconds": 182439,
    "combinedPhrases": [
        {
            "text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And you're currently paying those loans off monthly, is that right? Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
        }
    ],
    "phrases": [
        {
            "offsetMilliseconds": 960,
            "durationMilliseconds": 640,
            "text": "Good afternoon.",
            "words": [
                {
                    "text": "Good",
                    "offsetMilliseconds": 960,
                    "durationMilliseconds": 240
                },
                {
                    "text": "afternoon.",
                    "offsetMilliseconds": 1200,
                    "durationMilliseconds": 400
                }
            ],
            "locale": "en-US",
            "confidence": 0.93554276
        },
        {
            "offsetMilliseconds": 1600,
            "durationMilliseconds": 640,
            "text": "This is Sam.",
            "words": [
                {
                    "text": "This",
                    "offsetMilliseconds": 1600,
                    "durationMilliseconds": 240
                },
                {
                    "text": "is",
                    "offsetMilliseconds": 1840,
                    "durationMilliseconds": 120
                },
                {
                    "text": "Sam.",
                    "offsetMilliseconds": 1960,
                    "durationMilliseconds": 280
                }
            ],
            "locale": "en-US",
            "confidence": 0.93554276
        },
        {
            "offsetMilliseconds": 2240,
            "durationMilliseconds": 1040,
            "text": "Thank you for calling Contoso.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 2240,
                    "durationMilliseconds": 200
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 2440,
                    "durationMilliseconds": 80
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 2520,
                    "durationMilliseconds": 120
                },
                {
                    "text": "calling",
                    "offsetMilliseconds": 2640,
                    "durationMilliseconds": 200
                },
                {
                    "text": "Contoso.",
                    "offsetMilliseconds": 2840,
                    "durationMilliseconds": 440
                }
            ],
            "locale": "en-US",
            "confidence": 0.93554276
        },
        {
            "offsetMilliseconds": 3280,
            "durationMilliseconds": 640,
            "text": "How can I help?",
            "words": [
                {
                    "text": "How",
                    "offsetMilliseconds": 3280,
                    "durationMilliseconds": 120
                },
                {
                    "text": "can",
                    "offsetMilliseconds": 3440,
                    "durationMilliseconds": 120
                },
                {
                    "text": "I",
                    "offsetMilliseconds": 3560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "help?",
                    "offsetMilliseconds": 3600,
                    "durationMilliseconds": 320
                }
            ],
            "locale": "en-US",
            "confidence": 0.93554276
        },
        {
            "offsetMilliseconds": 5040,
            "durationMilliseconds": 400,
            "text": "Hi there.",
            "words": [
                {
                    "text": "Hi",
                    "offsetMilliseconds": 5040,
                    "durationMilliseconds": 240
                },
                {
                    "text": "there.",
                    "offsetMilliseconds": 5280,
                    "durationMilliseconds": 160
                }
            ],
            "locale": "en-US",
            "confidence": 0.93554276
        },
        {
            "offsetMilliseconds": 5440,
            "durationMilliseconds": 800,
            "text": "My name is Mary.",
            "words": [
                {
                    "text": "My",
                    "offsetMilliseconds": 5440,
                    "durationMilliseconds": 80
                },
                {
                    "text": "name",
                    "offsetMilliseconds": 5520,
                    "durationMilliseconds": 120
                },
                {
                    "text": "is",
                    "offsetMilliseconds": 5640,
                    "durationMilliseconds": 80
                },
                {
                    "text": "Mary.",
                    "offsetMilliseconds": 5720,
                    "durationMilliseconds": 520
                }
            ],
            "locale": "en-US",
            "confidence": 0.93554276
        },
        // More transcription results...
        // Redacted for brevity
        {
            "offsetMilliseconds": 180320,
            "durationMilliseconds": 680,
            "text": "Thank you for your help.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 180320,
                    "durationMilliseconds": 160
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 180480,
                    "durationMilliseconds": 80
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 180560,
                    "durationMilliseconds": 120
                },
                {
                    "text": "your",
                    "offsetMilliseconds": 180680,
                    "durationMilliseconds": 120
                },
                {
                    "text": "help.",
                    "offsetMilliseconds": 180800,
                    "durationMilliseconds": 200
                }
            ],
            "locale": "en-US",
            "confidence": 0.92022026
        },
        {
            "offsetMilliseconds": 181960,
            "durationMilliseconds": 280,
            "text": "Thank you.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 181960,
                    "durationMilliseconds": 200
                },
                {
                    "text": "you.",
                    "offsetMilliseconds": 182160,
                    "durationMilliseconds": 80
                }
            ],
            "locale": "en-US",
            "confidence": 0.92022026
        }
    ]
}
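
Because each phrase carries its own offset and duration, the response maps naturally onto the subtitle use case mentioned in the introduction. The sketch below is illustrative only: the helper names are hypothetical, and the embedded JSON is a trimmed copy of the first two phrases of the sample response above (the combinedPhrases text is shortened to match).

```python
import json

# Trimmed copy of the sample response above: first two phrases only.
response = json.loads("""
{
  "durationMilliseconds": 182439,
  "combinedPhrases": [{"text": "Good afternoon. This is Sam."}],
  "phrases": [
    {"offsetMilliseconds": 960, "durationMilliseconds": 640,
     "text": "Good afternoon.", "locale": "en-US", "confidence": 0.93554276},
    {"offsetMilliseconds": 1600, "durationMilliseconds": 640,
     "text": "This is Sam.", "locale": "en-US", "confidence": 0.93554276}
  ]
}
""")


def format_timestamp(ms: int) -> str:
    """Convert milliseconds to HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def phrase_lines(resp: dict) -> list[str]:
    """Return one 'start --> end  text' line per phrase."""
    lines = []
    for phrase in resp["phrases"]:
        start = phrase["offsetMilliseconds"]
        end = start + phrase["durationMilliseconds"]
        lines.append(
            f"{format_timestamp(start)} --> {format_timestamp(end)}  {phrase['text']}")
    return lines


for line in phrase_lines(response):
    print(line)
```

A phrase ends at offsetMilliseconds plus durationMilliseconds; the same arithmetic applies to the entries in each phrase's words array when word-level timing is needed.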

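Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with language identification enabled. If you're not sure about the locale, you can specify multiple locales. If you don't specify any locale, or if none of the locales you specify is present in the audio file, the Speech service tries to identify the language.

Note

Language identification in fast transcription is designed to identify the primary language of an audio file. If you need to transcribe multilingual content within the audio, consider multilingual transcription.

Replace YourSpeechResourceKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.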

curl --location 'https://YourServiceRegion.api.cognitive.azure.cn/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    \"locales\":[\"en-US\",\"ja-JP\"]}"'

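Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property, which should match the expected locales of the audio data to transcribe. In this example, the locales are set to en-US and ja-JP. For the supported locales that you can specify, see the list of all supported languages.

For more information about locales and other properties of the fast transcription API, see the "Request configuration options" section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcription for all speakers.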

{
    "durationMilliseconds": 185079,
    "combinedPhrases": [
        {
            "text": "Hello, thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso. Hi, Mary. Are you calling because you need health insurance? Yes. Yeah, I'm calling to sign up for insurance. Great. Uh If you can answer a few questions, we can get you signed up in a Jiffy. Okay. So what's your full name? uh So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. Got it. And what's the best callback number in case we get disconnected? I only have a cell phone, so I can give you that. Yep, that'll be fine. Sure. So it's 234-554 and then 9312. Got it. So to confirm, it's 234-554-9312. Yep, that's right. Excellent. Let's get some additional information for your application. Do you have a job? Uh Yes, I am self-employed. Okay, so then you have a social security number as well? Uh Yes, I do. Okay, and what is your social security number, please? Uh Sure, so it's 412-253-4931. 6789. Sorry, was that a 25 or a 225? You cut out for a bit. It's double two, so 412, then another two, then five. Thank you so much. And could I have your e-mail address, please? Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. That sounds good. Thank you. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Actually, so I have one more question. Yes, of course. I'm curious, will I be getting a physical card as proof of coverage? So the default is a digital membership card, but we can send you a physical card if you prefer. Uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? Uh Yeah. uh So it's 2660 Unit A on Maple Avenue, Southeast Lansing, and then zip code is 48823. Absolutely. I've made a note on your file. Awesome. Thanks so much. You're very welcome. Thank you for calling Contoso and have a great day."
        }
    ],
    "phrases": [
        {
            "offsetMilliseconds": 720,
            "durationMilliseconds": 1600,
            "text": "Hello, thank you for calling Contoso.",
            "words": [
                {
                    "text": "Hello,",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 480
                },
                {
                    "text": "thank",
                    "offsetMilliseconds": 1200,
                    "durationMilliseconds": 200
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 1400,
                    "durationMilliseconds": 80
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 1480,
                    "durationMilliseconds": 120
                },
                {
                    "text": "calling",
                    "offsetMilliseconds": 1600,
                    "durationMilliseconds": 240
                },
                {
                    "text": "Contoso.",
                    "offsetMilliseconds": 1840,
                    "durationMilliseconds": 480
                }
            ],
            "locale": "en-US",
            "confidence": 0.93265927
        },
        {
            "offsetMilliseconds": 2320,
            "durationMilliseconds": 1120,
            "text": "Who am I speaking with today?",
            "words": [
                {
                    "text": "Who",
                    "offsetMilliseconds": 2320,
                    "durationMilliseconds": 160
                },
                {
                    "text": "am",
                    "offsetMilliseconds": 2480,
                    "durationMilliseconds": 80
                },
                {
                    "text": "I",
                    "offsetMilliseconds": 2560,
                    "durationMilliseconds": 80
                },
                {
                    "text": "speaking",
                    "offsetMilliseconds": 2640,
                    "durationMilliseconds": 320
                },
                {
                    "text": "with",
                    "offsetMilliseconds": 2960,
                    "durationMilliseconds": 160
                },
                {
                    "text": "today?",
                    "offsetMilliseconds": 3120,
                    "durationMilliseconds": 320
                }
            ],
            "locale": "en-US",
            "confidence": 0.93265927
        },
        {
            "offsetMilliseconds": 4480,
            "durationMilliseconds": 1600,
            "text": "Hi, my name is Mary Rondo.",
            "words": [
                {
                    "text": "Hi,",
                    "offsetMilliseconds": 4480,
                    "durationMilliseconds": 400
                },
                {
                    "text": "my",
                    "offsetMilliseconds": 4880,
                    "durationMilliseconds": 120
                },
                {
                    "text": "name",
                    "offsetMilliseconds": 5000,
                    "durationMilliseconds": 120
                },
                {
                    "text": "is",
                    "offsetMilliseconds": 5120,
                    "durationMilliseconds": 160
                },
                {
                    "text": "Mary",
                    "offsetMilliseconds": 5280,
                    "durationMilliseconds": 240
                },
                {
                    "text": "Rondo.",
                    "offsetMilliseconds": 5520,
                    "durationMilliseconds": 560
                }
            ],
            "locale": "en-US",
            "confidence": 0.93265927
        },
        {
            "offsetMilliseconds": 6120,
            "durationMilliseconds": 1800,
            "text": "I'm trying to enroll myself with Contoso.",
            "words": [
                {
                    "text": "I'm",
                    "offsetMilliseconds": 6120,
                    "durationMilliseconds": 120
                },
                {
                    "text": "trying",
                    "offsetMilliseconds": 6240,
                    "durationMilliseconds": 200
                },
                {
                    "text": "to",
                    "offsetMilliseconds": 6440,
                    "durationMilliseconds": 80
                },
                {
                    "text": "enroll",
                    "offsetMilliseconds": 6520,
                    "durationMilliseconds": 200
                },
                {
                    "text": "myself",
                    "offsetMilliseconds": 6720,
                    "durationMilliseconds": 360
                },
                {
                    "text": "with",
                    "offsetMilliseconds": 7080,
                    "durationMilliseconds": 120
                },
                {
                    "text": "Contoso.",
                    "offsetMilliseconds": 7200,
                    "durationMilliseconds": 720
                }
            ],
            "locale": "en-US",
            "confidence": 0.93265927
        },
        // More transcription results...
        // Redacted for brevity
        {
            "offsetMilliseconds": 181520,
            "durationMilliseconds": 720,
            "text": "You're very welcome.",
            "words": [
                {
                    "text": "You're",
                    "offsetMilliseconds": 181520,
                    "durationMilliseconds": 160
                },
                {
                    "text": "very",
                    "offsetMilliseconds": 181680,
                    "durationMilliseconds": 200
                },
                {
                    "text": "welcome.",
                    "offsetMilliseconds": 181880,
                    "durationMilliseconds": 360
                }
            ],
            "locale": "en-US",
            "confidence": 0.90571773
        },
        {
            "offsetMilliseconds": 182320,
            "durationMilliseconds": 1840,
            "text": "Thank you for calling Contoso and have a great day.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 182320,
                    "durationMilliseconds": 200
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 182520,
                    "durationMilliseconds": 80
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 182600,
                    "durationMilliseconds": 120
                },
                {
                    "text": "calling",
                    "offsetMilliseconds": 182720,
                    "durationMilliseconds": 280
                },
                {
                    "text": "Contoso",
                    "offsetMilliseconds": 183000,
                    "durationMilliseconds": 520
                },
                {
                    "text": "and",
                    "offsetMilliseconds": 183520,
                    "durationMilliseconds": 160
                },
                {
                    "text": "have",
                    "offsetMilliseconds": 183680,
                    "durationMilliseconds": 120
                },
                {
                    "text": "a",
                    "offsetMilliseconds": 183800,
                    "durationMilliseconds": 40
                },
                {
                    "text": "great",
                    "offsetMilliseconds": 183840,
                    "durationMilliseconds": 200
                },
                {
                    "text": "day.",
                    "offsetMilliseconds": 184040,
                    "durationMilliseconds": 120
                }
            ],
            "locale": "en-US",
            "confidence": 0.90571773
        }
    ]
}
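
With language identification, the identified language shows up as the locale value on each phrase. One way to summarize which of the candidate locales was detected is to total the phrase durations per locale. This is a minimal sketch with a hypothetical helper name; the first and last input phrases reuse values from the sample response above, while the ja-JP entry is invented purely to exercise the logic.

```python
from collections import Counter


def dominant_locale(phrases):
    """Return the locale that covers the most audio time, or None if empty."""
    totals = Counter()
    for phrase in phrases:
        totals[phrase["locale"]] += phrase["durationMilliseconds"]
    return totals.most_common(1)[0][0] if totals else None


phrases = [
    {"locale": "en-US", "durationMilliseconds": 1600},  # from the sample above
    {"locale": "ja-JP", "durationMilliseconds": 500},   # hypothetical entry
    {"locale": "en-US", "durationMilliseconds": 1120},  # from the sample above
]
print(dominant_locale(phrases))  # en-US
```

Weighting by duration rather than by phrase count keeps a handful of very short phrases from outweighing long stretches of speech.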

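Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with the latest multilingual speech transcription model. If your audio contains multilingual content that you want to transcribe continuously and accurately, you can use the latest multilingual speech transcription model without specifying locale codes.

Replace YourSpeechResourceKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.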

curl --location 'https://YourServiceRegion.api.cognitive.azure.cn/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    \"locales\":[]}"'

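Construct the form definition according to the following instructions:

You can leave the locales property empty (as in the preceding example) or omit it.
The audio input locales supported by the current multilingual model are: de-DE, en-AU, en-CA, en-GB, en-IN, en-US, es-ES, es-MX, fr-CA, fr-FR, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
Transcription results are distinguished at the language level and follow the "primary locale for the language". For example, the output locale code is always en-US, whether the audio has a British English or an Indian English accent.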
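
Locale codes do not always arrive in the same letter case (the sample response for this scenario reports en-us in lowercase), so a case-insensitive comparison is safer when checking a code against the list above. The sketch below uses hypothetical helper names, and the set simply restates the locales listed above; it is not fetched from the service.

```python
# Locales listed above for the current multilingual model.
MULTILINGUAL_LOCALES = {
    "de-DE", "en-AU", "en-CA", "en-GB", "en-IN", "en-US", "es-ES", "es-MX",
    "fr-CA", "fr-FR", "it-IT", "ja-JP", "ko-KR", "pt-BR", "zh-CN",
}


def normalize_locale(code: str) -> str:
    """Normalize letter case, for example 'en-us' becomes 'en-US'."""
    language, _, region = code.partition("-")
    return f"{language.lower()}-{region.upper()}" if region else language.lower()


def is_supported_multilingual(code: str) -> bool:
    """Check a locale code against the list, ignoring letter case."""
    return normalize_locale(code) in MULTILINGUAL_LOCALES


print(normalize_locale("en-us"))           # en-US
print(is_supported_multilingual("zh-cn"))  # True
```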

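For more information about locales and other properties of the fast transcription API, see the "Request configuration options" section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcription for all speakers.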

{
    "durationMilliseconds": 57187,
    "combinedPhrases": [
        {
            "text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products 现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。 Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut. Le modèle de base fonctionne très bien dans la plupart des scénarios de reconnaissance vocale. A custom model can be used to augment the base model to improve recognition of domain specific vocabulary specified to the application by providing text data to train the model. It can also be used to improve recognition based for the specific audio conditions of the application by providing audio data with reference transcriptions."
        }
    ],
    "phrases": [
        {
            "offsetMilliseconds": 80,
            "durationMilliseconds": 6960,
            "text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products.",
            "words": [
                {
                    "text": "with",
                    "offsetMilliseconds": 80,
                    "durationMilliseconds": 160
                },
                {
                    "text": "custom",
                    "offsetMilliseconds": 240,
                    "durationMilliseconds": 480
                },
                {
                    "text": "speech",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 360
                },
                {
                    "text": ",",
                    "offsetMilliseconds": 1080,
                    "durationMilliseconds": 10
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 1200,
                    "durationMilliseconds": 240
                },
                {
                    "text": "can",
                    "offsetMilliseconds": 1440,
                    "durationMilliseconds": 160
                },
                {
                    "text": "evaluate",
                    "offsetMilliseconds": 1600,
                    "durationMilliseconds": 640
                },
                {
                    "text": "and",
                    "offsetMilliseconds": 2240,
                    "durationMilliseconds": 200
                },
                {
                    "text": "improve",
                    "offsetMilliseconds": 2440,
                    "durationMilliseconds": 280
                },
                {
                    "text": "the",
                    "offsetMilliseconds": 2720,
                    "durationMilliseconds": 160
                },
                {
                    "text": "microsoft",
                    "offsetMilliseconds": 2880,
                    "durationMilliseconds": 640
                },
                {
                    "text": "speech",
                    "offsetMilliseconds": 3520,
                    "durationMilliseconds": 320
                },
                {
                    "text": "to",
                    "offsetMilliseconds": 3840,
                    "durationMilliseconds": 200
                },
                {
                    "text": "text",
                    "offsetMilliseconds": 4040,
                    "durationMilliseconds": 360
                },
                {
                    "text": "accuracy",
                    "offsetMilliseconds": 4400,
                    "durationMilliseconds": 560
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 4960,
                    "durationMilliseconds": 160
                },
                {
                    "text": "your",
                    "offsetMilliseconds": 5120,
                    "durationMilliseconds": 200
                },
                {
                    "text": "applications",
                    "offsetMilliseconds": 5320,
                    "durationMilliseconds": 760
                },
                {
                    "text": "and",
                    "offsetMilliseconds": 6080,
                    "durationMilliseconds": 200
                },
                {
                    "text": "products",
                    "offsetMilliseconds": 6280,
                    "durationMilliseconds": 680
                }
            ],
            "locale": "en-us",
            "confidence": 0.9539559
        },
        {
            "offsetMilliseconds": 8000,
            "durationMilliseconds": 8600,
            "text": "现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。",
            "words": [
                {
                    "text": "现",
                    "offsetMilliseconds": 8000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "成",
                    "offsetMilliseconds": 8040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "的",
                    "offsetMilliseconds": 8160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "语",
                    "offsetMilliseconds": 8200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "音",
                    "offsetMilliseconds": 8240,
                    "durationMilliseconds": 40
                },
                {
                    "text": "转",
                    "offsetMilliseconds": 8280,
                    "durationMilliseconds": 40
                },
                {
                    "text": "文",
                    "offsetMilliseconds": 8320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "本,",
                    "offsetMilliseconds": 8360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "利",
                    "offsetMilliseconds": 8400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 8440,
                    "durationMilliseconds": 40
                },
                {
                    "text": "通",
                    "offsetMilliseconds": 8480,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 8520,
                    "durationMilliseconds": 40
                },
                {
                    "text": "语",
                    "offsetMilliseconds": 8560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "言",
                    "offsetMilliseconds": 8600,
                    "durationMilliseconds": 40
                },
                {
                    "text": "模",
                    "offsetMilliseconds": 8640,
                    "durationMilliseconds": 40
                },
                {
                    "text": "型",
                    "offsetMilliseconds": 8680,
                    "durationMilliseconds": 40
                },
                {
                    "text": "作",
                    "offsetMilliseconds": 8800,
                    "durationMilliseconds": 40
                },
                {
                    "text": "为",
                    "offsetMilliseconds": 8840,
                    "durationMilliseconds": 40
                },
                {
                    "text": "一",
                    "offsetMilliseconds": 9520,
                    "durationMilliseconds": 40
                },
                {
                    "text": "个",
                    "offsetMilliseconds": 9560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "基",
                    "offsetMilliseconds": 9600,
                    "durationMilliseconds": 40
                },
                {
                    "text": "本",
                    "offsetMilliseconds": 9640,
                    "durationMilliseconds": 40
                },
                {
                    "text": "模",
                    "offsetMilliseconds": 9680,
                    "durationMilliseconds": 40
                },
                {
                    "text": "型,",
                    "offsetMilliseconds": 9720,
                    "durationMilliseconds": 40
                },
                {
                    "text": "使",
                    "offsetMilliseconds": 9760,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 10080,
                    "durationMilliseconds": 320
                },
                {
                    "text": "microsoft",
                    "offsetMilliseconds": 10400,
                    "durationMilliseconds": 3600
                },
                {
                    "text": "自",
                    "offsetMilliseconds": 14000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "有",
                    "offsetMilliseconds": 14040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "数",
                    "offsetMilliseconds": 14160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "据",
                    "offsetMilliseconds": 14200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "进",
                    "offsetMilliseconds": 14320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "行",
                    "offsetMilliseconds": 14360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "训",
                    "offsetMilliseconds": 14400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "练,",
                    "offsetMilliseconds": 14440,
                    "durationMilliseconds": 40
                },
                {
                    "text": "并",
                    "offsetMilliseconds": 14480,
                    "durationMilliseconds": 40
                },
                {
                    "text": "反",
                    "offsetMilliseconds": 14520,
                    "durationMilliseconds": 40
                },
                {
                    "text": "映",
                    "offsetMilliseconds": 14560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "常",
                    "offsetMilliseconds": 14600,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 14640,
                    "durationMilliseconds": 40
                },
                {
                    "text": "的",
                    "offsetMilliseconds": 14680,
                    "durationMilliseconds": 40
                },
                {
                    "text": "口",
                    "offsetMilliseconds": 14720,
                    "durationMilliseconds": 40
                },
                {
                    "text": "语",
                    "offsetMilliseconds": 14760,
                    "durationMilliseconds": 40
                },
                {
                    "text": "。",
                    "offsetMilliseconds": 14800,
                    "durationMilliseconds": 40
                },
                {
                    "text": "此",
                    "offsetMilliseconds": 14840,
                    "durationMilliseconds": 40
                },
                {
                    "text": "基",
                    "offsetMilliseconds": 14880,
                    "durationMilliseconds": 40
                },
                {
                    "text": "础",
                    "offsetMilliseconds": 14920,
                    "durationMilliseconds": 40
                },
                {
                    "text": "模",
                    "offsetMilliseconds": 14960,
                    "durationMilliseconds": 40
                },
                {
                    "text": "型",
                    "offsetMilliseconds": 15000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "使",
                    "offsetMilliseconds": 15040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "用",
                    "offsetMilliseconds": 15080,
                    "durationMilliseconds": 40
                },
                {
                    "text": "那",
                    "offsetMilliseconds": 15120,
                    "durationMilliseconds": 40
                },
                {
                    "text": "些",
                    "offsetMilliseconds": 15160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "代",
                    "offsetMilliseconds": 15200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "表",
                    "offsetMilliseconds": 15240,
                    "durationMilliseconds": 40
                },
                {
                    "text": "各",
                    "offsetMilliseconds": 15280,
                    "durationMilliseconds": 40
                },
                {
                    "text": "常",
                    "offsetMilliseconds": 15320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "见",
                    "offsetMilliseconds": 15360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "领",
                    "offsetMilliseconds": 15400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "域",
                    "offsetMilliseconds": 15760,
                    "durationMilliseconds": 40
                },
                {
                    "text": "的",
                    "offsetMilliseconds": 15800,
                    "durationMilliseconds": 40
                },
                {
                    "text": "方",
                    "offsetMilliseconds": 15920,
                    "durationMilliseconds": 40
                },
                {
                    "text": "言",
                    "offsetMilliseconds": 15960,
                    "durationMilliseconds": 40
                },
                {
                    "text": "和",
                    "offsetMilliseconds": 16000,
                    "durationMilliseconds": 40
                },
                {
                    "text": "发",
                    "offsetMilliseconds": 16040,
                    "durationMilliseconds": 40
                },
                {
                    "text": "音",
                    "offsetMilliseconds": 16080,
                    "durationMilliseconds": 40
                },
                {
                    "text": "进",
                    "offsetMilliseconds": 16120,
                    "durationMilliseconds": 40
                },
                {
                    "text": "行",
                    "offsetMilliseconds": 16160,
                    "durationMilliseconds": 40
                },
                {
                    "text": "了",
                    "offsetMilliseconds": 16200,
                    "durationMilliseconds": 40
                },
                {
                    "text": "预",
                    "offsetMilliseconds": 16320,
                    "durationMilliseconds": 40
                },
                {
                    "text": "先",
                    "offsetMilliseconds": 16360,
                    "durationMilliseconds": 40
                },
                {
                    "text": "训",
                    "offsetMilliseconds": 16400,
                    "durationMilliseconds": 40
                },
                {
                    "text": "练",
                    "offsetMilliseconds": 16560,
                    "durationMilliseconds": 40
                },
            ],
            "locale": "zh-cn",
            "confidence": 0.9241725
        },
        {
            "offsetMilliseconds": 24320,
            "durationMilliseconds": 6640,
            "text": "Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut.",
            "words": [
                {
                    "text": "Quand",
                    "offsetMilliseconds": 24320,
                    "durationMilliseconds": 160
                },
                {
                    "text": "vous",
                    "offsetMilliseconds": 24480,
                    "durationMilliseconds": 80
                },
        // More transcription results...
        // Redacted for brevity
                {
                    "text": "scénarios",
                    "offsetMilliseconds": 34200,
                    "durationMilliseconds": 400
                },
                {
                    "text": "de",
                    "offsetMilliseconds": 34600,
                    "durationMilliseconds": 120
                },
                {
                    "text": "reconnaissance",
                    "offsetMilliseconds": 34720,
                    "durationMilliseconds": 640
                },
                {
                    "text": "vocale.",
                    "offsetMilliseconds": 35360,
                    "durationMilliseconds": 480
                }
            ],
            "locale": "fr-fr",
            "confidence": 0.9308314
        },
        {
            "offsetMilliseconds": 36720,
            "durationMilliseconds": 10320,
            "text": "A custom model can be used to augment the base model to improve recognition of domain specific vocabulary spécifique to the application by providing text data to train the model.",
            "words": [
                {
                    "text": "A",
                    "offsetMilliseconds": 36720,
                    "durationMilliseconds": 80
                },
                {
                    "text": "custom",
                    "offsetMilliseconds": 36880,
                    "durationMilliseconds": 400
                },
                {
                    "text": "model",
                    "offsetMilliseconds": 37280,
                    "durationMilliseconds": 480
                },

        // More transcription results...
        // Redacted for brevity
                {
                    "text": "with",
                    "offsetMilliseconds": 54720,
                    "durationMilliseconds": 200
                },
                {
                    "text": "reference",
                    "offsetMilliseconds": 54920,
                    "durationMilliseconds": 360
                },
                {
                    "text": "transcriptions.",
                    "offsetMilliseconds": 55280,
                    "durationMilliseconds": 1200
                }
            ],
            "locale": "en-us",
            "confidence": 0.92155737
        }
    ]
}
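
One way to check what the multilingual model detected is to tally the phrases per locale. The following is a minimal Python sketch, not part of the API: the helper name `summarize_by_locale` is hypothetical, and the sample is a hand-trimmed subset of the response above (the full listing contains `//` comments, so it isn't valid JSON as displayed).

```python
from collections import defaultdict


def summarize_by_locale(response: dict) -> dict:
    """Return {locale: {"phrases": count, "milliseconds": total duration}}."""
    summary = defaultdict(lambda: {"phrases": 0, "milliseconds": 0})
    for phrase in response.get("phrases", []):
        # Locale casing differs between the examples (zh-cn, en-US), so normalize it.
        locale = phrase.get("locale", "unknown").lower()
        summary[locale]["phrases"] += 1
        summary[locale]["milliseconds"] += phrase.get("durationMilliseconds", 0)
    return dict(summary)


# Hand-trimmed sample that reuses values from the response above.
sample = {
    "phrases": [
        {"offsetMilliseconds": 24320, "durationMilliseconds": 6640, "locale": "fr-fr"},
        {"offsetMilliseconds": 36720, "durationMilliseconds": 10320, "locale": "en-us"},
    ]
}

for locale, stats in summarize_by_locale(sample).items():
    print(locale, stats["phrases"], stats["milliseconds"])
```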

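Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file with diarization enabled. Diarization distinguishes between the different speakers in a conversation. The Speech service provides information about which speaker was speaking a particular part of the transcribed speech.

Replace YourSpeechResourceKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

Note

When diarization is enabled, the audio file should be shorter than 2 hours.

Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.

curl --location 'https://YourServiceRegion.api.cognitive.azure.cn/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"],
    "diarization": {"maxSpeakers": 2,"enabled": true}}"'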

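Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property to match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US.
Set the diarization property to recognize and separate multiple speakers in one audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription then contains a speaker entry for each transcribed phrase.

For more information about locales, diarization, and other properties of the fast transcription API, see the "Request configuration options" section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. In this example, diarization is enabled, so the response contains speaker information for each transcribed phrase. The combinedPhrases property contains the full transcription of all speakers in the single channel.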

{
    "durationMilliseconds": 182439,
    "combinedPhrases": [
        {
            "channel": 0,
            "text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh. Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? Uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And. You're currently paying those loans off monthly, is that right? Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
        }
    ],
    "phrases": [
        {
            "channel": 0,
            "speaker": 1,
            "offsetMilliseconds": 960,
            "durationMilliseconds": 640,
            "text": "Good afternoon.",
            "words": [
                {
                    "text": "Good",
                    "offsetMilliseconds": 960,
                    "durationMilliseconds": 240
                },
                {
                    "text": "afternoon.",
                    "offsetMilliseconds": 1200,
                    "durationMilliseconds": 400
                }
            ],
            "locale": "en-US",
            "confidence": 0.93616915
        },
        {
            "channel": 0,
            "speaker": 1,
            "offsetMilliseconds": 1600,
            "durationMilliseconds": 640,
            "text": "This is Sam.",
            "words": [
                {
                    "text": "This",
                    "offsetMilliseconds": 1600,
                    "durationMilliseconds": 240
                },
                {
                    "text": "is",
                    "offsetMilliseconds": 1840,
                    "durationMilliseconds": 120
                },
                {
                    "text": "Sam.",
                    "offsetMilliseconds": 1960,
                    "durationMilliseconds": 280
                }
            ],
            "locale": "en-US",
            "confidence": 0.93616915
        },
        {
            "channel": 0,
            "speaker": 1,
            "offsetMilliseconds": 2240,
            "durationMilliseconds": 1040,
            "text": "Thank you for calling Contoso.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 2240,
                    "durationMilliseconds": 200
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 2440,
                    "durationMilliseconds": 80
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 2520,
                    "durationMilliseconds": 120
                },
                {
                    "text": "calling",
                    "offsetMilliseconds": 2640,
                    "durationMilliseconds": 200
                },
                {
                    "text": "Contoso.",
                    "offsetMilliseconds": 2840,
                    "durationMilliseconds": 440
                }
            ],
            "locale": "en-US",
            "confidence": 0.93616915
        },
        {
            "channel": 0,
            "speaker": 1,
            "offsetMilliseconds": 3280,
            "durationMilliseconds": 640,
            "text": "How can I help?",
            "words": [
                {
                    "text": "How",
                    "offsetMilliseconds": 3280,
                    "durationMilliseconds": 120
                },
                {
                    "text": "can",
                    "offsetMilliseconds": 3440,
                    "durationMilliseconds": 120
                },
                {
                    "text": "I",
                    "offsetMilliseconds": 3560,
                    "durationMilliseconds": 40
                },
                {
                    "text": "help?",
                    "offsetMilliseconds": 3600,
                    "durationMilliseconds": 320
                }
            ],
            "locale": "en-US",
            "confidence": 0.93616915
        },
        {
            "channel": 0,
            "speaker": 0,
            "offsetMilliseconds": 5040,
            "durationMilliseconds": 400,
            "text": "Hi there.",
            "words": [
                {
                    "text": "Hi",
                    "offsetMilliseconds": 5040,
                    "durationMilliseconds": 240
                },
                {
                    "text": "there.",
                    "offsetMilliseconds": 5280,
                    "durationMilliseconds": 160
                }
            ],
            "locale": "en-US",
            "confidence": 0.93616915
        },
        {
            "channel": 0,
            "speaker": 0,
            "offsetMilliseconds": 5440,
            "durationMilliseconds": 800,
            "text": "My name is Mary.",
            "words": [
                {
                    "text": "My",
                    "offsetMilliseconds": 5440,
                    "durationMilliseconds": 80
                },
                {
                    "text": "name",
                    "offsetMilliseconds": 5520,
                    "durationMilliseconds": 120
                },
                {
                    "text": "is",
                    "offsetMilliseconds": 5640,
                    "durationMilliseconds": 80
                },
                {
                    "text": "Mary.",
                    "offsetMilliseconds": 5720,
                    "durationMilliseconds": 520
                }
            ],
            "locale": "en-US",
            "confidence": 0.93616915
        },
        // More transcription results...
        // Redacted for brevity
        {
            "channel": 0,
            "speaker": 0,
            "offsetMilliseconds": 180320,
            "durationMilliseconds": 680,
            "text": "Thank you for your help.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 180320,
                    "durationMilliseconds": 160
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 180480,
                    "durationMilliseconds": 80
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 180560,
                    "durationMilliseconds": 120
                },
                {
                    "text": "your",
                    "offsetMilliseconds": 180680,
                    "durationMilliseconds": 120
                },
                {
                    "text": "help.",
                    "offsetMilliseconds": 180800,
                    "durationMilliseconds": 200
                }
            ],
            "locale": "en-US",
            "confidence": 0.9314801
        },
        {
            "channel": 0,
            "speaker": 1,
            "offsetMilliseconds": 181960,
            "durationMilliseconds": 280,
            "text": "Thank you.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 181960,
                    "durationMilliseconds": 200
                },
                {
                    "text": "you.",
                    "offsetMilliseconds": 182160,
                    "durationMilliseconds": 80
                }
            ],
            "locale": "en-US",
            "confidence": 0.9314801
        }
    ]
}
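
Because each phrase carries a speaker value, consecutive phrases can be merged into conversation turns. The following is a minimal Python sketch under that assumption. The helper name `speaker_turns` is hypothetical, and the sample reuses a few phrases from the response above with the words arrays omitted.

```python
def speaker_turns(response: dict) -> list:
    """Merge consecutive phrases from the same speaker into (speaker, text) turns."""
    turns = []
    # Order phrases by start time so that the turns follow the conversation.
    phrases = sorted(response.get("phrases", []), key=lambda p: p["offsetMilliseconds"])
    for phrase in phrases:
        speaker = phrase.get("speaker")
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous phrase: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + phrase["text"])
        else:
            turns.append((speaker, phrase["text"]))
    return turns


sample = {
    "phrases": [
        {"speaker": 1, "offsetMilliseconds": 960, "text": "Good afternoon."},
        {"speaker": 1, "offsetMilliseconds": 1600, "text": "This is Sam."},
        {"speaker": 0, "offsetMilliseconds": 5040, "text": "Hi there."},
        {"speaker": 0, "offsetMilliseconds": 5440, "text": "My name is Mary."},
    ]
}

for speaker, text in speaker_turns(sample):
    print(f"Speaker {speaker}: {text}")
```

The speaker values are labels that distinguish voices within one file. Mapping them to names such as Sam or Mary is left to the application.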

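Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.

The following example shows how to transcribe an audio file that has one or two channels. Multi-channel transcription is useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If that isn't desirable, channels can be transcribed independently without merging.

Replace YourSpeechResourceKey with your Speech resource key.
Replace YourServiceRegion with your Speech resource region.
Replace YourAudioFile with the path to your audio file.

Important

For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.

curl --location 'https://YourServiceRegion.api.cognitive.azure.cn/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResourceKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    "locales":["en-US"],
    "channels": [0,1]}"'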

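Construct the form definition according to the following instructions:

Set the optional (but recommended) locales property to match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The locales that you can specify are: de-DE, en-GB, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
Set the channels property to specify the zero-based indices of the channels to transcribe separately. Up to two channels are supported unless diarization is enabled. In this example, channels 0 and 1 are specified.

For more information about locales, channels, and other properties of the fast transcription API, see the "Request configuration options" section later in this guide.

The response includes durationMilliseconds, offsetMilliseconds, and more. If the audio file has multiple channels, the channel property identifies the channel. The combinedPhrases property contains the full transcription, separated by audio channel. Look for "channel": 0,"text" and "channel": 1,"text" to identify the full transcription for each channel.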

{
    "durationMilliseconds": 185079,
    "combinedPhrases": [
        {
            "channel": 0,
            "text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
        },
        {
            "channel": 1,
            "text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
        }
    ],
    "phrases": [
        {
            "channel": 0,
            "offsetMilliseconds": 720,
            "durationMilliseconds": 480,
            "text": "Hello.",
            "words": [
                {
                    "text": "Hello.",
                    "offsetMilliseconds": 720,
                    "durationMilliseconds": 480
                }
            ],
            "locale": "en-US",
            "confidence": 0.9177142
        },
        {
            "channel": 0,
            "offsetMilliseconds": 1200,
            "durationMilliseconds": 1120,
            "text": "Thank you for calling Contoso.",
            "words": [
                {
                    "text": "Thank",
                    "offsetMilliseconds": 1200,
                    "durationMilliseconds": 200
                },
                {
                    "text": "you",
                    "offsetMilliseconds": 1400,
                    "durationMilliseconds": 80
                },
                {
                    "text": "for",
                    "offsetMilliseconds": 1480,
                    "durationMilliseconds": 120
                },
                {
                    "text": "calling",
                    "offsetMilliseconds": 1600,
                    "durationMilliseconds": 240
                },
                {
                    "text": "Contoso.",
                    "offsetMilliseconds": 1840,
                    "durationMilliseconds": 480
                }
            ],
            "locale": "en-US",
            "confidence": 0.9177142
        },
        {
            "channel": 0,
            "offsetMilliseconds": 2320,
            "durationMilliseconds": 1120,
            "text": "Who am I speaking with today?",
            "words": [
                {
                    "text": "Who",
                    "offsetMilliseconds": 2320,
                    "durationMilliseconds": 160
                },
                {
                    "text": "am",
                    "offsetMilliseconds": 2480,
                    "durationMilliseconds": 80
                },
                {
                    "text": "I",
                    "offsetMilliseconds": 2560,
                    "durationMilliseconds": 80
                },
                {
                    "text": "speaking",
                    "offsetMilliseconds": 2640,
                    "durationMilliseconds": 320
                },
                {
                    "text": "with",
                    "offsetMilliseconds": 2960,
                    "durationMilliseconds": 160
                },
                {
                    "text": "today?",
                    "offsetMilliseconds": 3120,
                    "durationMilliseconds": 320
                }
            ],
            "locale": "en-US",
            "confidence": 0.9177142
        },
        {
            "channel": 0,
            "offsetMilliseconds": 9520,
            "durationMilliseconds": 400,
            "text": "Hi, Mary.",
            "words": [
                {
                    "text": "Hi,",
                    "offsetMilliseconds": 9520,
                    "durationMilliseconds": 80
                },
                {
                    "text": "Mary.",
                    "offsetMilliseconds": 9600,
                    "durationMilliseconds": 320
                }
            ],
            "locale": "en-US",
            "confidence": 0.9177142
        },
        // More transcription results...
        // Redacted for brevity
        {
            "channel": 1,
            "offsetMilliseconds": 4480,
            "durationMilliseconds": 1600,
            "text": "Hi, my name is Mary Rondo.",
            "words": [
                {
                    "text": "Hi,",
                    "offsetMilliseconds": 4480,
                    "durationMilliseconds": 400
                },
                {
                    "text": "my",
                    "offsetMilliseconds": 4880,
                    "durationMilliseconds": 120
                },
                {
                    "text": "name",
                    "offsetMilliseconds": 5000,
                    "durationMilliseconds": 120
                },
                {
                    "text": "is",
                    "offsetMilliseconds": 5120,
                    "durationMilliseconds": 160
                },
                {
                    "text": "Mary",
                    "offsetMilliseconds": 5280,
                    "durationMilliseconds": 240
                },
                {
                    "text": "Rondo.",
                    "offsetMilliseconds": 5520,
                    "durationMilliseconds": 560
                }
            ],
            "locale": "en-US",
            "confidence": 0.8989456
        },
        {
            "channel": 1,
            "offsetMilliseconds": 6080,
            "durationMilliseconds": 1920,
            "text": "I'm trying to enroll myself with Contuso.",
            "words": [
                {
                    "text": "I'm",
                    "offsetMilliseconds": 6080,
                    "durationMilliseconds": 160
                },
                {
                    "text": "trying",
                    "offsetMilliseconds": 6240,
                    "durationMilliseconds": 200
                },
                {
                    "text": "to",
                    "offsetMilliseconds": 6440,
                    "durationMilliseconds": 80
                },
                {
                    "text": "enroll",
                    "offsetMilliseconds": 6520,
                    "durationMilliseconds": 200
                },
                {
                    "text": "myself",
                    "offsetMilliseconds": 6720,
                    "durationMilliseconds": 360
                },
                {
                    "text": "with",
                    "offsetMilliseconds": 7080,
                    "durationMilliseconds": 120
                },
                {
                    "text": "Contuso.",
                    "offsetMilliseconds": 7200,
                    "durationMilliseconds": 800
                }
            ],
            "locale": "en-US",
            "confidence": 0.8989456
        },
        // More transcription results...
        // Redacted for brevity
    ]
}
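
When channels are transcribed separately, each phrase in the response carries a `channel` index together with its offset and duration in milliseconds. The following minimal Python sketch shows one way the phrases might be regrouped per channel. It assumes the response body was already parsed into a dictionary whose `phrases` array is shaped like the example above. The `sample_response` data and the `group_phrases_by_channel` helper are illustrative names, not part of the service or any SDK.

```python
from collections import defaultdict


def group_phrases_by_channel(response: dict) -> dict:
    """Return {channel index: [phrase text, ...]} ordered by start offset."""
    grouped = defaultdict(list)
    # Sort by offset so each channel's phrases read in chronological order.
    ordered = sorted(response.get("phrases", []), key=lambda p: p["offsetMilliseconds"])
    for phrase in ordered:
        # Assumption: a phrase without a "channel" field belongs to channel 0.
        grouped[phrase.get("channel", 0)].append(phrase["text"])
    return dict(grouped)


# Trimmed, hand-written stand-in for a parsed response (illustrative data only).
sample_response = {
    "phrases": [
        {"channel": 1, "offsetMilliseconds": 6080, "durationMilliseconds": 1920,
         "text": "I'm trying to enroll myself with Contuso."},
        {"channel": 1, "offsetMilliseconds": 4480, "durationMilliseconds": 1600,
         "text": "Hi, my name is Mary Rondo."},
    ]
}

for channel, texts in group_phrases_by_channel(sample_response).items():
    print(f"Channel {channel}: {' '.join(texts)}")
```

Sorting by `offsetMilliseconds` before grouping keeps each channel readable as a running transcript even if the phrases arrive interleaved.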

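Note

The Speech service is an elastic service. If you receive a 429 error code (too many requests), follow the best practices to mitigate throttling during autoscaling.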

Request configuration options

The following properties configure transcription when you call the Transcriptions - Transcribe operation.

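Property	Description	Required or optional
`channels`	The list of zero-based indices of the channels to transcribe separately. Up to two channels are supported unless diarization is enabled. By default, the fast transcription API merges all input channels into a single channel and then transcribes it. If that isn't what you want, the channels can be transcribed independently without merging. To transcribe the channels of a stereo audio file separately, specify `[0,1]`, `[0]`, or `[1]`. Otherwise, stereo audio is merged to mono and only a single channel is transcribed. If the audio is stereo and diarization is enabled, you can't set the `channels` property to `[0,1]`, because the Speech service doesn't support diarization of multiple channels. For mono audio, the `channels` property is ignored and the audio is always transcribed as a single channel.	Optional
`diarization`	The diarization configuration. Diarization is the process of identifying and separating speakers within one audio channel. For example, specify `"diarization": {"maxSpeakers": 2, "enabled": true}`. The transcription then includes a `speaker` entry (such as `"speaker": 0` or `"speaker": 1`) for each transcribed phrase.	Optional
`locales`	The list of locales that should match the expected locale of the audio data to transcribe. If you know the locale of the audio file, specify it to improve transcription accuracy and reduce latency. If you specify a single locale, that locale is used for transcription. If you aren't sure of the locale, specify multiple locales to use language identification. The more precise the candidate list, the more accurate language identification is likely to be. If you don't specify any locale, the Speech service uses the latest multilingual model to identify the locale and transcribe automatically. You can get the latest supported languages from the Transcriptions - List Supported Locales REST API (API version 2024-11-15 or later). For more information about locales, see the Speech service language support documentation.	Optional, but recommended if you know the expected locale.
`phraseList`	A phrase list is a list of words or phrases provided ahead of time to help improve recognition. Phrases added to the list are given more weight, which makes them more likely to be recognized. For example, specify `"phraseList":{"phrases":["Contoso","Jessie","Rehaan"]}`. Phrase lists are supported in API version 2025-10-15. For more information, see Improve recognition accuracy with phrase list.	Optional
`profanityFilterMode`	Specifies how to handle profanity in recognition results. Accepted values are `None` (disables profanity filtering), `Masked` (replaces profanity with asterisks), `Removed` (removes all profanity from the result), and `Tags` (adds profanity tags). The default value is `Masked`.	Optional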
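
The properties in the table are sent together as one JSON string in the `definition` form field. The following Python sketch shows how that string might be assembled and how the stereo-plus-diarization restriction from the table could be checked before the request is sent. The `build_definition` helper and its parameter names are hypothetical. Only the JSON property names come from the table.

```python
import json
from typing import List, Optional


def build_definition(
    locales: Optional[List[str]] = None,
    channels: Optional[List[int]] = None,
    max_speakers: Optional[int] = None,
    phrases: Optional[List[str]] = None,
    profanity_filter_mode: Optional[str] = None,
) -> str:
    """Serialize request options into the JSON string for the `definition` field."""
    # The table states that diarization of multiple channels isn't supported.
    if max_speakers is not None and channels is not None and len(channels) > 1:
        raise ValueError("Diarization can't be combined with multiple separate channels.")

    definition = {}
    if locales:
        definition["locales"] = locales
    if channels is not None:
        definition["channels"] = channels
    if max_speakers is not None:
        definition["diarization"] = {"maxSpeakers": max_speakers, "enabled": True}
    if phrases:
        definition["phraseList"] = {"phrases": phrases}
    if profanity_filter_mode is not None:
        definition["profanityFilterMode"] = profanity_filter_mode
    return json.dumps(definition)


print(build_definition(locales=["en-US"], max_speakers=2, phrases=["Contoso"]))
```

Leaving a property out of the dictionary, rather than sending a null value, lets the service apply its documented default, such as `Masked` for `profanityFilterMode`.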

Reference documentation | Package (PyPI) | GitHub samples

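Prerequisites

An Azure subscription. Create a trial account.
Python 3.9 or later. If you don't have a suitable version of Python installed, follow the instructions in the VS Code Python Tutorial for the easiest way to install Python on your operating system.
An AI services resource created in one of the supported regions. For more information about region availability, see Region support.
A sample .wav audio file to transcribe.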

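Microsoft Entra ID prerequisites

To use the recommended keyless authentication with Microsoft Entra ID, you need to:

Install the Azure CLI, which is used for keyless authentication with Microsoft Entra ID.
Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.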

Setup

Use the following command to create a new folder named transcription-quickstart and go to that folder:
```
mkdir transcription-quickstart && cd transcription-quickstart
```
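Create and activate a virtual Python environment to install the packages this tutorial needs. We recommend always using a virtual or conda environment when you install Python packages. Otherwise, you can break your global installation of Python. If you already have Python 3.9 or later installed, create a virtual environment with the commands for your operating system:
Windows:
```
py -3 -m venv .venv
.venv\Scripts\Activate.ps1
```
Linux:
```
python3 -m venv .venv
source .venv/bin/activate
```
macOS:
```
python3 -m venv .venv
source .venv/bin/activate
```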
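When the Python environment is activated, running python or pip from the command line uses the Python interpreter in your application's .venv folder. Use the deactivate command to exit the Python virtual environment. You can reactivate it later when needed.
Create a file named requirements.txt. Add the following packages to the file: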
```
azure-ai-transcription
azure-identity
```
Install the packages:
```
pip install -r requirements.txt
```

Note

For Microsoft Entra ID authentication (recommended for production), install azure-identity and configure authentication as described in the Microsoft Entra ID prerequisites section.

Code

Create a file named transcribe_audio_file.py with the following code:

import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import TranscriptionContent, TranscriptionOptions

# Get configuration from environment variables
endpoint = os.environ["AZURE_SPEECH_ENDPOINT"]
api_key = os.environ["AZURE_SPEECH_API_KEY"]

# Create the transcription client
client = TranscriptionClient(endpoint=endpoint, credential=AzureKeyCredential(api_key))

# Path to your audio file (replace with your own file path)
audio_file_path = "<path-to-your-audio-file.wav>"

# Open and read the audio file
with open(audio_file_path, "rb") as audio_file:
    # Create transcription options
    options = TranscriptionOptions(locales=["en-US"])  # Specify the language

    # Create the request content
    request_content = TranscriptionContent(definition=options, audio=audio_file)

    # Transcribe the audio
    result = client.transcribe(request_content)

    # Print the transcription result
    print(f"Transcription: {result.combined_phrases[0].text}")

    # Print detailed phrase information
    if result.phrases:
        print("\nDetailed phrases:")
        for phrase in result.phrases:
            print(
                f"  [{phrase.offset_milliseconds}ms - "
                f"{phrase.offset_milliseconds + phrase.duration_milliseconds}ms]: "
                f"{phrase.text}"
            )

Reference: TranscriptionClient | TranscriptionContent | TranscriptionOptions | AzureKeyCredential

Replace <path-to-your-audio-file.wav> with the path to your audio file. The service supports WAV, MP3, FLAC, OGG, and other common audio formats.
Run the Python script:
```
python transcribe_audio_file.py
```

Output

The script prints the transcription result to the console:

Transcription: Hi there! This is a sample voice recording created for speech synthesis testing. The quick brown fox jumps over the lazy dog. Just a fun way to include every letter of the alphabet. Numbers, like 1, 2, 3, are spoken clearly. Let's see how well this voice captures tone, timing, and natural rhythm. This audio is provided by samplefiles.com.

Detailed phrases:
  [40ms - 4880ms]: Hi there! This is a sample voice recording created for speech synthesis testing.
  [5440ms - 8400ms]: The quick brown fox jumps over the lazy dog.
  [9040ms - 12240ms]: Just a fun way to include every letter of the alphabet.
  [12720ms - 16720ms]: Numbers, like 1, 2, 3, are spoken clearly.
  [17200ms - 22000ms]: Let's see how well this voice captures tone, timing, and natural rhythm.
  [22480ms - 25920ms]: This audio is provided by samplefiles.com.
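
Because every phrase comes with a start offset and a duration in milliseconds, the result can be turned into captions, one of the use cases this article mentions. The following sketch converts phrases to the SubRip (SRT) text format. The helper names `to_srt_timestamp` and `phrases_to_srt` are hypothetical, and the input is simplified to plain `(offset_ms, duration_ms, text)` tuples so the example runs without the SDK.

```python
def to_srt_timestamp(milliseconds: int) -> str:
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1_000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def phrases_to_srt(phrases) -> str:
    """Build SRT text from an iterable of (offset_ms, duration_ms, text) tuples."""
    blocks = []
    for index, (offset_ms, duration_ms, text) in enumerate(phrases, start=1):
        start = to_srt_timestamp(offset_ms)
        end = to_srt_timestamp(offset_ms + duration_ms)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# The first two phrases from the sample output above, as (offset, duration, text).
sample_phrases = [
    (40, 4840, "Hi there! This is a sample voice recording created for speech synthesis testing."),
    (5440, 2960, "The quick brown fox jumps over the lazy dog."),
]
print(phrases_to_srt(sample_phrases))
```

With the quickstart's `result` object, the tuples could be produced with `(p.offset_milliseconds, p.duration_milliseconds, p.text) for p in result.phrases`.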

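Request configuration options

Use TranscriptionOptions to customize transcription behavior. The following sections describe each supported configuration and show how to apply it. The snippets reuse the endpoint, api_key, and audio_file_path variables defined in the quickstart code.

Multi-language detection

Pass multiple candidate locales to locales to enable language identification across languages. The service detects the spoken language and tags each phrase with the detected locale. Omitting locales entirely lets the service detect all languages automatically, without a candidate list.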

from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import TranscriptionContent, TranscriptionOptions

client = TranscriptionClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key)
)

with open(audio_file_path, "rb") as audio_file:
    # Provide candidate locales; the service selects the best match per phrase
    options = TranscriptionOptions(locales=["en-US", "es-ES", "fr-FR", "de-DE"])
    result = client.transcribe(TranscriptionContent(definition=options, audio=audio_file))

    for phrase in result.phrases:
        locale = phrase.locale if phrase.locale else "detected"
        print(f"[{locale}] {phrase.text}")

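Reference: TranscriptionOptions

Speaker diarization

Diarization detects and labels the different speakers within a single audio channel. Create a TranscriptionDiarizationOptions object with the maximum number of expected speakers (2-35), and then pass that object to TranscriptionOptions. Each phrase in the result includes a speaker identifier.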

from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import (
    TranscriptionContent,
    TranscriptionOptions,
    TranscriptionDiarizationOptions,
)

client = TranscriptionClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key)
)

with open(audio_file_path, "rb") as audio_file:
    diarization_options = TranscriptionDiarizationOptions(
        max_speakers=5  # Hint for maximum number of speakers (2-35)
    )
    options = TranscriptionOptions(
        locales=["en-US"], diarization_options=diarization_options
    )
    result = client.transcribe(TranscriptionContent(definition=options, audio=audio_file))

    for phrase in result.phrases:
        speaker = phrase.speaker if phrase.speaker is not None else "Unknown"
        print(f"Speaker {speaker} [{phrase.offset_milliseconds}ms]: {phrase.text}")

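Note

Diarization is supported only for single-channel (mono) audio. If your audio is stereo, don't set the channels property to [0, 1] when diarization is enabled.

Reference: TranscriptionDiarizationOptions | TranscriptionOptions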
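
Diarized output is one phrase per line, so a single speaker's turn can span several consecutive phrases. The following sketch merges consecutive phrases from the same speaker into turns, which is often easier to read. The `merge_speaker_turns` helper is a hypothetical name, and the input is simplified to `(speaker, text)` pairs instead of SDK phrase objects.

```python
from itertools import groupby


def merge_speaker_turns(phrases):
    """Merge consecutive (speaker, text) pairs that share a speaker into one turn."""
    turns = []
    # groupby only groups adjacent items, which is exactly what a "turn" is.
    for speaker, group in groupby(phrases, key=lambda item: item[0]):
        turns.append((speaker, " ".join(text for _, text in group)))
    return turns


# Hand-written sample data; speaker IDs mirror the integer `speaker` field.
sample = [
    (0, "Hello."),
    (0, "How can I help?"),
    (1, "I'd like to enroll."),
    (0, "Sure."),
]
for speaker, text in merge_speaker_turns(sample):
    label = speaker if speaker is not None else "Unknown"
    print(f"Speaker {label}: {text}")
```

With the SDK result, the pairs could be built with `(p.speaker, p.text) for p in result.phrases`.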

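Phrase list

Phrase lists improve recognition accuracy for domain-specific terms, proper nouns, and uncommon words. Set biasing_weight between 1.0 and 20.0 to control how strongly the phrases are favored (higher values increase the bias).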

from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import (
    TranscriptionContent,
    TranscriptionOptions,
    PhraseListProperties,
)

client = TranscriptionClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key)
)

with open(audio_file_path, "rb") as audio_file:
    phrase_list = PhraseListProperties(
        phrases=["Contoso", "Jessie", "Rehaan"],
        biasing_weight=5.0,  # Weight between 1.0 and 20.0
    )
    options = TranscriptionOptions(locales=["en-US"], phrase_list=phrase_list)
    result = client.transcribe(TranscriptionContent(definition=options, audio=audio_file))

    print(result.combined_phrases[0].text)

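For more information, see Improve recognition accuracy with phrase list.

Reference: PhraseListProperties | TranscriptionOptions

Profanity filtering

Use the profanity_filter_mode parameter to control how profanity appears in the transcription output. The following modes are available:

Mode	Behavior
`"None"`	Profanity passes through unchanged.
`"Masked"`	Profanity is replaced with asterisks (default).
`"Removed"`	Profanity is removed from the output entirely.
`"Tags"`	Profanity is wrapped in `<profanity>` XML tags.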

from azure.core.credentials import AzureKeyCredential
from azure.ai.transcription import TranscriptionClient
from azure.ai.transcription.models import TranscriptionContent, TranscriptionOptions

client = TranscriptionClient(
    endpoint=endpoint, credential=AzureKeyCredential(api_key)
)

with open(audio_file_path, "rb") as audio_file:
    options = TranscriptionOptions(
        locales=["en-US"],
        profanity_filter_mode="Masked"  # Options: "None", "Removed", "Masked", "Tags"
    )
    result = client.transcribe(TranscriptionContent(definition=options, audio=audio_file))

    print(result.combined_phrases[0].text)

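Reference: TranscriptionOptions

Reference documentation | Package (NuGet) | GitHub samples

Prerequisites

An Azure subscription. Create a trial account.
.NET 8.0 SDK or later.
An Azure AI services resource created in a supported region. For more information about region availability, see Region support.
A sample .wav audio file to transcribe.

Microsoft Entra ID prerequisites

To use the recommended keyless authentication with Microsoft Entra ID, you need to:

Install the Azure CLI, which is used for keyless authentication with Microsoft Entra ID.
Sign in with the Azure CLI by running az login.
Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.

Set up the project

Create a new console application with the .NET CLI: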

dotnet new console -n transcription-quickstart
cd transcription-quickstart

Install the required packages:

dotnet add package Azure.AI.Speech.Transcription --prerelease
dotnet add package Azure.Identity

Transcribe audio

Replace the contents of Program.cs with the following code:

using System;
using System.ClientModel;
using System.Linq;
using System.Threading.Tasks;
using Azure.AI.Speech.Transcription;
using Azure.Identity;

Uri endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_SPEECH_ENDPOINT")
    ?? throw new InvalidOperationException("Set the AZURE_SPEECH_ENDPOINT environment variable."));

// Use DefaultAzureCredential for keyless authentication (recommended).
// To use an API key instead, replace with:
// ApiKeyCredential credential = new ApiKeyCredential("<your-api-key>");
var credential = new DefaultAzureCredential();
TranscriptionClient client = new TranscriptionClient(endpoint, credential);

string audioFilePath = "<path-to-your-audio-file.wav>";
using FileStream audioStream = File.OpenRead(audioFilePath);

TranscriptionOptions options = new TranscriptionOptions(audioStream);
ClientResult<TranscriptionResult> response = await client.TranscribeAsync(options);

var channelPhrases = response.Value.PhrasesByChannel.First();
Console.WriteLine(channelPhrases.Text);

Run the application:

dotnet run

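The transcribed text from your audio file is printed to the console.

Access word-level details

To access timestamps, confidence scores, and individual words, iterate through the phrases: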

var channelPhrases = response.Value.PhrasesByChannel.First();

foreach (TranscribedPhrase phrase in channelPhrases.Phrases)
{
    Console.WriteLine($"\nPhrase: {phrase.Text}");
    Console.WriteLine($"  Offset: {phrase.Offset} | Duration: {phrase.Duration}");
    Console.WriteLine($"  Confidence: {phrase.Confidence:F2}");

    foreach (TranscribedWord word in phrase.Words)
    {
        Console.WriteLine(
            $"    Word: '{word.Text}' | " +
            $"Confidence: {word.Confidence:F2} | " +
            $"Offset: {word.Offset}");
    }
}

Reference: TranscribedPhrase | TranscribedWord

Identify speakers with diarization

Diarization identifies who is speaking in multi-speaker audio:

TranscriptionOptions options = new TranscriptionOptions(audioStream)
{
    DiarizationOptions = new TranscriptionDiarizationOptions
    {
        MaxSpeakers = 4
    }
};

ClientResult<TranscriptionResult> response = await client.TranscribeAsync(options);

var channelPhrases = response.Value.PhrasesByChannel.First();
foreach (TranscribedPhrase phrase in channelPhrases.Phrases)
{
    Console.WriteLine($"Speaker {phrase.Speaker}: {phrase.Text}");
}

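Reference: TranscriptionDiarizationOptions

Reference documentation | Package (npm) | GitHub samples

Prerequisites

An Azure subscription. Create a trial account.
Node.js LTS.
An AI services resource created in one of the supported regions. For more information about region availability, see Region support.
A sample .wav audio file to transcribe.

Microsoft Entra ID prerequisites

To use the recommended keyless authentication with Microsoft Entra ID, you need to:

Install the Azure CLI, which is used for keyless authentication with Microsoft Entra ID.
Sign in with the Azure CLI by running az login.
Assign the Cognitive Services User role to your user account. You can assign roles in the Azure portal under Access control (IAM) > Add role assignment.

Set up the project

Create a new folder and initialize a Node.js project: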

mkdir transcription-quickstart
cd transcription-quickstart
npm init -y

Install the required packages:

npm install @azure/ai-speech-transcription @azure/identity

Configure the project to use ES modules by setting the module type in package.json:
```
npm pkg set type=module
```
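Alternatively, you can manually add "type": "module" to your package.json file. This setting is required for the import statements in the sample code to work.

Retrieve resource information

You need to retrieve your resource endpoint to authenticate.

Sign in to the Azure portal.
Select Keys and Endpoint from the left menu of your Speech or multi-service resource.

Copy the Endpoint value and set it as an environment variable:

Windows (PowerShell):

$env:AZURE_SPEECH_ENDPOINT="<your-speech-endpoint>"

Linux or macOS (Bash):

export AZURE_SPEECH_ENDPOINT="<your-speech-endpoint>"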

Transcribe audio

Create a file named transcribe-audio-file.js with the following code:

import { readFileSync } from "node:fs";
import { DefaultAzureCredential } from "@azure/identity";
import { TranscriptionClient } from "@azure/ai-speech-transcription";

const endpoint = process.env.AZURE_SPEECH_ENDPOINT;
if (!endpoint) {
  throw new Error("Set the AZURE_SPEECH_ENDPOINT environment variable.");
}

// Use DefaultAzureCredential for keyless authentication (recommended).
const client = new TranscriptionClient(endpoint, new DefaultAzureCredential());

const audioFile = readFileSync("<path-to-your-audio-file.wav>");

const result = await client.transcribe(audioFile, {
  locales: ["en-US"],
});

console.log("Transcription:", result.combinedPhrases[0]?.text ?? "No text");

Reference: TranscriptionClient | DefaultAzureCredential

Replace <path-to-your-audio-file.wav> with the path to your audio file.
Run the app:
```
node transcribe-audio-file.js
```

Output

The app prints the transcribed text to the console:

Transcription: Hi there! This is a sample voice recording.

Common request options

Identify speakers with diarization

const result = await client.transcribe(audioFile, {
  locales: ["en-US"],
  diarizationOptions: {
    maxSpeakers: 4,
  },
});

for (const phrase of result.phrases) {
  console.log(`Speaker ${phrase.speaker}: ${phrase.text}`);
}

Reference: TranscriptionDiarizationOptions

Set profanity filtering

import {
  KnownProfanityFilterModes,
} from "@azure/ai-speech-transcription";

const result = await client.transcribe(audioFile, {
  locales: ["en-US"],
  profanityFilterMode: KnownProfanityFilterModes.Masked,
});

Reference: KnownProfanityFilterModes

Add a phrase list

Use a phrase list to improve recognition of domain-specific terms, proper nouns, and acronyms:

const result = await client.transcribe(audioFile, {
  locales: ["en-US"],
  phraseList: {
    phrases: ["Contoso", "Jessie", "Rehaan"],
  },
});

console.log("Transcription:", result.combinedPhrases[0]?.text ?? "No text");

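Reference: PhraseListProperties

Enable multi-language detection

If you aren't sure which language is spoken, pass multiple candidate locales. The service detects the language and returns the corresponding locale for each phrase: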

const result = await client.transcribe(audioFile, {
  locales: ["en-US", "es-ES"],
});

for (const phrase of result.phrases) {
  console.log(`[${phrase.locale}] ${phrase.text}`);
}

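Reference: TranscribedPhrase

Reference documentation | Package (Maven) | GitHub samples

Prerequisites

An Azure subscription. Create a trial account.
Java Development Kit (JDK) 8 or later.
Apache Maven for dependency management and building the project.
An AI services resource in one of the supported regions. For more information about region availability, see Speech service supported regions.
A sample .wav audio file to transcribe.

Set up the environment

Create a new folder named transcription-quickstart and navigate to it.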
```
mkdir transcription-quickstart && cd transcription-quickstart
```

In the root of your project directory, create a pom.xml file with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>transcription-quickstart</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <name>Speech Transcription Quickstart</name>
    <description>Quickstart sample for Azure Speech Transcription client library.</description>
    <url>https://github.com/Azure/azure-sdk-for-java</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-ai-speech-transcription</artifactId>
            <version>1.0.0-beta.2</version>
        </dependency>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-identity</artifactId>
            <version>1.18.1</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>.</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <mainClass>TranscriptionQuickstart</mainClass>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

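Note

The <sourceDirectory>.</sourceDirectory> configuration tells Maven to look for Java source files in the current directory instead of the default src/main/java structure. This configuration change allows for a simpler, flat project structure.

Install the dependencies: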
```
mvn clean install
```

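Set environment variables

Your application must be authenticated to access the Speech service. The SDK supports both API key and Microsoft Entra ID authentication. The quickstart code picks the method based on which environment variables you set.

First, set the endpoint of your Speech resource. Replace <your-speech-endpoint> with your resource's actual endpoint value:

Windows (PowerShell):

$env:AZURE_SPEECH_ENDPOINT="<your-speech-endpoint>"

Linux or macOS (Bash):

export AZURE_SPEECH_ENDPOINT="<your-speech-endpoint>"

Then, choose one of the following authentication methods:

Option 1: API key authentication (recommended for getting started)

Set the API key environment variable:

Windows (PowerShell):

$env:AZURE_SPEECH_API_KEY="<your-speech-key>"

Linux or macOS (Bash):

export AZURE_SPEECH_API_KEY="<your-speech-key>"

Option 2: Microsoft Entra ID authentication (recommended for production)

Don't set AZURE_SPEECH_API_KEY. Instead, configure one of the following credential sources:

Azure CLI: Run az login on your development machine.
Managed identity: For apps running in Azure (App Service, Azure Functions, VMs).
Environment variables: Set AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET.
Visual Studio Code or IntelliJ: Sign in through your IDE.

You also need to assign the Cognitive Services User role to your identity: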

az role assignment create --assignee <your-identity> \
    --role "Cognitive Services User" \
    --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<speech-resource-name>

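Note

After you set environment variables on Windows, restart any running programs that need to read them, including the console window. On Linux or macOS, run source ~/.bashrc (or the equivalent shell configuration file) for the changes to take effect.

Create the application

Create a file named TranscriptionQuickstart.java in your project directory with the following code: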

import com.azure.ai.speech.transcription.TranscriptionClient;
import com.azure.ai.speech.transcription.TranscriptionClientBuilder;
import com.azure.ai.speech.transcription.models.AudioFileDetails;
import com.azure.ai.speech.transcription.models.TranscriptionOptions;
import com.azure.ai.speech.transcription.models.TranscriptionResult;
import com.azure.core.credential.KeyCredential;
import com.azure.core.util.BinaryData;
import com.azure.identity.DefaultAzureCredentialBuilder;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TranscriptionQuickstart {
    public static void main(String[] args) {
        try {
            // Get credentials from environment variables
            String endpoint = System.getenv("AZURE_SPEECH_ENDPOINT");
            String apiKey = System.getenv("AZURE_SPEECH_API_KEY");

            // Create client with API key or Entra ID authentication
            TranscriptionClientBuilder builder = new TranscriptionClientBuilder()
                .endpoint(endpoint);

            TranscriptionClient client;
            if (apiKey != null && !apiKey.isEmpty()) {
                // Use API key authentication
                client = builder.credential(new KeyCredential(apiKey)).buildClient();
            } else {
                // Use Entra ID authentication
                client = builder.credential(new DefaultAzureCredentialBuilder().build()).buildClient();
            }

            // Load audio file
            String audioFilePath = "<path-to-your-audio-file.wav>";
            byte[] audioData = Files.readAllBytes(Paths.get(audioFilePath));

            // Create audio file details
            AudioFileDetails audioFileDetails = new AudioFileDetails(BinaryData.fromBytes(audioData));

            // Transcribe
            TranscriptionOptions options = new TranscriptionOptions(audioFileDetails);
            TranscriptionResult result = client.transcribe(options);

            // Print result
            System.out.println("Transcription:");
            result.getCombinedPhrases().forEach(phrase ->
                System.out.println(phrase.getText())
            );

        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Replace <path-to-your-audio-file.wav> with the path to your audio file.

Run the application

Run the application with Maven:

mvn compile exec:java

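Request configuration options

Use TranscriptionOptions to customize transcription behavior. The following sections describe each supported configuration and show how to apply it.

Multi-language detection

If you don't specify a locale, the service automatically detects and transcribes all languages present in the audio. Each returned phrase includes a locale field that identifies the detected language.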

// No locale specified — service auto-detects all languages in the audio
TranscriptionOptions options = new TranscriptionOptions(audioFileDetails);
TranscriptionResult result = client.transcribe(options);

// Each phrase reports the detected locale
result.getPhrases().forEach(phrase ->
    System.out.println(phrase.getLocale() + ": " + phrase.getText())
);

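Note

When no locale is specified, the locale field of individual phrases might not always accurately reflect the language of that specific phrase. For the highest accuracy, specify the expected locales when you know them.

Reference: TranscriptionOptions | TranscribedPhrase.getLocale()

Speaker diarization

Diarization detects and labels the different speakers within a single audio channel. Use TranscriptionDiarizationOptions to enable the feature and set the maximum number of expected speakers (2-35). Each phrase in the result includes a speaker identifier.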

import com.azure.ai.speech.transcription.models.TranscriptionDiarizationOptions;

// Configure diarization with a maximum of 5 speakers
TranscriptionDiarizationOptions diarizationOptions =
    new TranscriptionDiarizationOptions()
        .setMaxSpeakers(5);

TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
    .setDiarizationOptions(diarizationOptions);

TranscriptionResult result = client.transcribe(options);

// Each phrase includes the detected speaker ID
result.getPhrases().forEach(phrase ->
    System.out.println(
        "[Speaker " + phrase.getSpeaker() + "] " + phrase.getText()
    )
);

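Note

Diarization is supported only for single-channel (mono) audio. If your audio is stereo, don't set the channels property to [0,1] when diarization is enabled.

Reference: TranscriptionDiarizationOptions | TranscriptionOptions.setDiarizationOptions() | TranscribedPhrase.getSpeaker()

Phrase list

Phrase lists improve recognition accuracy for domain-specific terms, proper nouns, and uncommon words. The recognizer gives the phrases you add more weight, which makes them more likely to be recognized correctly.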

import com.azure.ai.speech.transcription.models.PhraseListOptions;
import java.util.Arrays;

// Add terms that appear in your audio to improve recognition
PhraseListOptions phraseListOptions = new PhraseListOptions()
    .setPhrases(Arrays.asList("Contoso", "Jessie", "Rehaan"));

TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
    .setPhraseListOptions(phraseListOptions);

TranscriptionResult result = client.transcribe(options);

result.getCombinedPhrases().forEach(phrase ->
    System.out.println(phrase.getText())
);

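For more information, see Improve recognition accuracy with phrase list.

Reference: PhraseListOptions | TranscriptionOptions.setPhraseListOptions()

Profanity filtering

Use ProfanityFilterMode to control how profanity appears in the transcription output. The following modes are available:

Mode	Behavior
`NONE`	Profanity passes through unchanged.
`MASKED`	Profanity is replaced with asterisks (default).
`REMOVED`	Profanity is removed from the output entirely.
`TAGS`	Profanity is wrapped in XML tags.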

import com.azure.ai.speech.transcription.models.ProfanityFilterMode;

TranscriptionOptions options = new TranscriptionOptions(audioFileDetails)
    .setProfanityFilterMode(ProfanityFilterMode.MASKED);

TranscriptionResult result = client.transcribe(options);

System.out.println(result.getCombinedPhrases().get(0).getText());

Reference: ProfanityFilterMode | TranscriptionOptions.setProfanityFilterMode()

Clean up resources

When you finish the quickstart, you can delete the project folder:

rm -rf transcription-quickstart

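Transcription error handling

Implement retry logic with exponential backoff

When you call the fast transcription API, implement retry logic to handle transient errors and rate limiting. The API enforces rate limits, which can result in HTTP 429 responses during highly concurrent operations.

Recommended retry configuration

Retry transient errors up to 5 times.
Use exponential backoff: 2s, 4s, 8s, 16s, 32s.
Total backoff time: 62 seconds.

This configuration gives the API enough time to recover during rate-limit windows, especially when you run batch operations with multiple concurrent workers.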
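
The recommended schedule is a doubling sequence that starts at 2 seconds. As a quick check of the arithmetic, the following sketch generates the delays and their total. The `backoff_delays` helper and its parameter names are illustrative, not part of any SDK.

```python
def backoff_delays(max_retries: int = 5, base_seconds: float = 2.0) -> list:
    """Return the wait before each retry: base, 2 * base, 4 * base, and so on."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays()
print(delays)       # [2.0, 4.0, 8.0, 16.0, 32.0]
print(sum(delays))  # 62.0
```

In deployments with many concurrent workers, adding a small random jitter to each delay is a common way to keep workers from retrying in lockstep. That is a general practice rather than something this article prescribes.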

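When to use retry logic

Implement retry logic for the following error categories:

HTTP errors - retry:
- HTTP 429 (rate limit)
- HTTP 500, 502, 503, 504 (server errors)
- status_code=None (incomplete response download)
Azure SDK network errors - retry:
- ServiceRequestError
- ServiceResponseError
These errors wrap low-level network exceptions such as urllib3.exceptions.ReadTimeoutError, TLS failures, and connection resets.
Python network exceptions - retry:
- ConnectionError
- TimeoutError
- OSError

Don't retry the following errors, because they indicate a client-side problem that needs to be corrected:

HTTP 400 (bad request)
HTTP 401 (unauthorized)
HTTP 422 (unprocessable entity)
Other client errors (4xx status codes other than 429)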
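
The two lists above can be folded into a single predicate that decides whether an error is worth retrying. The following is a sketch under stated assumptions: HTTP-style errors are assumed to expose a `status_code` attribute, and the Azure SDK network errors are matched by class name so the example runs without `azure-core` installed (real code would import them from `azure.core.exceptions`). `should_retry` and `FakeHttpError` are hypothetical names used only for illustration.

```python
_MISSING = object()
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
RETRYABLE_SDK_ERROR_NAMES = {"ServiceRequestError", "ServiceResponseError"}
RETRYABLE_PYTHON_ERRORS = (ConnectionError, TimeoutError, OSError)


def should_retry(error: Exception) -> bool:
    """Return True for transient errors, False for client errors."""
    status_code = getattr(error, "status_code", _MISSING)
    if status_code is not _MISSING:
        # HTTP-style error: None means the response download was incomplete.
        return status_code is None or status_code in RETRYABLE_STATUS_CODES
    if type(error).__name__ in RETRYABLE_SDK_ERROR_NAMES:
        return True
    return isinstance(error, RETRYABLE_PYTHON_ERRORS)


class FakeHttpError(Exception):
    """Stand-in for an SDK HTTP error that carries a status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


for code in (429, 503, None, 400, 401, 422):
    print(code, should_retry(FakeHttpError(code)))
```

Because `OSError` is the base class of many unrelated failures, such as a missing local file, a production version might narrow that branch to the network-related subclasses.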

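Implementation notes

Reset the audio file stream (seek(0)) before each retry attempt.
When you use concurrent workers, be aware that the default HTTP read timeout (300 seconds) might be exceeded under heavy rate limiting.
Be aware that the API might accept a request but time out while generating the response. That failure can surface as an SDK-wrapped network error rather than a standard HTTP error.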
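
Putting the notes together, a retry wrapper might look like the following sketch. It rewinds the stream before every attempt and waits according to the recommended schedule. `transcribe_with_retry` is a hypothetical helper: the `transcribe` argument stands in for whatever call sends the audio (for example, a function that wraps `client.transcribe`), and `should_retry` is any predicate that classifies errors as transient. The demonstration uses a fake function and records the delays instead of sleeping.

```python
import io
import time


def transcribe_with_retry(transcribe, audio_stream, should_retry,
                          delays=(2, 4, 8, 16, 32), sleep=time.sleep):
    """Call transcribe(audio_stream), retrying transient failures with backoff."""
    attempt = 0
    while True:
        # Rewind first: a failed attempt may have consumed part of the stream.
        audio_stream.seek(0)
        try:
            return transcribe(audio_stream)
        except Exception as error:
            # Give up when retries are exhausted or the error isn't transient.
            if attempt >= len(delays) or not should_retry(error):
                raise
            sleep(delays[attempt])
            attempt += 1


# Demonstration: a fake transcribe function that fails twice, then succeeds.
calls = {"count": 0}


def flaky_transcribe(stream):
    calls["count"] += 1
    stream.read()  # Consume the stream, as an upload would.
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


waits = []
result = transcribe_with_retry(
    flaky_transcribe,
    io.BytesIO(b"fake audio bytes"),
    should_retry=lambda error: isinstance(error, TimeoutError),
    sleep=waits.append,  # Record the delays instead of actually sleeping.
)
print(result, waits)
```

Passing `sleep` as a parameter keeps the wrapper testable. In real use, the default `time.sleep` applies the actual waits.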

Last updated on 2026-04-27
