Speech-to-text REST API

As an alternative to the Speech SDK, the Speech service lets you convert speech to text using a REST API. Each accessible endpoint is associated with a region. Your application requires a subscription key for the endpoint you plan to use. The REST API is very limited and should only be used in cases where the Speech SDK cannot.

Before using the speech-to-text REST API, understand:

  • Requests that use the REST API and transmit audio directly can contain at most 60 seconds of audio.
  • The speech-to-text REST API only returns final results. Partial results are not provided.

If your application needs to send longer audio, consider using the Speech SDK or a file-based REST API, such as batch transcription.
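One way to respect the 60-second limit is to measure the audio before sending it. The following is a minimal sketch using Python's standard `wave` module; the names `wav_duration_seconds`, `fits_direct_request`, and `make_silent_wav` are hypothetical helpers introduced here for illustration, and the check only applies to WAV input.

```python
import io
import wave

MAX_DIRECT_AUDIO_SECONDS = 60  # limit stated above for directly transmitted audio


def wav_duration_seconds(wav_file):
    """Return the duration of a WAV file (path or file-like object) in seconds."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_direct_request(wav_file):
    """True if the audio is short enough to send directly to the REST API."""
    return wav_duration_seconds(wav_file) <= MAX_DIRECT_AUDIO_SECONDS


def make_silent_wav(seconds, rate=16000):
    """Build an in-memory 16-bit mono WAV of silence (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(fits_direct_request(make_silent_wav(2)))
```

Audio that fails this check is a candidate for the Speech SDK or batch transcription instead.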

Authentication

Each request requires an authorization header. This table shows which headers are supported for each service:

| Supported authorization headers | Speech-to-text | Text-to-speech |
| --- | --- | --- |
| Ocp-Apim-Subscription-Key | Yes | No |
| Authorization: Bearer | Yes | Yes |

When using the Ocp-Apim-Subscription-Key header, you only need to provide your subscription key. For example:

'Ocp-Apim-Subscription-Key': 'YOUR_SUBSCRIPTION_KEY'

When using the Authorization: Bearer header, you must first make a request to the issueToken endpoint. In this request, you exchange your subscription key for an access token that's valid for 10 minutes. The next few sections explain how to get a token and how to use it.

How to get an access token

To get an access token, make a request to the issueToken endpoint using the Ocp-Apim-Subscription-Key header and your subscription key.

The issueToken endpoint has this format:

https://<REGION_IDENTIFIER>.api.cognitive.azure.cn/sts/v1.0/issueToken

Replace <REGION_IDENTIFIER> with the identifier matching the region of your subscription from this table:

| Geography | Region | Region identifier |
| --- | --- | --- |
| China | China East 2 | chinaeast2 |

Use these samples to create your access token request.

HTTP sample

This example is a simple HTTP request to get a token. Replace YOUR_SUBSCRIPTION_KEY with your Speech service subscription key. If your subscription isn't in the China East 2 region, replace the Host header with your region's host name.

POST /sts/v1.0/issueToken HTTP/1.1
Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY
Host: chinaeast2.api.cognitive.azure.cn
Content-type: application/x-www-form-urlencoded
Content-Length: 0

The body of the response contains the access token in JSON Web Token (JWT) format.

PowerShell sample

This example is a simple PowerShell script to get an access token. Replace YOUR_SUBSCRIPTION_KEY with your Speech service subscription key. Make sure to use the correct endpoint for the region that matches your subscription. This example is currently set to China East 2.

$FetchTokenHeader = @{
  'Content-type'='application/x-www-form-urlencoded';
  'Content-Length'= '0';
  'Ocp-Apim-Subscription-Key' = 'YOUR_SUBSCRIPTION_KEY'
}

# The trailing backtick continues the command onto the next line.
$OAuthToken = Invoke-RestMethod -Method POST -Uri https://chinaeast2.api.cognitive.azure.cn/sts/v1.0/issueToken `
 -Headers $FetchTokenHeader

# show the token received
$OAuthToken

cURL sample

cURL is a command-line tool available in Linux (and in the Windows Subsystem for Linux). This cURL command illustrates how to get an access token. Replace YOUR_SUBSCRIPTION_KEY with your Speech service subscription key. Make sure to use the correct endpoint for the region that matches your subscription. This example is currently set to China East 2.

curl -v -X POST \
 "https://chinaeast2.api.cognitive.azure.cn/sts/v1.0/issueToken" \
 -H "Content-type: application/x-www-form-urlencoded" \
 -H "Content-Length: 0" \
 -H "Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY"

C# sample

This C# class illustrates how to get an access token. Pass your Speech service subscription key when you instantiate the class. If your subscription isn't in the China East 2 region, change the value of FetchTokenUri to match the region for your subscription.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class Authentication
{
    public static readonly string FetchTokenUri =
        "https://chinaeast2.api.cognitive.azure.cn/sts/v1.0/issueToken";
    private string subscriptionKey;
    private string token;

    public Authentication(string subscriptionKey)
    {
        this.subscriptionKey = subscriptionKey;
        this.token = FetchTokenAsync(FetchTokenUri, subscriptionKey).Result;
    }

    public string GetAccessToken()
    {
        return this.token;
    }

    private async Task<string> FetchTokenAsync(string fetchUri, string subscriptionKey)
    {
        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
            UriBuilder uriBuilder = new UriBuilder(fetchUri);

            var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);
            Console.WriteLine("Token Uri: {0}", uriBuilder.Uri.AbsoluteUri);
            return await result.Content.ReadAsStringAsync();
        }
    }
}

Python sample

# The requests module must be installed.
# Run pip install requests if necessary.
import requests

subscription_key = 'REPLACE_WITH_YOUR_KEY'


def get_token(subscription_key):
    fetch_token_url = 'https://chinaeast2.api.cognitive.azure.cn/sts/v1.0/issueToken'
    headers = {
        'Ocp-Apim-Subscription-Key': subscription_key
    }
    response = requests.post(fetch_token_url, headers=headers)
    response.raise_for_status()  # stop here if the token request failed
    access_token = response.text
    print(access_token)
    return access_token

How to use an access token

The access token should be sent to the service as the Authorization: Bearer <TOKEN> header. Each access token is valid for 10 minutes. You can get a new token at any time; however, to minimize network traffic and latency, we recommend using the same token for nine minutes.
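The nine-minute reuse recommendation can be sketched as a small cache that only requests a new token once the reuse window has passed. This is an illustrative sketch, not part of the service API: `TokenCache` is a hypothetical class, and `fetch_token` stands for any callable that returns a fresh token string (for example, a function that calls the issueToken endpoint).

```python
import time

TOKEN_REUSE_SECONDS = 9 * 60  # reuse window recommended above


class TokenCache:
    """Caches an access token and refreshes it after the reuse window."""

    def __init__(self, fetch_token, clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning a new token string
        self._clock = clock              # injectable clock, useful for testing
        self._token = None
        self._fetched_at = None

    def get(self):
        """Return the cached token, fetching a new one if it is too old."""
        now = self._clock()
        if self._token is None or now - self._fetched_at >= TOKEN_REUSE_SECONDS:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

Refreshing at nine minutes rather than ten leaves a margin so that a request started near the end of the window does not carry an expired token.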

Here's a sample HTTP request to the text-to-speech REST API:

POST /cognitiveservices/v1 HTTP/1.1
Authorization: Bearer YOUR_ACCESS_TOKEN
Host: chinaeast2.stt.speech.azure.cn
Content-type: application/ssml+xml
Content-Length: 199
Connection: Keep-Alive

// Message body here...

Regions and endpoints

The endpoint for the REST API has this format:

https://<REGION_IDENTIFIER>.stt.speech.azure.cn/speech/recognition/conversation/cognitiveservices/v1

Replace <REGION_IDENTIFIER> with the identifier matching the region of your subscription from this table:

| Geography | Region | Region identifier |
| --- | --- | --- |
| China | China East 2 | chinaeast2 |

Note

The language parameter must be appended to the URL to avoid receiving a 4xx HTTP error. For example, the URL for US English using the China East 2 endpoint is: https://chinaeast2.stt.speech.azure.cn/speech/recognition/conversation/cognitiveservices/v1?language=en-US

Query parameters

These parameters may be included in the query string of the REST request.

| Parameter | Description | Required / Optional |
| --- | --- | --- |
| language | Identifies the spoken language that is being recognized. See Supported languages. | Required |
| format | Specifies the result format. Accepted values are simple and detailed. Simple results include RecognitionStatus, DisplayText, Offset, and Duration. Detailed responses include four different representations of display text. The default setting is simple. | Optional |
| profanity | Specifies how to handle profanity in recognition results. Accepted values are masked, which replaces profanity with asterisks; removed, which removes all profanity from the result; and raw, which includes the profanity in the result. The default setting is masked. | Optional |
| cid | When using the Custom Speech portal to create custom models, you can use custom models via their Endpoint ID found on the Deployment page. Use the Endpoint ID as the argument to the cid query string parameter. | Optional |
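The query parameters above can be assembled into a request URL as in the following sketch. `build_recognition_url` is a hypothetical helper name; the endpoint format and accepted values are taken from this article, and the validation shown is only one possible design.

```python
from urllib.parse import urlencode

ENDPOINT_TEMPLATE = (
    "https://{region}.stt.speech.azure.cn"
    "/speech/recognition/conversation/cognitiveservices/v1"
)


def build_recognition_url(region, language, result_format=None,
                          profanity=None, cid=None):
    """Build the speech-to-text URL; language is the only required parameter."""
    if not language:
        raise ValueError("language is required")
    params = {"language": language}
    if result_format is not None:
        if result_format not in ("simple", "detailed"):
            raise ValueError("format must be 'simple' or 'detailed'")
        params["format"] = result_format
    if profanity is not None:
        if profanity not in ("masked", "removed", "raw"):
            raise ValueError("profanity must be 'masked', 'removed', or 'raw'")
        params["profanity"] = profanity
    if cid is not None:
        params["cid"] = cid  # Endpoint ID of a custom model
    return ENDPOINT_TEMPLATE.format(region=region) + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_recognition_url("chinaeast2", "en-US", result_format="detailed"))
```

Omitted optional parameters are left out of the URL so that the service defaults (simple and masked) apply.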

Request headers

This table lists required and optional headers for speech-to-text requests.

| Header | Description | Required / Optional |
| --- | --- | --- |
| Ocp-Apim-Subscription-Key | Your Speech service subscription key. | Either this header or Authorization is required. |
| Authorization | An authorization token preceded by the word Bearer. For more information, see Authentication. | Either this header or Ocp-Apim-Subscription-Key is required. |
| Content-type | Describes the format and codec of the provided audio data. Accepted values are audio/wav; codecs=audio/pcm; samplerate=16000 and audio/ogg; codecs=opus. | Required |
| Transfer-Encoding | Specifies that chunked audio data is being sent, rather than a single file. Only use this header if chunking audio data. | Optional |
| Expect | If using chunked transfer, send Expect: 100-continue. The Speech service acknowledges the initial request and awaits additional data. | Required if sending chunked audio data. |
| Accept | If provided, it must be application/json. The Speech service provides results in JSON. Some request frameworks provide an incompatible default value. It is good practice to always include Accept. | Optional, but recommended. |
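The header rules in the table can be sketched as a small function that returns a header dictionary. `build_request_headers` is a hypothetical helper; preferring the access token when both credentials are supplied is a choice made for this sketch, not a documented requirement. Note that many HTTP libraries set Transfer-Encoding themselves when the body is streamed, in which case that entry should be dropped.

```python
ACCEPTED_CONTENT_TYPES = (
    "audio/wav; codecs=audio/pcm; samplerate=16000",
    "audio/ogg; codecs=opus",
)


def build_request_headers(content_type, subscription_key=None,
                          access_token=None, chunked=False):
    """Return speech-to-text request headers following the table above."""
    if content_type not in ACCEPTED_CONTENT_TYPES:
        raise ValueError("unsupported Content-type: %r" % content_type)
    if subscription_key is None and access_token is None:
        raise ValueError("a subscription key or an access token is required")

    # Accept is optional but recommended, so it is always included here.
    headers = {"Content-type": content_type, "Accept": "application/json"}
    if access_token is not None:
        headers["Authorization"] = "Bearer " + access_token
    else:
        headers["Ocp-Apim-Subscription-Key"] = subscription_key
    if chunked:
        headers["Transfer-Encoding"] = "chunked"
        headers["Expect"] = "100-continue"  # required with chunked audio
    return headers
```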

Audio formats

Audio is sent in the body of the HTTP POST request. It must be in one of the formats in this table:

| Format | Codec | Bit rate | Sample rate |
| --- | --- | --- | --- |
| WAV | PCM | 256 kbps | 16 kHz, mono |
| OGG | OPUS | 256 kbps | 16 kHz, mono |

Note

The above formats are supported through the REST API and WebSocket in the Speech service. The Speech SDK currently supports the WAV format with PCM codec as well as other formats.
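For the WAV row, the bit rate follows from the other values: 16,000 samples per second × 16 bits per sample × 1 channel = 256,000 bits per second, or 256 kbps. The 16-bit sample width is inferred from that arithmetic; the table does not state it directly. The sketch below checks a WAV file against these values with Python's standard `wave` module; both function names are hypothetical.

```python
import io
import wave


def pcm_bit_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def matches_documented_wav_format(wav_file):
    """True if the WAV is 16 kHz, mono, 16-bit (assumed from the 256 kbps rate)."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # sample width in bytes
        )


if __name__ == "__main__":
    print(pcm_bit_rate_kbps(16000, 16, 1))
```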

Sample request

The sample below includes the hostname and required headers. Note that the service also expects audio data, which is not included in this sample. As mentioned earlier, chunking is recommended but not required.

POST /speech/recognition/conversation/cognitiveservices/v1?language=en-US&format=detailed HTTP/1.1
Accept: application/json;text/xml
Content-Type: audio/wav; codecs=audio/pcm; samplerate=16000
Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY
Host: chinaeast2.stt.speech.azure.cn
Transfer-Encoding: chunked
Expect: 100-continue

HTTP status codes

The HTTP status code for each response indicates success or common errors.

| HTTP status code | Description | Possible reason |
| --- | --- | --- |
| 100 | Continue | The initial request has been accepted. Proceed with sending the rest of the data. (Used with chunked transfer) |
| 200 | OK | The request was successful; the response body is a JSON object. |
| 400 | Bad request | Language code not provided, not a supported language, invalid audio file, etc. |
| 401 | Unauthorized | Subscription key or authorization token is invalid in the specified region, or invalid endpoint. |
| 403 | Forbidden | Missing subscription key or authorization token. |
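When logging or surfacing failures, the table above can be turned into a simple lookup. This is a minimal sketch; `describe_status` is a hypothetical helper, and codes not listed in the table get a generic message rather than a guessed meaning.

```python
STATUS_REASONS = {
    100: "Continue: initial request accepted; send the rest of the data.",
    200: "OK: request succeeded; the response body is a JSON object.",
    400: "Bad request: language code missing or unsupported, or invalid audio file.",
    401: "Unauthorized: key or token invalid for the region, or invalid endpoint.",
    403: "Forbidden: missing subscription key or authorization token.",
}


def describe_status(status_code):
    """Return the documented reason for a status code, if it is listed."""
    return STATUS_REASONS.get(status_code, "Status code not listed in the table.")


if __name__ == "__main__":
    print(describe_status(401))
```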

Chunked transfer

Chunked transfer (Transfer-Encoding: chunked) can help reduce recognition latency. It allows the Speech service to begin processing the audio file while it is transmitted. The REST API does not provide partial or interim results.

This code sample shows how to send audio in chunks. Only the first chunk should contain the audio file's header. request is an HttpWebRequest object connected to the appropriate REST endpoint. audioFile is the path to an audio file on disk.

var request = (HttpWebRequest)HttpWebRequest.Create(requestUri);
request.SendChunked = true;
request.Accept = @"application/json;text/xml";
request.Method = "POST";
request.ProtocolVersion = HttpVersion.Version11;
request.Host = host;
request.ContentType = @"audio/wav; codecs=audio/pcm; samplerate=16000";
request.Headers["Ocp-Apim-Subscription-Key"] = "YOUR_SUBSCRIPTION_KEY";
request.AllowWriteStreamBuffering = false;

using (var fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
{
    // Open a request stream and write 1024 byte chunks in the stream one at a time.
    byte[] buffer = null;
    int bytesRead = 0;
    using (var requestStream = request.GetRequestStream())
    {
        // Read 1024 raw bytes from the input audio file.
        buffer = new Byte[checked((uint)Math.Min(1024, (int)fs.Length))];
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
        {
            requestStream.Write(buffer, 0, bytesRead);
        }

        requestStream.Flush();
    }
}
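The same chunking idea can be sketched in Python as a generator that reads the audio in 1024-byte pieces. `read_chunks` is a hypothetical helper. With the third-party requests library, passing a generator as the `data` argument of `requests.post` sends the body with chunked transfer encoding; that call is not shown here so the block stays self-contained.

```python
import io


def read_chunks(stream, chunk_size=1024):
    """Yield successive chunks from a binary stream until it is exhausted.

    Because the stream is read from its start, the audio file's header
    ends up in the first chunk only.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    audio = io.BytesIO(b"\x00" * 2500)  # stand-in for an open audio file
    print([len(chunk) for chunk in read_chunks(audio)])
```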

Response parameters

Results are provided as JSON. The simple format includes these top-level fields.

| Parameter | Description |
| --- | --- |
| RecognitionStatus | Status, such as Success for successful recognition. See the next table. |
| DisplayText | The recognized text after capitalization, punctuation, inverse text normalization (conversion of spoken text to shorter forms, such as 200 for "two hundred" or "Dr. Smith" for "doctor smith"), and profanity masking. Present only on success. |
| Offset | The time (in 100-nanosecond units) at which the recognized speech begins in the audio stream. |
| Duration | The duration (in 100-nanosecond units) of the recognized speech in the audio stream. |

The RecognitionStatus field may contain these values:

| Status | Description |
| --- | --- |
| Success | The recognition was successful and the DisplayText field is present. |
| NoMatch | Speech was detected in the audio stream, but no words from the target language were matched. Usually means the recognition language is a different language from the one the user is speaking. |
| InitialSilenceTimeout | The start of the audio stream contained only silence, and the service timed out waiting for speech. |
| BabbleTimeout | The start of the audio stream contained only noise, and the service timed out waiting for speech. |
| Error | The recognition service encountered an internal error and could not continue. Try again if possible. |

Note

If the audio consists only of profanity, and the profanity query parameter is set to removed, the service does not return a speech result.

The detailed format includes additional forms of recognized results. When using the detailed format, DisplayText is provided as Display for each result in the NBest list.

Each object in the NBest list can include:

| Parameter | Description |
| --- | --- |
| Confidence | The confidence score of the entry, from 0.0 (no confidence) to 1.0 (full confidence). |
| Lexical | The lexical form of the recognized text: the actual words recognized. |
| ITN | The inverse-text-normalized ("canonical") form of the recognized text, with phone numbers, numbers, abbreviations ("doctor smith" to "dr smith"), and other transformations applied. |
| MaskedITN | The ITN form with profanity masking applied, if requested. |
| Display | The display form of the recognized text, with punctuation and capitalization added. This parameter is the same as DisplayText provided when format is set to simple. |
| AccuracyScore | The score indicating the pronunciation accuracy of the given speech. |
| FluencyScore | The score indicating the fluency of the given speech. |
| CompletenessScore | The score indicating the completeness of the given speech, calculated as the ratio of pronounced words to the entire input. |
| PronScore | The overall score indicating the pronunciation quality of the given speech. It is a weighted combination of AccuracyScore, FluencyScore, and CompletenessScore. |
| ErrorType | Indicates whether a word is omitted, inserted, or badly pronounced, compared to ReferenceText. Possible values are None (meaning no error on this word), Omission, Insertion, and Mispronunciation. |

Sample responses

A typical response for simple recognition:

{
  "RecognitionStatus": "Success",
  "DisplayText": "Remind me to buy 5 pencils.",
  "Offset": "1236645672289",
  "Duration": "1236645672289"
}
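A simple response like the one above might be handled as in this sketch. `parse_simple_result` is a hypothetical helper; it converts Offset and Duration from 100-nanosecond units to seconds and returns the text only when RecognitionStatus is Success. Treating missing Offset or Duration as 0 is an assumption made for the sketch.

```python
import json

TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds


def parse_simple_result(body):
    """Parse a simple-format response body (a JSON string) into a dict."""
    result = json.loads(body)
    status = result["RecognitionStatus"]
    return {
        "status": status,
        # DisplayText is present only on success.
        "text": result.get("DisplayText") if status == "Success" else None,
        # int() accepts both numeric values and numeric strings.
        "offset_seconds": int(result.get("Offset", 0)) / TICKS_PER_SECOND,
        "duration_seconds": int(result.get("Duration", 0)) / TICKS_PER_SECOND,
    }


if __name__ == "__main__":
    sample = '{"RecognitionStatus": "Success", "DisplayText": "Hi.", "Offset": "5000000", "Duration": "10000000"}'
    print(parse_simple_result(sample))
```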

A typical response for detailed recognition:

{
  "RecognitionStatus": "Success",
  "Offset": "1236645672289",
  "Duration": "1236645672289",
  "NBest": [
      {
        "Confidence" : "0.87",
        "Lexical" : "remind me to buy five pencils",
        "ITN" : "remind me to buy 5 pencils",
        "MaskedITN" : "remind me to buy 5 pencils",
        "Display" : "Remind me to buy 5 pencils."
      }
  ]
}
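With a detailed response, a client typically picks one entry from the NBest list. The sketch below selects the entry with the highest Confidence; `best_hypothesis` is a hypothetical helper, and `float()` is applied because the sample shows Confidence as a string.

```python
import json


def best_hypothesis(body):
    """Return the NBest entry with the highest Confidence, or None."""
    result = json.loads(body)
    if result.get("RecognitionStatus") != "Success":
        return None
    candidates = result.get("NBest", [])
    if not candidates:
        return None
    return max(candidates, key=lambda entry: float(entry["Confidence"]))


if __name__ == "__main__":
    sample = json.dumps({
        "RecognitionStatus": "Success",
        "NBest": [
            {"Confidence": "0.87", "Display": "Remind me to buy 5 pencils."},
            {"Confidence": "0.41", "Display": "Remind me to buy 5 pens."},
        ],
    })
    print(best_hypothesis(sample)["Display"])
```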
