Speech service for telephony data

Telephony data that is generated through landlines, mobile phones, and radios is typically low quality and narrowband, in the range of 8 kHz, which creates challenges when converting speech to text. The latest speech recognition models from the Speech service excel at transcribing this telephony data, even in cases when the data is difficult for a human to understand. These models are trained with large volumes of telephony data, and have best-in-market recognition accuracy, even in noisy environments.

A common scenario for speech-to-text is transcribing large volumes of telephony data that may come from various systems, such as interactive voice response (IVR). The audio these systems provide can be stereo or mono, and raw, with little to no post-processing done on the signal. Using the Speech service and the Unified speech model, a business can get high-quality transcriptions, whatever systems are used to capture audio.

Telephony data can be used to better understand your customers' needs, identify new marketing opportunities, or evaluate the performance of call center agents. After the data is transcribed, a business can use the output for purposes such as improved telemetry, identifying key phrases, or analyzing customer sentiment.

The technologies outlined in this page are used by Microsoft internally for various support call processing services, both in real-time and batch mode.

Let's review some of the technology and related features the Speech service offers.


The Speech service Unified model is trained with diverse data and offers a single-model solution to a number of scenarios, from dictation to telephony analytics.

Azure Technology for Call Centers

Beyond the functional aspects of the Speech service features, their primary purpose, when applied to the call center, is to improve the customer experience. Three clear domains exist in this regard:

  • Post-call analytics, which is essentially batch processing of call recordings after the call.
  • Real-time analytics, which is processing of the audio signal to extract various insights as the call is taking place (with sentiment being a prominent use case).
  • Voice assistants (bots), either driving the dialogue between the customer and the bot in an attempt to solve the customer's issue with no agent participation, or applying artificial intelligence (AI) protocols to assist the agent.

A typical architecture for implementing the batch scenario is depicted in the picture below. (Diagram: call center transcription architecture.)

Speech Analytics Technology Components

Whether the domain is post-call or real-time, Azure offers a set of mature and emerging technologies to improve the customer experience.

Speech to text (STT)

Speech-to-text is the most sought-after feature in any call center solution. Because many of the downstream analytics processes rely on transcribed text, the word error rate (WER) is of utmost importance. One of the key challenges in call center transcription is the noise that's prevalent in the call center (for example, other agents speaking in the background), the rich variety of language locales and dialects, as well as the low quality of the actual telephone signal. WER is highly correlated with how well the acoustic and language models are trained for a given locale, so the ability to customize the model to your locale is important. Our latest Unified version 4.x models are the solution to both transcription accuracy and latency. Trained with tens of thousands of hours of acoustic data and billions of items of lexical information, Unified models are among the most accurate models in the market for transcribing call center data.
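
WER has a standard definition worth making concrete: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, against the reference "hello world how are you", the hypothesis "hello word how you" has one substitution and one deletion, giving 2 edits over 5 reference words, or a WER of 40 percent.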


Sentiment

Gauging whether the customer had a good experience is one of the most important areas of speech analytics when applied to the call center space. Our Batch Transcription API offers sentiment analysis per utterance. You can aggregate the set of values obtained as part of a call transcript to determine the sentiment of the call for both your agents and the customer.
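
The per-utterance aggregation described above can be sketched in a few lines. The input shape below (a `channel` plus a `sentiment` score dict per utterance) is an illustrative assumption, not the exact schema the Batch Transcription API returns:

```python
from collections import defaultdict

def average_sentiment(phrases):
    """Average the positive-sentiment score per audio channel.

    `phrases` mimics per-utterance transcript results: each item carries a
    `channel` (here, 0 = agent, 1 = customer) and a `sentiment` score dict.
    Field names are assumptions for illustration only.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for p in phrases:
        totals[p["channel"]] += p["sentiment"]["positive"]
        counts[p["channel"]] += 1
    return {ch: totals[ch] / counts[ch] for ch in totals}
```

A call-level score per speaker then falls out of the per-utterance values, which is exactly the aggregation a supervisor dashboard would display.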

Silence (non-talk)

It is not uncommon for 35 percent of a support call to be what we call non-talk time. Some scenarios in which non-talk occurs are: agents looking up prior case history with a customer, agents using tools that allow them to access the customer's desktop and perform functions, customers sitting on hold waiting for a transfer, and so on. It is extremely important to gauge when silence is occurring in a call, as there are a number of important customer sensitivities around these types of scenarios and where they occur in the call.
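
Measuring non-talk time reduces to interval arithmetic over the recognized utterances. A sketch, assuming each utterance comes with an offset and duration in seconds; segments from the two channels of a stereo call may overlap, so they are merged first:

```python
def non_talk_ratio(call_duration_s, segments):
    """Fraction of a call with no speech.

    `segments` is a list of (offset_s, duration_s) speech spans, e.g. taken
    from per-utterance transcript timestamps; spans may overlap across channels.
    """
    spans = sorted((start, start + dur) for start, dur in segments)
    talk, cur_start, cur_end = 0.0, None, None
    for s, e in spans:
        if cur_end is None or s > cur_end:      # gap: close the current merged span
            if cur_end is not None:
                talk += cur_end - cur_start
            cur_start, cur_end = s, e
        else:                                   # overlap: extend the merged span
            cur_end = max(cur_end, e)
    if cur_end is not None:
        talk += cur_end - cur_start
    return 1.0 - talk / call_duration_s
```

For a 100-second call with speech at 0–30 s, 20–40 s, and 60–85 s, the merged talk time is 65 seconds, so 35 percent of the call is non-talk.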


Translation

Some companies are experimenting with providing translated transcripts from foreign-language support calls so that delivery managers can understand the worldwide experience of their customers. Our translation capabilities are unsurpassed. We can translate audio-to-audio or audio-to-text for a large number of locales.

Text to Speech

Text-to-speech is another important area in implementing bots that interact with the customers. The typical pathway is that the customer speaks, their voice is transcribed to text, the text is analyzed for intents, a response is synthesized based on the recognized intent, and then an asset is either surfaced to the customer or a synthesized voice response is generated. Of course, all of this has to occur quickly; low latency is thus an important component in the success of these systems.
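
The pathway above can be sketched end to end with the understanding step stubbed out. The intent table and trigger phrases below are toy assumptions standing in for a real language-understanding service such as LUIS:

```python
# Toy intent catalog -- assumptions for illustration, not a real LUIS model.
INTENT_PHRASES = {
    "reset_password": ["reset my password", "forgot my password"],
    "billing": ["my bill", "invoice", "charged"],
}
INTENT_RESPONSES = {
    "reset_password": "I can help you reset your password.",
    "billing": "Let me pull up your latest invoice.",
}

def recognize_intent(utterance: str):
    """Match the transcribed utterance against known trigger phrases."""
    text = utterance.lower()
    for intent, phrases in INTENT_PHRASES.items():
        if any(p in text for p in phrases):
            return intent
    return None

def respond(utterance: str) -> str:
    """Pick the response text that would be sent to text-to-speech synthesis."""
    intent = recognize_intent(utterance)
    return INTENT_RESPONSES.get(intent, "Let me connect you to an agent.")
```

In a real bot, `respond` sits between speech-to-text and text-to-speech; its output string is what the synthesis step turns into audio, and the fallback branch is what escalates the call to a human agent.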

Our end-to-end latency is considerably low for the various technologies involved, such as speech-to-text, LUIS, Bot Framework, and text-to-speech.

Our new voices are also indistinguishable from human voices. You can use our voices to give your bot its unique personality.

Search

Another staple of analytics is to identify interactions where a specific event or experience has occurred. This is typically done with one of two approaches: either an ad hoc search, where the user simply types a phrase and the system responds, or a more structured query, where an analyst can create a set of logical statements that identify a scenario in a call, and then each call can be indexed against that set of queries. A good search example is the ubiquitous compliance statement "this call shall be recorded for quality purposes...". Many companies want to make sure that their agents are providing this disclaimer to customers before the call is actually recorded. Most analytics systems have the ability to trend the behaviors found by query/search algorithms, and this reporting of trends is ultimately one of the most important functions of an analytics system. Through the Cognitive Services directory, your end-to-end solution can be significantly enhanced with indexing and search capabilities.
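
The structured-query approach can be illustrated with a naive indexer: each call transcript is checked against a set of named queries, here reduced to lowercase substring matching. The query names and phrases are illustrative assumptions; a production system would use a real search service:

```python
def index_calls(transcripts, queries):
    """Index each call transcript against a set of named queries.

    `transcripts` maps call IDs to transcript text; `queries` maps a query
    name (e.g. "compliance") to the phrase it looks for. Returns, per call,
    the list of query names that matched.
    """
    hits = {}
    for call_id, text in transcripts.items():
        lowered = text.lower()
        hits[call_id] = [name for name, phrase in queries.items()
                         if phrase.lower() in lowered]
    return hits
```

Trending then becomes counting, per day or per agent, how often each query name appears in the index, which is exactly the reporting described above.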

Key Phrase Extraction

This area is one of the more challenging analytics applications, and one that is benefiting from the application of AI and machine learning. The primary scenario in this case is to infer customer intent. Why is the customer calling? What is the customer's problem? Why did the customer have a negative experience? Our Text Analytics service provides a set of analytics out of the box for quickly upgrading your end-to-end solution to extract those important keywords or phrases.
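
As a rough picture of what key phrase extraction produces, here is a naive frequency-based stand-in. The real Text Analytics service uses far more sophisticated models; the stopword list below is an arbitrary assumption for the sketch:

```python
from collections import Counter

# Tiny stopword list -- an arbitrary assumption, not the service's behavior.
STOPWORDS = {"the", "a", "an", "is", "was", "why", "my", "i", "to", "and", "of", "it"}

def key_phrases(utterance: str, top_n: int = 3):
    """Rank non-stopword tokens by frequency as a toy key-phrase extractor."""
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]
```

On an utterance like "My router keeps dropping the connection, the router restarts", the dominant token ("router") surfaces first, hinting at the customer's problem even before an intent model runs.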

Let's now have a look at the batch processing and real-time pipelines for speech recognition in a bit more detail.

Batch transcription of call center data

For transcribing bulk audio, we developed the Batch Transcription API. The Batch Transcription API was developed to transcribe large amounts of audio data asynchronously. With regard to transcribing call center data, our solution is based on these pillars:

  • Accuracy - With fourth-generation Unified models, we offer unsurpassed transcription quality.
  • Latency - We understand that when doing bulk transcriptions, the transcriptions are needed quickly. The transcription jobs initiated via the Batch Transcription API are queued immediately, and once a job starts running, it's performed faster than real-time transcription.
  • Security - We understand that calls may contain sensitive data. Rest assured that security is one of our highest priorities. Our service has obtained ISO, SOC, HIPAA, and PCI certifications.

Call centers generate large volumes of audio data on a daily basis. If your business stores telephony data in a central location, such as Azure Storage, you can use the Batch Transcription API to asynchronously request and receive transcriptions.

A typical solution uses these services:

  • The Speech service is used to transcribe speech to text. A standard subscription (S0) for the Speech service is required to use the Batch Transcription API. Free subscriptions (F0) will not work.
  • Azure Storage is used to store the telephony data and the transcripts returned by the Batch Transcription API. This storage account should use notifications, specifically for when new files are added. These notifications are used to trigger the transcription process.
  • Azure Functions is used to create the shared access signature (SAS) URI for each recording and to trigger the HTTP POST request that starts a transcription. Additionally, Azure Functions is used to create requests to retrieve and delete transcriptions by using the Batch Transcription API.
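
Putting these pieces together, the function's job is essentially to build and send one HTTP POST per recording. The sketch below targets the v3.0 REST endpoint; the region, key, and property values are placeholders, and the endpoint and body shape should be confirmed against the current Batch Transcription API reference:

```python
import json
import urllib.request

REGION = "westus"            # assumption: your Speech resource's region
SUBSCRIPTION_KEY = "<key>"   # placeholder, never hard-code a real key

def build_transcription_request(sas_uri, locale="en-US", name="call-center batch"):
    """Build the URL and JSON body for a Batch Transcription v3.0 POST."""
    url = (f"https://{REGION}.api.cognitive.microsoft.com/"
           "speechtotext/v3.0/transcriptions")
    body = {
        "contentUrls": [sas_uri],   # SAS URI produced for the recording
        "locale": locale,
        "displayName": name,
        "properties": {"diarizationEnabled": False},  # illustrative choice
    }
    return url, body

def submit(sas_uri):
    """Send the request; requires a valid key and network access."""
    url, body = build_transcription_request(sas_uri)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
                 "Content-Type": "application/json"})
    return urllib.request.urlopen(req)
```

The same pattern, with GET and DELETE requests against the returned transcription URL, covers the retrieve-and-delete step the function also performs.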

Internally, we use the above technologies to support Microsoft customer calls in batch mode. (Diagram: technologies used to support Microsoft customer calls in batch mode.)

Real-time transcription for call center data

Some businesses are required to transcribe conversations in real-time. Real-time transcription can be used to identify keywords and trigger searches for content and resources relevant to the conversation, to monitor sentiment, to improve accessibility, or to provide translations for customers and agents who aren't native speakers.

For scenarios that require real-time transcription, we recommend using the Speech SDK. Currently, speech-to-text is available in more than 20 languages, and the SDK is available in C++, C#, Java, Python, Node.js, Objective-C, and JavaScript. Samples are available for each language on GitHub. For the latest news and updates, see the release notes.
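
A minimal continuous-recognition sketch with the Python Speech SDK, extended with the keyword flagging mentioned above. The keyword list is an arbitrary assumption, and the import is guarded so the sketch loads even without the SDK installed (`pip install azure-cognitiveservices-speech`):

```python
try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:          # keep the sketch importable without the SDK
    speechsdk = None

# Assumption: phrases worth flagging in a live support call.
KEYWORDS = {"refund", "cancel", "supervisor"}

def flag_keywords(utterance: str, keywords=KEYWORDS):
    """Return which watched keywords appear in a recognized utterance."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    return sorted(keywords & words)

def transcribe_stream(key: str, region: str, wav_path: str):
    """Continuously recognize a telephony WAV file and flag keywords.

    A sketch only; a real deployment would feed a live audio stream
    instead of a file, and would need valid credentials.
    """
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=config,
                                            audio_config=audio)

    def on_recognized(evt):
        hits = flag_keywords(evt.result.text)
        if hits:
            print(f"keywords {hits}: {evt.result.text}")

    recognizer.recognized.connect(on_recognized)
    recognizer.start_continuous_recognition()
```

The `recognized` event fires once per final utterance, which is the natural place to hang keyword spotting, sentiment calls, or agent-assist lookups.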

Internally, we use the above technologies to analyze Microsoft customer calls in real-time as they happen, as illustrated in the following diagram.

(Diagram: batch architecture.)

A word on IVRs

The Speech service can be easily integrated into any solution by using either the Speech SDK or the REST API. However, call center transcription may require additional technologies. Typically, a connection between an IVR system and Azure is required. Although we do not offer such components, here is a description of what a connection to an IVR entails.

Several IVR or telephony service products (such as Genesys or AudioCodes) offer integration capabilities that can be leveraged to enable inbound and outbound audio pass-through to an Azure service. Basically, a custom Azure service might provide a specific interface to define phone call sessions (such as call start or call end) and expose a WebSocket API to receive inbound streamed audio that is used with the Speech service. Outbound responses, such as conversation transcriptions or connections with the Bot Framework, can be synthesized with Microsoft's text-to-speech service and returned to the IVR for playback.
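
Whatever transport the IVR integration uses, the inbound leg ultimately streams raw telephony PCM in small frames. A sketch of that framing step, assuming 8 kHz, 16-bit mono audio cut into 100 ms frames (both figures are illustrative choices, not protocol requirements):

```python
def frame_pcm(pcm: bytes, sample_rate=8000, sample_width=2, frame_ms=100):
    """Split raw mono PCM (e.g. 8 kHz, 16-bit telephony audio) into
    fixed-duration frames suitable for pushing over a WebSocket."""
    frame_bytes = sample_rate * sample_width * frame_ms // 1000
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

At 8 kHz and 16 bits per sample, one second of audio is 16,000 bytes, so each 100 ms frame is 1,600 bytes; the receiving service can push frames of this size straight into its speech-to-text input stream.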

Another scenario is direct integration with Session Initiation Protocol (SIP). An Azure service connects to a SIP server, thus getting an inbound stream and an outbound stream, which are used for the speech-to-text and text-to-speech phases. To connect to a SIP server, there are commercial software offerings, such as the Ozeki SDK or the Teams calling and meetings API (currently in beta), that are designed to support this type of scenario for audio calls.

Customize existing experiences

The Speech service works well with built-in models. However, you may want to further customize and tune the experience for your product or environment. Customization options range from acoustic model tuning to unique voice fonts for your brand. After you've built a custom model, you can use it with any of the Speech service features in real-time or batch mode.

  • Speech-to-text, acoustic model - Create a custom acoustic model for applications, tools, or devices that are used in particular environments, like in a car or on a factory floor, each with specific recording conditions. Examples include accented speech, specific background noises, or using a specific microphone for recording.
  • Speech-to-text, language model - Create a custom language model to improve transcription of industry-specific vocabulary and grammar, such as medical terminology or IT jargon.
  • Speech-to-text, pronunciation model - With a custom pronunciation model, you can define the phonetic form and display of a word or term. It's useful for handling customized terms, such as product names or acronyms. All you need to get started is a pronunciation file, which is a simple .txt file.
  • Text-to-speech, voice font - Custom voice fonts allow you to create a recognizable, one-of-a-kind voice for your brand. It only takes a small amount of data to get started. The more data that you provide, the more natural and human-like your voice font will sound.
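
As a sketch of the pronunciation file mentioned above, each line pairs a display form with its spoken form, separated by a tab. Treat the exact format as something to verify against the Custom Speech documentation before uploading:

```python
def pronunciation_lines(entries):
    """Render (display form, spoken form) pairs as tab-separated lines,
    one per line, as expected of a pronunciation .txt file."""
    return "".join(f"{display}\t{spoken}\n" for display, spoken in entries)
```

Writing the result to a UTF-8 `.txt` file, e.g. `pronunciation_lines([("CNTK", "c n t k")])`, yields the file you would upload alongside your custom model data.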

Sample code

Sample code is available on GitHub for each of the Speech service features. These samples cover common scenarios, such as reading audio from a file or stream, continuous and single-shot recognition, and working with custom models. Use these links to view SDK and REST samples:

Reference docs

Next steps