Use codec compressed audio input with the Speech SDK

The Speech service SDK Compressed Audio Input Stream API provides a way to stream compressed audio to the Speech service using either a PullStream or a PushStream.

Streaming compressed input audio is currently supported for C#, C++, Java, and Python on Windows (UWP applications aren't supported) and Linux (Ubuntu 16.04, Ubuntu 18.04, Debian 9, RHEL 7/8, CentOS 7/8). It is also supported for Java on Android.

  • Speech SDK version 1.10.0 or later is required for RHEL 8 and CentOS 8.
  • Speech SDK version 1.11.0 or later is required for Windows.

The default audio streaming format is WAV (16 kHz or 8 kHz, 16-bit, mono PCM). Outside of WAV/PCM, the compressed input formats listed below are also supported. Additional configuration is needed to enable them.

  • MP3
  • OPUS/OGG
  • FLAC
  • ALAW in WAV container
  • MULAW in WAV container
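For reference, the default uncompressed format implies a fixed input bandwidth. The arithmetic can be sketched in plain Python (independent of the SDK):

```python
# Default (uncompressed) streaming input: 16 kHz or 8 kHz, 16-bit, mono PCM.
# Bytes per second = sample rate * bytes per sample * channels.
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    return sample_rate_hz * (bits_per_sample // 8) * channels

print(pcm_bytes_per_second(16000))  # 32000 bytes/s at 16 kHz
print(pcm_bytes_per_second(8000))   # 16000 bytes/s at 8 kHz
```

Compressed formats such as OPUS or MP3 reduce this on-the-wire rate considerably, which is the main reason to stream compressed input.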

Prerequisites

Handling compressed audio is implemented using GStreamer. For licensing reasons, GStreamer binaries aren't compiled into, or linked with, the Speech SDK. You need to install several dependencies and plugins; see Installing on Windows. The GStreamer binaries must be on the system path so that the Speech SDK can load them at runtime. If the Speech SDK can find libgstreamer-1.0-0.dll at runtime, the GStreamer binaries are on the system path.
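As a quick sanity check, you can scan the PATH environment variable for that DLL yourself. This is a minimal sketch, not part of the SDK; libgstreamer-1.0-0.dll is the file name the Speech SDK looks for on Windows:

```python
import os

def find_on_path(filename: str):
    """Return the first PATH directory containing filename, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None

print(find_on_path("libgstreamer-1.0-0.dll") or "GStreamer binaries not on PATH")
```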

Handling compressed audio is implemented using GStreamer. For licensing reasons, GStreamer binaries aren't compiled into, or linked with, the Speech SDK. You need to install several dependencies and plugins:

sudo apt install libgstreamer1.0-0 \
gstreamer1.0-plugins-base \
gstreamer1.0-plugins-good \
gstreamer1.0-plugins-bad \
gstreamer1.0-plugins-ugly

Handling compressed audio is implemented using GStreamer. For licensing reasons, GStreamer binaries aren't compiled into, or linked with, the Speech SDK. Instead, you need to use the prebuilt binaries for Android. To download the prebuilt libraries, see Installing for Android development.

libgstreamer_android.so is required. Make sure that your GStreamer plugins are linked into libgstreamer_android.so:

GSTREAMER_PLUGINS := coreelements app audioconvert mpg123 \
    audioresample audioparsers ogg opusparse \
    opus wavparse alaw mulaw flac

Example Android.mk and Application.mk files are provided below. Follow these steps to create the GStreamer shared object, libgstreamer_android.so.

# Android.mk
LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)

LOCAL_MODULE    := dummy
LOCAL_SHARED_LIBRARIES := gstreamer_android
LOCAL_LDLIBS := -llog
include $(BUILD_SHARED_LIBRARY)

ifndef GSTREAMER_ROOT_ANDROID
$(error GSTREAMER_ROOT_ANDROID is not defined!)
endif

ifndef APP_BUILD_SCRIPT
$(error APP_BUILD_SCRIPT is not defined!)
endif

ifndef TARGET_ARCH_ABI
$(error TARGET_ARCH_ABI is not defined!)
endif

ifeq ($(TARGET_ARCH_ABI),armeabi)
GSTREAMER_ROOT        := $(GSTREAMER_ROOT_ANDROID)/arm
else ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
GSTREAMER_ROOT        := $(GSTREAMER_ROOT_ANDROID)/armv7
else ifeq ($(TARGET_ARCH_ABI),arm64-v8a)
GSTREAMER_ROOT        := $(GSTREAMER_ROOT_ANDROID)/arm64
else ifeq ($(TARGET_ARCH_ABI),x86)
GSTREAMER_ROOT        := $(GSTREAMER_ROOT_ANDROID)/x86
else ifeq ($(TARGET_ARCH_ABI),x86_64)
GSTREAMER_ROOT        := $(GSTREAMER_ROOT_ANDROID)/x86_64
else
$(error Target arch ABI not supported: $(TARGET_ARCH_ABI))
endif

GSTREAMER_NDK_BUILD_PATH  := $(GSTREAMER_ROOT)/share/gst-android/ndk-build/
include $(GSTREAMER_NDK_BUILD_PATH)/plugins.mk
GSTREAMER_PLUGINS         :=  coreelements app audioconvert mpg123 \
    audioresample audioparsers ogg \
    opusparse opus wavparse alaw mulaw flac
GSTREAMER_EXTRA_LIBS      := -liconv
include $(GSTREAMER_NDK_BUILD_PATH)/gstreamer-1.0.mk
# Application.mk
APP_STL = c++_shared
APP_PLATFORM = android-21
APP_BUILD_SCRIPT = Android.mk

You can build libgstreamer_android.so using the following commands on Ubuntu 16.04 or 18.04. The command lines below have only been tested with GStreamer Android version 1.14.4 and Android NDK r16b.

# Assuming wget and unzip already installed on the system
mkdir buildLibGstreamer
cd buildLibGstreamer
wget https://dl.google.com/android/repository/android-ndk-r16b-linux-x86_64.zip
unzip -q -o android-ndk-r16b-linux-x86_64.zip
export PATH=$PATH:$(pwd)/android-ndk-r16b
export NDK_PROJECT_PATH=$(pwd)/android-ndk-r16b
wget https://gstreamer.freedesktop.org/data/pkg/android/1.14.4/gstreamer-1.0-android-universal-1.14.4.tar.bz2
mkdir gstreamer_android
tar -xjf gstreamer-1.0-android-universal-1.14.4.tar.bz2 -C $(pwd)/gstreamer_android/
export GSTREAMER_ROOT_ANDROID=$(pwd)/gstreamer_android

mkdir gstreamer
# Copy the Application.mk and Android.mk from the documentation above and put it inside $(pwd)/gstreamer

# Enable only one of the following at one time to create the shared object for the targeted ABI
echo "building for armeabi-v7a. libgstreamer_android.so will be placed in $(pwd)/armeabi-v7a"
ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=armeabi-v7a NDK_LIBS_OUT=$(pwd)

#echo "building for arm64-v8a. libgstreamer_android.so will be placed in $(pwd)/arm64-v8a"
#ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=arm64-v8a NDK_LIBS_OUT=$(pwd)

#echo "building for x86_64. libgstreamer_android.so will be placed in $(pwd)/x86_64"
#ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=x86_64 NDK_LIBS_OUT=$(pwd)

#echo "building for x86. libgstreamer_android.so will be placed in $(pwd)/x86"
#ndk-build -C $(pwd)/gstreamer "NDK_APPLICATION_MK=Application.mk" APP_ABI=x86 NDK_LIBS_OUT=$(pwd)

Once the shared object (libgstreamer_android.so) is built, place it in your Android app so that it can be loaded by the Speech SDK.
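One common way to bundle the shared object is to copy it into the app's jniLibs folder, from which the Android packager picks it up automatically. This is a sketch assuming a standard Android Studio project layout; MyAndroidApp and the ABI directory are placeholders for your own project:

```shell
# Illustrative only: paths below are placeholders for your project.
ABI=armeabi-v7a
APP_ROOT=MyAndroidApp

# Stand-in for the real ndk-build output so this sketch is self-contained;
# in practice libgstreamer_android.so already exists under $ABI/ after the
# build steps above.
mkdir -p "$ABI"
touch "$ABI/libgstreamer_android.so"

# Copy the shared object into the app's jniLibs directory for that ABI.
mkdir -p "$APP_ROOT/app/src/main/jniLibs/$ABI"
cp "$ABI/libgstreamer_android.so" "$APP_ROOT/app/src/main/jniLibs/$ABI/"
```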


Example code using codec compressed audio input

To stream a compressed audio format to the Speech service, create a PullAudioInputStream or PushAudioInputStream. Then create an AudioConfig from an instance of your stream class, specifying the compression format of the stream.

Let's assume that you have an input stream class called pushStream and are using OPUS/OGG. Your code may look like this:

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// ... omitted for brevity

var speechConfig =
    SpeechConfig.FromSubscription(
        "YourSubscriptionKey",
        "YourServiceRegion");

// Create an audio config specifying the compressed
// audio format and the instance of your input stream class.
var audioFormat =
    AudioStreamFormat.GetCompressedFormat(
        AudioStreamContainerFormat.OGG_OPUS);
var audioConfig =
    AudioConfig.FromStreamInput(
        pushStream,
        audioFormat);

using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);
var result = await recognizer.RecognizeOnceAsync();

var text = result.Text;

To stream a compressed audio format to the Speech service, create a PullAudioInputStream or PushAudioInputStream. Then create an AudioConfig from an instance of your stream class, specifying the compression format of the stream.

Let's assume that you have an input stream class called pushStream and are using OPUS/OGG. Your code may look like this:

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;

// ... omitted for brevity

auto config =
    SpeechConfig::FromSubscription(
        "YourSubscriptionKey",
        "YourServiceRegion"
    );

auto audioFormat =
    AudioStreamFormat::GetCompressedFormat(
        AudioStreamContainerFormat::OGG_OPUS
    );
auto audioConfig =
    AudioConfig::FromStreamInput(
        pushStream,
        audioFormat
    );

auto recognizer = SpeechRecognizer::FromConfig(config, audioConfig);
auto result = recognizer->RecognizeOnceAsync().get();

auto text = result->Text;

To stream a compressed audio format to the Speech service, create a PullAudioInputStream or PushAudioInputStream. Then create an AudioConfig from an instance of your stream class, specifying the compression format of the stream.

Let's assume that you have an input stream class called pullStream and are using OPUS/OGG. Your code may look like this:

import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechRecognitionResult;
import com.microsoft.cognitiveservices.speech.SpeechRecognizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import com.microsoft.cognitiveservices.speech.audio.AudioInputStream;
import com.microsoft.cognitiveservices.speech.audio.AudioStreamContainerFormat;
import com.microsoft.cognitiveservices.speech.audio.AudioStreamFormat;
import com.microsoft.cognitiveservices.speech.audio.PullAudioInputStream;

// ... omitted for brevity

SpeechConfig speechConfig =
    SpeechConfig.fromSubscription(
        "YourSubscriptionKey",
        "YourServiceRegion");

// Create an audio config specifying the compressed
// audio format and the instance of your input stream class.
AudioStreamFormat audioFormat = 
    AudioStreamFormat.getCompressedFormat(
        AudioStreamContainerFormat.OGG_OPUS);
AudioConfig audioConfig =
    AudioConfig.fromStreamInput(
        pullStream,
        audioFormat);

SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, audioConfig);
SpeechRecognitionResult result = recognizer.recognizeOnceAsync().get();

String text = result.getText();

To stream in a compressed audio format to the Speech service, create PullAudioInputStream or PushAudioInputStream. Then, create an AudioConfig from an instance of your stream class, specifying the compression format of the stream.

Let's assume that your use case is to use PullStream for an MP3 file. Your code may look like this:


import time

import azure.cognitiveservices.speech as speechsdk

class BinaryFileReaderCallback(speechsdk.audio.PullAudioInputStreamCallback):
    def __init__(self, filename: str):
        super().__init__()
        self._file_h = open(filename, "rb")

    def read(self, buffer: memoryview) -> int:
        print('trying to read {} frames'.format(buffer.nbytes))
        try:
            size = buffer.nbytes
            frames = self._file_h.read(size)

            buffer[:len(frames)] = frames
            print('read {} frames'.format(len(frames)))

            return len(frames)
        except Exception as ex:
            print('Exception in `read`: {}'.format(ex))
            raise

    def close(self) -> None:
        print('closing file')
        try:
            self._file_h.close()
        except Exception as ex:
            print('Exception in `close`: {}'.format(ex))
            raise

def compressed_stream_helper(compressed_format,
        mp3_file_path,
        default_speech_auth):
    callback = BinaryFileReaderCallback(mp3_file_path)
    stream = speechsdk.audio.PullAudioInputStream(stream_format=compressed_format, pull_stream_callback=callback)

    speech_config = speechsdk.SpeechConfig(**default_speech_auth)
    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    done = False

    def stop_cb(evt):
        """callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True

    # Connect callbacks to the events fired by the speech recognizer
    speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # stop continuous recognition on either session stopped or canceled events
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    # Start continuous speech recognition
    speech_recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)

    speech_recognizer.stop_continuous_recognition()

def pull_audio_input_stream_compressed_mp3(mp3_file_path: str,
        default_speech_auth):
    # Create a compressed format
    compressed_format = speechsdk.audio.AudioStreamFormat(compressed_stream_format=speechsdk.AudioStreamContainerFormat.MP3)
    compressed_stream_helper(compressed_format, mp3_file_path, default_speech_auth)
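Independent of the Speech SDK, the read callback's contract — fill the supplied buffer with as many bytes as are available and return the count, where 0 signals end of stream — can be exercised against a plain in-memory buffer. A minimal sketch (the class name is illustrative):

```python
import io

class BinaryReaderSketch:
    """Mimics the PullAudioInputStreamCallback.read contract on plain bytes."""

    def __init__(self, data: bytes):
        self._fh = io.BytesIO(data)

    def read(self, buffer: memoryview) -> int:
        # Fill the buffer with up to buffer.nbytes bytes and report how many
        # were written; returning 0 tells the caller the stream has ended.
        chunk = self._fh.read(buffer.nbytes)
        buffer[:len(chunk)] = chunk
        return len(chunk)

reader = BinaryReaderSketch(b"abcdef")
buf = bytearray(4)
print(reader.read(memoryview(buf)))  # 4 (buf now holds b"abcd")
print(reader.read(memoryview(buf)))  # 2 (remaining b"ef")
print(reader.read(memoryview(buf)))  # 0 (end of stream)
```

The BinaryFileReaderCallback class above follows the same pattern, reading from a file handle instead of an in-memory buffer.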

Next steps