High-performance serving with Triton Inference Server (Preview)

Learn how to use NVIDIA Triton Inference Server to improve the performance of the web service used for model inference.

One way to deploy a model for inference is as a web service, for example a deployment to Azure Kubernetes Service or Azure Container Instances. By default, Azure Machine Learning uses a single-threaded, general-purpose web framework for web service deployments.

Triton is a framework that is optimized for inference. It provides better utilization of GPUs and more cost-effective inference. On the server side, it batches incoming requests and submits those batches for inference. Batching makes better use of GPU resources and is a key part of Triton's performance.

Important

Using Triton for deployment from Azure Machine Learning is currently in preview. Preview functionality may not be covered by customer support. For more information, see the Supplemental terms of use for Microsoft Azure previews.

Tip

The code snippets in this document are for illustrative purposes and may not show a complete solution. For working example code, see the end-to-end Triton samples in Azure Machine Learning.

Prerequisites

Architectural overview

Before attempting to use Triton for your own model, it's important to understand how it works with Azure Machine Learning and how it compares to a default deployment.

Default deployment without Triton

  • Multiple Gunicorn workers are started to handle incoming requests concurrently.
  • These workers handle pre-processing, calling the model, and post-processing.
  • Inference requests use the scoring URI. For example, https://myservice.azureml.net/score.

Diagram of a standard, non-Triton deployment architecture

Setting the number of workers

To set the number of workers in your deployment, set the environment variable WORKER_COUNT. Given an Environment object called env, you can do the following:

env.environment_variables["WORKER_COUNT"] = "1"

This tells Azure ML to start the number of workers you specify.
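
For example, a minimal sketch of the full flow, assuming a workspace object ws; the environment name and worker count here are placeholders:

from azureml.core import Environment

# Clone a curated environment (names are examples) and set the worker count
env = Environment.get(ws, "AzureML-Minimal").clone("my-env")
env.environment_variables["WORKER_COUNT"] = "2"  # start two Gunicorn workers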

Inference configuration deployment with Triton

  • Multiple Gunicorn workers are started to handle incoming requests concurrently.
  • The requests are forwarded to the Triton server.
  • Triton processes requests in batches to maximize GPU utilization.
  • The client uses the scoring URI to make requests. For example, https://myservice.azureml.net/score.

Diagram of an inference configuration deployment with Triton

The workflow to use Triton for your model deployment is:

  1. Verify that Triton can serve your model.
  2. Verify that you can send requests to your Triton-deployed model.
  3. Incorporate your Triton-specific code into your AML deployment.

Verify that Triton can serve your model

First, follow the steps below to verify that the Triton Inference Server can serve your model.

(Optional) Define a model config file

The model configuration file tells Triton how many inputs to expect and what dimensions those inputs will have. For more information on creating the configuration file, see Model configuration in the NVIDIA documentation.

Tip

We use the --strict-model-config=false option when starting the Triton Inference Server, which means you do not need to provide a config.pbtxt file for ONNX or TensorFlow models.

For more information on this option, see Generated model configuration in the NVIDIA documentation.

Use the correct directory structure

When registering a model with Azure Machine Learning, you can register either individual files or a directory structure. To use Triton, the model registration must be for a directory structure that contains a directory named triton. The general structure of this directory is:

models
    - triton
        - model_1
            - model_version
                - model_file
                - config_file
        - model_2
            ...

Important

This directory structure is a Triton Model Repository and is required for your model(s) to work with Triton. For more information, see Triton Model Repositories in the NVIDIA documentation.

Test with Triton and Docker

To test that your model runs with Triton, you can use Docker. The following commands pull the Triton container to your local computer and then start the Triton server:

  1. To pull the image for the Triton server to your local computer, use the following command:

    docker pull nvcr.io/nvidia/tritonserver:20.09-py3
    
  2. To start the Triton server, use the following command. Replace <path-to-models/triton> with the path to the Triton model repository that contains your models:

    docker run --rm -ti -v<path-to-models/triton>:/models nvcr.io/nvidia/tritonserver:20.09-py3 tritonserver --model-repository=/models --strict-model-config=false
    

    Important

    If you are using Windows, you may be prompted to allow network connections to this process the first time you run the command. If so, select the option to enable access.

    Once the server starts, information similar to the following text is logged to the command line:

    I0923 19:21:30.582866 1 http_server.cc:2705] Started HTTPService at 0.0.0.0:8000
    I0923 19:21:30.626081 1 http_server.cc:2724] Started Metrics Service at 0.0.0.0:8002
    

    The first line indicates the address of the web service. In this case, 0.0.0.0:8000, which is the same as localhost:8000.

  3. Use a utility such as curl to access the health endpoint.

    curl -L -v -i localhost:8000/v2/health/ready
    

    This command returns information similar to the following. Note the 200 OK; this status means the web server is running.

    *   Trying 127.0.0.1:8000...
    * Connected to localhost (127.0.0.1) port 8000 (#0)
    > GET /v2/health/ready HTTP/1.1
    > Host: localhost:8000
    > User-Agent: curl/7.71.1
    > Accept: */*
    >
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    HTTP/1.1 200 OK
    

Beyond a basic health check, you can create a client to send data to Triton for inference. For more information on creating a client, see the client examples in the NVIDIA documentation. There are also Python samples in the Triton GitHub repository.
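
As a starting point, here's a minimal sketch that uses the tritonhttpclient package (the same client used later in this article) against the locally running Docker container; the model name bidaf_onnx is only an example:

import tritonhttpclient

# Connect to the Triton server started by the docker run command above
triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000")

# Check that the server has started and loaded its models
print(triton_client.is_server_ready())

# Inspect the input/output names, datatypes, and shapes of a loaded model
print(triton_client.get_model_metadata("bidaf_onnx"))  # example model name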

For more information on running Triton using Docker, see Running Triton on a system with a GPU and Running Triton on a system without a GPU.

Register your model

Now that you've verified that your model works with Triton, register it with Azure Machine Learning. Model registration stores your model files in the Azure Machine Learning workspace; the registered model is then used when deploying with the Python SDK or the Azure CLI.

The following example demonstrates how to register the model(s):

import os

from azureml.core.model import Model

model = Model.register(
    model_path=os.path.join("..", "triton"),
    model_name="bidaf_onnx",
    tags={'area': "Natural language processing", 'type': "Question answering"},
    description="Question answering model from ONNX model zoo",
    workspace=ws
)

Verify you can call into your model

After verifying that Triton is able to serve your model, you can add pre- and post-processing code by defining an entry script. This file is named score.py. For more information on entry scripts, see Define an entry script.

The two main steps are to initialize a Triton HTTP client in your init() method, and to call into that client in your run() function.

Initialize the Triton client

Include code like the following example in your score.py file. Triton in Azure Machine Learning expects to be addressed on localhost, port 8000. In this case, localhost refers to the inside of the Docker image for this deployment, not a port on your local machine:

Tip

The tritonhttpclient pip package is included in the curated AzureML-Triton environment, so there's no need to specify it as a pip dependency.

import tritonhttpclient

def init():
    global triton_client
    triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000")

Modify your scoring script to call into Triton

The following examples demonstrate how to dynamically request the metadata for the model and then call it for inference:

Tip

You can dynamically request the metadata of models that have been loaded with Triton by using the .get_model_metadata method of the Triton client, as sketched below. See the sample notebook for an example of its use.
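
For example, a minimal sketch (the model name and the single-input assumption are placeholders) that reads the first input and output descriptions from the metadata:

meta = triton_client.get_model_metadata("bidaf_onnx")  # example model name
input_name = meta["inputs"][0]["name"]
datatype = meta["inputs"][0]["datatype"]
output_name = meta["outputs"][0]["name"]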

# Populate an InferInput from a NumPy array
input = tritonhttpclient.InferInput(input_name, data.shape, datatype)
input.set_data_from_numpy(data, binary_data=binary_data)

# Describe the output to request from the server
output = tritonhttpclient.InferRequestedOutput(
         output_name, binary_data=binary_data, class_count=class_count)

# Run inference
res = triton_client.infer(model_name,
                          [input],
                          request_id='0',
                          outputs=[output])
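
The infer call returns a result object; a common follow-up step (a sketch, assuming the requested output is a tensor) is to read it back as a NumPy array:

# Convert the requested output back into a NumPy array
result = res.as_numpy(output_name)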

Redeploy with an inference configuration

An inference configuration allows you to use an entry script, as well as the Azure Machine Learning deployment process using the Python SDK or Azure CLI.

Important

You must specify the AzureML-Triton curated environment.

The Python code example clones AzureML-Triton into another environment called My-Triton. The Azure CLI code also uses this environment. For more information on cloning an environment, see the Environment.Clone() reference.

import os

from azureml.core.webservice import LocalWebservice
from azureml.core import Environment
from azureml.core.model import InferenceConfig, Model


local_service_name = "triton-bidaf-onnx"
env = Environment.get(ws, "AzureML-Triton").clone("My-Triton")

for pip_package in ["nltk"]:
    env.python.conda_dependencies.add_pip_package(pip_package)

inference_config = InferenceConfig(
    entry_script="score_bidaf.py",  # This entry script is where we dispatch a call to the Triton server
    source_directory=os.path.join("..", "scripts"),
    environment=env
)

local_config = LocalWebservice.deploy_configuration(
    port=6789
)

local_service = Model.deploy(
    workspace=ws,
    name=local_service_name,
    models=[model],
    inference_config=inference_config,
    deployment_config=local_config,
    overwrite=True)

local_service.wait_for_deployment(show_output = True)
print(local_service.state)
# Print the URI you can use to call the local deployment
print(local_service.scoring_uri)

After deployment completes, the scoring URI is displayed. For this local deployment, it is http://localhost:6789/score. If you deploy to the cloud, you can use the az ml service show CLI command to get the scoring URI.

For information on how to create a client that sends inference requests to the scoring URI, see Consume a model deployed as a web service.
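
For a quick local smoke test, here is a minimal sketch using the requests package; the JSON body is a placeholder, because the expected payload depends entirely on what your entry script's run() function parses:

import requests

# Placeholder payload; match this to what your score.py expects
payload = {
    "question": "What color is the fox?",
    "context": "The quick brown fox jumped over the lazy dog."
}

response = requests.post("http://localhost:6789/score", json=payload)
print(response.status_code)
print(response.text)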

Clean up resources

If you plan to continue using the Azure Machine Learning workspace but want to remove the deployed service, use one of the following options:

local_service.delete()
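
If you deployed to the cloud (for example, Azure Kubernetes Service or Azure Container Instances) rather than locally, a sketch of the equivalent cleanup, assuming the workspace object ws and the service name used earlier:

from azureml.core.webservice import Webservice

# Retrieve the deployed web service by name and delete it (name is an example)
service = Webservice(workspace=ws, name="triton-bidaf-onnx")
service.delete()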

Next steps