High-performance serving with Triton Inference Server

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Learn how to use NVIDIA Triton Inference Server in Azure Machine Learning with online endpoints.

Triton is multi-framework, open-source software that is optimized for inference. It supports popular machine learning frameworks like TensorFlow, ONNX Runtime, PyTorch, NVIDIA TensorRT, and more. It can be used for your CPU or GPU workloads.

There are two main approaches you can take when deploying Triton models to an online endpoint: no-code deployment or full-code (bring your own container) deployment.

  • No-code deployment for Triton models is the simpler way to deploy them, because you only need to bring the Triton models themselves.
  • Full-code (bring your own container) deployment for Triton models is a more advanced way to deploy them, because it gives you full control over customizing the configurations available for the Triton inference server.

For both options, Triton inference server will perform inferencing based on the Triton model as defined by NVIDIA. For instance, ensemble models can be used for more advanced scenarios.
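
For no-code deployment, the model files you register must already be arranged as a Triton model repository. The following is a minimal sketch of such a repository for a single ONNX model; the folder and file names are illustrative (they mirror the single-model example used later in this article) and follow NVIDIA's model repository layout:

models              # Triton model repository; registered later with path="./models"
└── model_1         # model name that Triton serves
    ├── config.pbtxt    # model configuration; optional for some backends, such as ONNX Runtime
    └── 1               # model version
        └── model.onnx  # the model file itself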

Triton is supported in both managed online endpoints and Kubernetes online endpoints.

In this article, you learn how to deploy a model using no-code deployment for Triton to a managed online endpoint. Information is provided on using the CLI (command line), Python SDK v2, and Azure Machine Learning studio. If you want to customize the Triton inference server's configuration directly, see Use a custom container to deploy a model and the BYOC example for Triton (deployment definition and end-to-end script).

Note

Use of the NVIDIA Triton Inference Server container is governed by the NVIDIA AI Enterprise Software license agreement. The container can be used for 90 days without an enterprise product subscription. For more information, see NVIDIA AI Enterprise on Azure Machine Learning.

Prerequisites

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Before following the steps in this article, make sure you have the following prerequisites:

  • An Azure Machine Learning workspace. If you don't have one, use the steps in the Quickstart: Create workspace resources article to create one.

  • To install the Python SDK v2, use the following command:

    pip install azure-ai-ml azure-identity
    

    To update an existing installation of the SDK to the latest version, use the following command:

    pip install --upgrade azure-ai-ml azure-identity
    

    For more information, see Install the Python SDK v2 for Azure Machine Learning.

  • A working Python 3.8 (or higher) environment.

  • You need additional Python packages installed for scoring. You can install them with the code below. They include:

    • NumPy - An array and numerical computing library
    • Triton Inference Server Client - Facilitates requests to the Triton Inference Server
    • Pillow - A library for image operations
    • Gevent - A networking library used when connecting to the Triton Server
    pip install numpy
    pip install tritonclient[http]
    pip install pillow
    pip install gevent
    
  • Access to NCv3-series VMs for your Azure subscription.

    Important

    You might need to request a quota increase for your subscription before you can use this series of VMs. For more information, see NCv3-series.

The information in this article is based on the online-endpoints-triton.ipynb notebook contained in the azureml-examples repository. To run the commands locally without having to copy/paste files, clone the repo, and then change directories to the sdk/python/endpoints/online/triton/single-model/ directory in the repo:

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/sdk/python/endpoints/online/triton/single-model/

Define the deployment configuration

APPLIES TO: Python SDK azure-ai-ml v2 (current)

This section shows how you can define a Triton deployment to deploy to a managed online endpoint using the Azure Machine Learning Python SDK (v2).

Important

For Triton no-code-deployment, testing via local endpoints is currently not supported.

  1. To connect to a workspace, you need identifier parameters: a subscription ID, resource group, and workspace name.

    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace_name = "<AML_WORKSPACE_NAME>"
    
  2. Use the following command to set the name of the endpoint that will be created. In this example, a random name is created for the endpoint:

    import random
    
    endpoint_name = f"endpoint-{random.randint(0, 10000)}"
    
  3. Use these details in the MLClient from azure.ai.ml to get a handle to the required Azure Machine Learning workspace. For more details on how to configure credentials and connect to a workspace, see the configuration notebook.

    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential
    
    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id,
        resource_group,
        workspace_name,
    )
    
  4. Create a ManagedOnlineEndpoint object to configure the endpoint. The following example configures the name and authentication mode of the endpoint.

    from azure.ai.ml.entities import ManagedOnlineEndpoint
    
    endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode="key")
    
  5. Create a ManagedOnlineDeployment object to configure the deployment. The following example configures a deployment named blue to the endpoint defined in the previous step and defines a local model inline.

    from azure.ai.ml.entities import ManagedOnlineDeployment, Model
    
    model_name = "densenet-onnx-model"
    model_version = 1
    
    deployment = ManagedOnlineDeployment(
        name="blue",
        endpoint_name=endpoint_name,
        model=Model(
            name=model_name, 
            version=model_version,
            path="./models",
            type="triton_model"
        ),
        instance_type="Standard_NC6s_v3",
        instance_count=1,
    )
    

Deploy to Azure

APPLIES TO: Python SDK azure-ai-ml v2 (current)

  1. To create a new endpoint using the ManagedOnlineEndpoint object, use the following command:

    # begin_create_or_update returns a poller; .result() waits for the operation
    # to finish and returns the created endpoint object
    endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    
  2. To create the deployment using the ManagedOnlineDeployment object, use the following command:

    # Wait for the deployment to finish before updating traffic
    ml_client.online_deployments.begin_create_or_update(deployment).result()
    
  3. When the deployment is created, the endpoint routes 0% of traffic to it by default. Update the traffic so that 100% of it goes to the blue deployment (an optional verification sketch follows this list):

    endpoint.traffic = {"blue": 100}
    # Submit the updated endpoint and wait for the traffic change to apply
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    
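
The following optional check isn't part of the example notebook; it's a minimal sketch that fetches the endpoint and prints its provisioning state and traffic split, so you can confirm the deployment is live before testing it.

# Optional: confirm the endpoint is provisioned and traffic is routed to "blue"
endpoint = ml_client.online_endpoints.get(name=endpoint_name)
print(f"Provisioning state: {endpoint.provisioning_state}")
print(f"Traffic: {endpoint.traffic}")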

Test the endpoint

APPLIES TO: Python SDK azure-ai-ml v2 (current)

  1. To get the endpoint scoring uri, use the following command:

    endpoint = ml_client.online_endpoints.get(endpoint_name)
    scoring_uri = endpoint.scoring_uri
    
  2. To get an authentication key, use the following command:

    keys = ml_client.online_endpoints.list_keys(endpoint_name)
    auth_key = keys.primary_key

  3. The following scoring code uses the Triton Inference Server Client to submit the image of a peacock to the endpoint. This script is available in the companion notebook to this example - Deploy a model to online endpoints using Triton.

    # Test the blue deployment with some sample data
    import requests
    import gevent.ssl
    import numpy as np
    import tritonclient.http as tritonhttpclient
    from pathlib import Path
    import prepost
    
    img_uri = "http://aka.ms/peacock-pic"
    
    # We remove the scheme from the url
    url = scoring_uri[8:]
    
    # Initialize client handler
    triton_client = tritonhttpclient.InferenceServerClient(
        url=url,
        ssl=True,
        ssl_context_factory=gevent.ssl._create_default_https_context,
    )
    
    # Create headers
    headers = {}
    headers["Authorization"] = f"Bearer {auth_key}"
    
    # Check status of triton server
    health_ctx = triton_client.is_server_ready(headers=headers)
    print("Is server ready - {}".format(health_ctx))
    
    # Check status of model
    # Use a separate variable for the Triton model repository folder name so the
    # registered model name ("densenet-onnx-model") stays available for later steps
    triton_model_name = "model_1"
    status_ctx = triton_client.is_model_ready(triton_model_name, "1", headers)
    print("Is model ready - {}".format(status_ctx))
    
    if Path(img_uri).exists():
        img_content = open(img_uri, "rb").read()
    else:
        agent = f"Python Requests/{requests.__version__} (https://github.com/Azure/azureml-examples)"
        img_content = requests.get(img_uri, headers={"User-Agent": agent}).content
    
    img_data = prepost.preprocess(img_content)
    
    # Populate inputs and outputs
    input = tritonhttpclient.InferInput("data_0", img_data.shape, "FP32")
    input.set_data_from_numpy(img_data)
    inputs = [input]
    output = tritonhttpclient.InferRequestedOutput("fc6_1")
    outputs = [output]
    
    result = triton_client.infer(triton_model_name, inputs, outputs=outputs, headers=headers)
    max_label = np.argmax(result.as_numpy("fc6_1"))
    label_name = prepost.postprocess(max_label)
    print(label_name)
    
  4. The response from the script is similar to the following text:

    Is server ready - True
    Is model ready - True
    /azureml-examples/sdk/endpoints/online/triton/single-model/densenet_labels.txt
    84 : PEACOCK
    

Delete the endpoint and model

APPLIES TO: Python SDK azure-ai-ml v2 (current)

  1. Delete the endpoint. Deleting the endpoint also deletes any child deployments; however, it doesn't archive the associated environments or models.

    ml_client.online_endpoints.begin_delete(name=endpoint_name)
    
  2. Archive the model with the following code.

    ml_client.models.archive(name=model_name, version=model_version)
    

Next steps

To learn more, review these articles: