Deploy models with Triton Inference Server

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Deploy an ONNX model to an Azure Machine Learning managed online endpoint by using NVIDIA Triton Inference Server for optimized, no-code inference. Triton handles model serving for popular frameworks like TensorFlow, ONNX Runtime, PyTorch, and NVIDIA TensorRT, and you can use it for CPU or GPU workloads.

There are two approaches for deploying Triton models to online endpoints:

- No-code deployment: Bring only Triton models. No scoring script or custom environment is required.
- Full-code deployment (bring your own container): Full control over the Triton Inference Server configuration.

For both options, Triton Inference Server performs inferencing based on the Triton model repository structure defined by NVIDIA. You can use ensemble models for more advanced scenarios. Azure Machine Learning supports Triton in both managed online endpoints and Kubernetes online endpoints.

This article walks through no-code deployment by using the Azure CLI, Python SDK v2, and Azure Machine Learning studio. For full-code deployment with a custom Triton container, see Use a custom container to deploy a model and the BYOC example for Triton (deployment definition and end-to-end script).

Note

Use of the NVIDIA Triton Inference Server container is governed by the NVIDIA AI Enterprise Software license agreement. You can use the container for 90 days without an enterprise product subscription. For more information, see NVIDIA AI Enterprise on Azure Machine Learning.

Prerequisites

The Azure CLI and the ml extension to the Azure CLI, installed and configured. For more information, see Install and set up the CLI (v2).
A Bash shell or a compatible shell, for example, a shell on a Linux system or Windows Subsystem for Linux. The Azure CLI examples in this article assume that you use this type of shell.
An Azure Machine Learning workspace. For instructions to create a workspace, see Set up.

Your Azure account must have the Owner or Contributor role on the Azure Machine Learning workspace, or a custom role that allows Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more information, see Manage access to Azure Machine Learning workspaces.
A working Python 3.10 or later environment.
You must have other Python packages installed for scoring. They include:
- NumPy. An array and numerical computing library.
- Triton Inference Server Client. Facilitates requests to the Triton Inference Server.
- Pillow. A library for image operations.
- Gevent. A networking library used for connecting to the Triton server.
```
pip install numpy
pip install tritonclient[http]
pip install pillow
pip install gevent
```
Access to NCasT4_v3-series VMs for your Azure subscription.

Important

You might need to request a quota increase for your subscription before you can use this series of VMs. For more information, see NCasT4_v3-series.

NVIDIA Triton Inference Server requires a specific model repository structure, with a directory for each model and a subdirectory for each model version. The contents of each model version subdirectory are determined by the type of the model and the requirements of the backend that supports the model. For information about the structure for all models, see Model Files.

The example in this article uses a model stored in ONNX format. The model repository follows this structure:
```
models/
└── model_1/
    └── 1/
        └── model.onnx
```
For no-code deployment, Triton autogenerates the model configuration (config.pbtxt). If you need to customize the configuration, use full-code deployment with a custom container instead.
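
To check a local folder against this layout before you deploy, a short script can help. The following is a minimal sketch and isn't part of the azureml-examples repo. The function name `list_model_versions` is hypothetical, and the check is deliberately loose: it only confirms that each model directory contains numerically named version subdirectories that hold at least one file.

```python
from pathlib import Path


def list_model_versions(repo_root):
    """Map each model directory to its sorted numeric version subdirectories.

    Hypothetical helper: a version counts only if its directory name is a
    number and the directory contains at least one file (for example, model.onnx).
    """
    versions = {}
    for model_dir in sorted(Path(repo_root).iterdir()):
        if not model_dir.is_dir():
            continue
        found = [
            int(version_dir.name)
            for version_dir in model_dir.iterdir()
            if version_dir.is_dir()
            and version_dir.name.isdecimal()
            and any(child.is_file() for child in version_dir.iterdir())
        ]
        versions[model_dir.name] = sorted(found)
    return versions
```

For the repository shown earlier, `list_model_versions("models")` would return a dictionary with one entry for `model_1` that lists version 1. A model that maps to an empty list has no usable version directory.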

The information in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy/paste YAML and other files, clone the repo and then change directories to the cli directory in the repo:

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples
cd cli

If you haven't already set the defaults for the Azure CLI, save your default settings. To avoid passing in the values for your subscription, workspace, and resource group multiple times, use the following commands. Replace the following parameters with values for your specific configuration:

- Replace <subscription> with your Azure subscription ID.
- Replace <workspace> with your Azure Machine Learning workspace name.
- Replace <resource-group> with the Azure resource group that contains your workspace.
- Replace <location> with the Azure region that contains your workspace.

Tip

You can see what your current defaults are by using the az configure -l command.

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

APPLIES TO: Python SDK azure-ai-ml v2 (current)

An Azure Machine Learning workspace. For steps for creating a workspace, see Create the workspace.
The Azure Machine Learning SDK for Python v2. To install the SDK, use the following command:
```
pip install azure-ai-ml azure-identity
```
To update an existing installation of the SDK to the latest version, use the following command:
```
pip install --upgrade azure-ai-ml azure-identity
```
For more information, see Azure Machine Learning Package client library for Python.

Your Azure account must have the Owner or Contributor role on the Azure Machine Learning workspace, or a custom role that allows Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more information, see Manage access to Azure Machine Learning workspaces.
A working Python 3.10 or later environment.
You must have other Python packages installed for scoring. They include:
- NumPy. An array and numerical computing library.
- Triton Inference Server Client. Facilitates requests to the Triton Inference Server.
- Pillow. A library for image operations.
- Gevent. A networking library used for connecting to the Triton server.
```
pip install numpy
pip install tritonclient[http]
pip install pillow
pip install gevent
```
Access to NCasT4_v3-series VMs for your Azure subscription.

Important

You might need to request a quota increase for your subscription before you can use this series of VMs. For more information, see NCasT4_v3-series.

The information in this article is based on the online-endpoints-triton.ipynb notebook contained in the azureml-examples repository. To run the commands locally without having to copy and paste files, clone the repo, and then change directories to the sdk/python/endpoints/online/triton/single-model/ directory in the repo:

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/sdk/python/endpoints/online/triton/single-model/

Define the deployment configuration

Configure the endpoint and deployment resources that define how Triton serves your model. The endpoint specifies the name and authentication mode, while the deployment defines the model, VM type, and instance count.

Tip

For simplicity, the Azure CLI example uses Azure Machine Learning token authentication (aml_token), and the Python SDK example uses key-based authentication (key). For production deployments, Microsoft recommends Microsoft Entra token-based authentication (aad_token), which provides enhanced security through identity-based access control. For more information, see Authenticate clients for online endpoints.

APPLIES TO: Azure CLI ml extension v2 (current)

Important

For Triton no-code deployment, testing via local endpoints isn't currently supported.

Set a BASE_PATH environment variable. This variable points to the directory where the model and associated YAML configuration files are located:
```
BASE_PATH=endpoints/online/triton/single-model
```
Set the name of the endpoint. In this example, a random name is created for the endpoint:
```
export ENDPOINT_NAME="triton-single-endpt-$RANDOM"
```
Create a YAML configuration file for your endpoint. The following example configures the name and authentication mode of the endpoint. The file is located at /cli/endpoints/online/triton/single-model/create-managed-endpoint.yaml in the azureml-examples repo you cloned earlier:

create-managed-endpoint.yaml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: aml_token

Create a YAML configuration file for the deployment. The following example configures a deployment named blue to the endpoint. The file is located at /cli/endpoints/online/triton/single-model/create-managed-deployment.yaml in the azureml-examples repo:

Important

For Triton no-code deployment to work, set the model's type field to triton_model (type: triton_model). For more information, see CLI (v2) model YAML schema.

This deployment uses a Standard_NC4as_T4_v3 VM. You might need to request a quota increase for your subscription before you can use this VM. For more information, see NCasT4_v3-series.

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  name: sample-densenet-onnx-model
  version: 1
  path: ./models
  type: triton_model
instance_count: 1
instance_type: Standard_NC4as_T4_v3
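
Because no-code deployment depends on the model type, it can be worth checking a deployment definition before you submit it. The following is a rough sketch under stated assumptions: the helper name `find_no_code_problems` is hypothetical, the definition is assumed to be already loaded into a Python dictionary, and only the model fields that the check reads are shown. A model given as a reference string to a registered model is skipped, because its type is set at registration.

```python
def find_no_code_problems(deployment):
    """Return a list of problems that would block Triton no-code deployment.

    Hypothetical helper: only inline model definitions are inspected.
    """
    problems = []
    model = deployment.get("model")
    if model is None:
        problems.append("model is missing")
    elif isinstance(model, dict) and model.get("type") != "triton_model":
        problems.append("model.type must be triton_model")
    return problems


# Mirrors the model section of the YAML file in this article.
deployment = {
    "name": "blue",
    "endpoint_name": "my-endpoint",
    "model": {
        "name": "sample-densenet-onnx-model",
        "version": 1,
        "path": "./models",
        "type": "triton_model",
    },
}
print(find_no_code_problems(deployment))
```

An empty list means the check found nothing to fix.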

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Important

For Triton no-code deployment, testing via local endpoints isn't currently supported.

To connect to a workspace, you need identifier parameters: a subscription, resource group, and workspace name.

subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace_name = "<WORKSPACE_NAME>"

Set the endpoint name. This example creates a random name:

import random

endpoint_name = f"endpoint-{random.randint(0, 10000)}"

Pass the identifier parameters that you configured earlier to MLClient from azure.ai.ml to get a handle to the required Azure Machine Learning workspace. For more details on how to configure credentials and connect to a workspace, see the configuration notebook.
```
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id,
    resource_group,
    workspace_name,
)
```

Create a ManagedOnlineEndpoint object to configure the endpoint name and authentication mode.

from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode="key")

Create a ManagedOnlineDeployment object to configure a deployment named blue with a local model defined inline.

from azure.ai.ml.entities import ManagedOnlineDeployment, Model

model_name = "densenet-onnx-model"
model_version = "1"

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=Model(
        name=model_name, 
        version=model_version,
        path="./models",
        type="triton_model"
    ),
    instance_type="Standard_NC4as_T4_v3",
    instance_count=1,
)

Use Azure Machine Learning studio to define the Triton deployment on a managed online endpoint.

Register your model in Triton format by using the following YAML and CLI command. The YAML uses a densenet-onnx model from azureml-examples/cli/endpoints/online/triton/single-model.

create-triton-model.yaml
```
name: densenet-onnx-model
version: 1
path: ./models
type: triton_model
description: Registering my Triton format model.
```
```
az ml model create -f create-triton-model.yaml
```
After registration, the model appears on the Models page of Azure Machine Learning studio.

In studio, select your workspace, and then create the endpoint deployment from either the Endpoints page or the Models page.

From the Endpoints page:
1. Select Create.
2. Select the Triton model that you registered earlier, and then select Select.
3. Continue with the deployment. A model registered in Triton format doesn't need a scoring script or environment.

From the Models page:
1. Select the Triton model, and then select Use this model > Real-time endpoint.

Deploy to Azure

Create the endpoint and deployment resources in Azure by using the configuration from the previous section.

APPLIES TO: Azure CLI ml extension v2 (current)

Create the endpoint by using the YAML configuration:

az ml online-endpoint create -n $ENDPOINT_NAME -f $BASE_PATH/create-managed-endpoint.yaml

Create the deployment by using the YAML configuration:

az ml online-deployment create --name blue --endpoint $ENDPOINT_NAME -f $BASE_PATH/create-managed-deployment.yaml --all-traffic

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Create the endpoint by using the ManagedOnlineEndpoint object:

# begin_create_or_update returns a poller; result() waits for completion
# and returns the created endpoint
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Create the deployment by using the ManagedOnlineDeployment object:

ml_client.online_deployments.begin_create_or_update(deployment).result()

After the deployment finishes, its traffic value is set to 0%. Update the traffic value to 100%:

endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Test the endpoint

After deployment finishes, send a scoring request to verify that the endpoint returns predictions. A Triton deployment exposes the Triton inference protocol rather than the JSON contract of a custom scoring script, so you send requests with the tritonclient library on the client side.
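
For context on what the tritonclient library sends, Triton's HTTP API follows the KServe v2 inference protocol: an inference call is a POST to a model-specific path, with a body that names the input tensors and the requested outputs. The following sketch builds a JSON-only version of such a request by using the tensor names from this article (data_0 and fc6_1). The helper names and the tiny 1 x 4 tensor are hypothetical. In practice, tritonclient builds the request for you and by default sends tensor data in binary form rather than as JSON numbers.

```python
import json


def infer_path(model_name, model_version):
    """Build the relative URL of the inference endpoint for one model version."""
    return f"v2/models/{model_name}/versions/{model_version}/infer"


def build_infer_body(input_name, shape, datatype, flat_values, output_name):
    """Build a JSON-only inference request body.

    flat_values holds the tensor elements in row-major order.
    """
    expected = 1
    for dim in shape:
        expected *= dim
    if len(flat_values) != expected:
        raise ValueError(
            f"shape {shape} needs {expected} values, got {len(flat_values)}"
        )
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(flat_values),
            }
        ],
        "outputs": [{"name": output_name}],
    }


# Hypothetical 1 x 4 tensor; the real DenseNet input is a preprocessed image.
body = build_infer_body("data_0", (1, 4), "FP32", [0.1, 0.2, 0.3, 0.4], "fc6_1")
print(infer_path("model_1", "1"))
print(json.dumps(body, indent=2))
```

The sketch only illustrates the request shape. For real scoring, use tritonclient as described in the rest of this section.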

APPLIES TO: Azure CLI ml extension v2 (current)

Tip

The file /cli/endpoints/online/triton/single-model/triton_densenet_scoring.py in the azureml-examples repo is used for scoring. The image you pass to the endpoint needs preprocessing to meet the size, type, and format requirements, and post-processing to show the predicted label. The triton_densenet_scoring.py file uses the tritonclient.http library to communicate with the Triton inference server. This file runs on the client side.

Get the endpoint scoring URI:

scoring_uri=$(az ml online-endpoint show -n $ENDPOINT_NAME --query scoring_uri -o tsv)
scoring_uri=${scoring_uri%/*}

Get an authentication token:

auth_token=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME --query accessToken -o tsv)

Score data with the endpoint. This command submits the image of a peacock to the endpoint:

python $BASE_PATH/triton_densenet_scoring.py --base_url=$scoring_uri --token=$auth_token --image_path $BASE_PATH/data/peacock.jpg

The response from the script is similar to the following response:

Is server ready - True
Is model ready - True
/azureml-examples/cli/endpoints/online/triton/single-model/densenet_labels.txt
84 : PEACOCK

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Get the endpoint scoring URI:

endpoint = ml_client.online_endpoints.get(endpoint_name)
scoring_uri = endpoint.scoring_uri
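
The Triton HTTP client takes a host name rather than a full URL, which is why the scoring code later strips the scheme from scoring_uri. As a sketch of a stricter alternative, the hypothetical helper `triton_host` below uses the standard library to keep only the host part. It assumes that the endpoint is served over HTTPS from the root of the host, and the example URI is made up for illustration.

```python
from urllib.parse import urlsplit


def triton_host(scoring_uri):
    """Return the host[:port] part of a scoring URI, without scheme or path."""
    parts = urlsplit(scoring_uri)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https URI, got {scoring_uri!r}")
    return parts.netloc


# Hypothetical scoring URI, for illustration only.
print(triton_host("https://my-endpoint.eastus.inference.ml.azure.com/score"))
```

Unlike fixed-length slicing, this approach fails loudly if the URI doesn't have the expected scheme.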

Get an authentication key:

keys = ml_client.online_endpoints.get_keys(endpoint_name)
auth_key = keys.primary_key

Score the endpoint by using the Triton Inference Server Client. The following code submits the image of a peacock to the endpoint. It uses a prepost helper module (prepost.py) that handles image preprocessing and label postprocessing. This file is included in the cloned repo, along with densenet_labels.txt.

Important

Run this code from the sdk/python/endpoints/online/triton/single-model/ directory in the cloned azureml-examples repo. The prepost module and densenet_labels.txt file must be in the working directory. If you didn't clone the repo, download prepost.py and densenet_labels.txt to your working directory.

# Test the blue deployment with some sample data
import requests
import gevent.ssl
import numpy as np
import tritonclient.http as tritonhttpclient
from pathlib import Path
import prepost  # Local helper from the cloned repo (prepost.py)

img_uri = "http://aka.ms/peacock-pic"

# Remove the "https://" scheme; the Triton client expects only the host
url = scoring_uri.removeprefix("https://")

# Initialize the client handler
triton_client = tritonhttpclient.InferenceServerClient(
    url=url,
    ssl=True,
    ssl_context_factory=gevent.ssl._create_default_https_context,
)

# Create headers
headers = {}
headers["Authorization"] = f"Bearer {auth_key}"

# Check the status of the Triton server
health_ctx = triton_client.is_server_ready(headers=headers)
print("Is server ready - {}".format(health_ctx))

# Check the status of the model
model_name = "model_1"
status_ctx = triton_client.is_model_ready(model_name, "1", headers)
print("Is model ready - {}".format(status_ctx))

if Path(img_uri).exists():
    # Read a local image file
    img_content = Path(img_uri).read_bytes()
else:
    # Otherwise, download the image
    agent = f"Python Requests/{requests.__version__} (https://github.com/Azure/azureml-examples)"
    img_content = requests.get(img_uri, headers={"User-Agent": agent}).content

img_data = prepost.preprocess(img_content)

# Populate inputs and outputs
infer_input = tritonhttpclient.InferInput("data_0", img_data.shape, "FP32")
infer_input.set_data_from_numpy(img_data)
inputs = [infer_input]
output = tritonhttpclient.InferRequestedOutput("fc6_1")
outputs = [output]

result = triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
max_label = np.argmax(result.as_numpy("fc6_1"))
label_name = prepost.postprocess(max_label)
print(label_name)

The response from the script is similar to the following response:

Is server ready - True
Is model ready - True
/azureml-examples/sdk/python/endpoints/online/triton/single-model/densenet_labels.txt
84 : PEACOCK
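
The last line of the output comes from picking the highest-scoring class and looking up its label. The following standard-library sketch shows that idea on a made-up three-class example. It isn't the code in prepost.py, which might work differently. The helper names are hypothetical, and the uppercase formatting is an assumption made only to mirror the sample output.

```python
def argmax(scores):
    """Return the index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def label_for(index, labels):
    """Format a prediction as 'index : LABEL', like the sample output."""
    return f"{index} : {labels[index].strip().upper()}"


# Hypothetical three-class example; the real model scores many more classes.
labels = ["goldfish", "peacock", "tabby cat"]
scores = [0.05, 0.90, 0.05]
print(label_for(argmax(scores), labels))
```

In the scoring code, np.argmax plays the role of `argmax`, and prepost.postprocess plays the role of `label_for`.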

Delete the endpoint and model

APPLIES TO: Azure CLI ml extension v2 (current)

When you're done with the endpoint, delete it:

az ml online-endpoint delete -n $ENDPOINT_NAME --yes

Archive your model:

az ml model archive --name sample-densenet-onnx-model --version 1

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Delete the endpoint. Deleting the endpoint also deletes child deployments, but it doesn't archive associated environments or models.
```
ml_client.online_endpoints.begin_delete(name=endpoint_name)
```

Archive the model:

ml_client.models.archive(name=model_name, version=model_version)

Last updated on 2026-04-10
