High-performance serving with Triton Inference Server

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Learn how to use NVIDIA Triton Inference Server in Azure Machine Learning with online endpoints.

Triton is multi-framework, open-source software that is optimized for inference. It supports popular machine learning frameworks like TensorFlow, ONNX Runtime, PyTorch, NVIDIA TensorRT, and more. It can be used for your CPU or GPU workloads.

There are two main approaches you can take when deploying Triton models to an online endpoint: no-code deployment or full-code (bring your own container) deployment.

  • No-code deployment for Triton models is the simpler way to deploy them, because you only need to bring the Triton models themselves.
  • Full-code (bring your own container) deployment for Triton models is a more advanced way to deploy them, because it gives you full control over customizing the configurations available for the Triton inference server.

For both options, Triton inference server will perform inferencing based on the Triton model as defined by NVIDIA. For instance, ensemble models can be used for more advanced scenarios.
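
For no-code deployment, the model files you register must already be arranged as a Triton model repository. The following is a minimal sketch of such a repository for a single ONNX model; the folder and file names are illustrative (they mirror the single-model example used later in this article) and follow NVIDIA's model repository layout:

models              # Triton model repository; registered later with path="./models"
└── model_1         # model name that Triton serves
    ├── config.pbtxt    # model configuration; optional for some backends, such as ONNX Runtime
    └── 1               # model version
        └── model.onnx  # the model file itself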

Triton is supported in both managed online endpoints and Kubernetes online endpoints.

In this article, you learn how to deploy a model using no-code deployment for Triton to a managed online endpoint. Information is provided on using the CLI (command line), Python SDK v2, and Azure Machine Learning studio. If you want to customize the Triton inference server's configuration directly, see Use a custom container to deploy a model and the BYOC example for Triton (deployment definition and end-to-end script).

Note

Use of the NVIDIA Triton Inference Server container is governed by the NVIDIA AI Enterprise Software license agreement. The container can be used for 90 days without an enterprise product subscription. For more information, see NVIDIA AI Enterprise on Azure Machine Learning.

Prerequisites

APPLIES TO: Python SDK azure-ai-ml v2 (current)

Before following the steps in this article, make sure you have the following prerequisites:

  • An Azure Machine Learning workspace. If you don't have one, use the steps in the Quickstart: Create workspace resources article to create one.

  • To install the Python SDK v2, use the following command:

    pip install azure-ai-ml azure-identity
    

    To update an existing installation of the SDK to the latest version, use the following command:

    pip install --upgrade azure-ai-ml azure-identity
    

    For more information, see Install the Python SDK v2 for Azure Machine Learning.

  • A working Python 3.8 (or higher) environment.

  • You need additional Python packages installed for scoring. You can install them with the code below. They include:

    • NumPy - An array and numerical computing library
    • Triton Inference Server Client - Facilitates requests to the Triton Inference Server
    • Pillow - A library for image operations
    • Gevent - A networking library used when connecting to the Triton Server
    pip install numpy
    pip install tritonclient[http]
    pip install pillow
    pip install gevent
    
  • Access to NCv3-series VMs for your Azure subscription.

    Important

    You might need to request a quota increase for your subscription before you can use this series of VMs. For more information, see NCv3-series.

The information in this article is based on the online-endpoints-triton.ipynb notebook contained in the azureml-examples repository. To run the commands locally without having to copy/paste files, clone the repo, and then change directories to the sdk/python/endpoints/online/triton/single-model/ directory in the repo:

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/sdk/python/endpoints/online/triton/single-model/

Define the deployment configuration

APPLIES TO: Python SDK azure-ai-ml v2 (current)

This section shows how you can define a Triton deployment to deploy to a managed online endpoint using the Azure Machine Learning Python SDK (v2).

Important

For Triton no-code-deployment, testing via local endpoints is currently not supported.

  1. To connect to a workspace, you need identifier parameters: a subscription ID, resource group, and workspace name.

    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace_name = "<AML_WORKSPACE_NAME>"
    
  2. Use the following command to set the name of the endpoint that will be created. In this example, a random name is created for the endpoint:

    import random
    
    endpoint_name = f"endpoint-{random.randint(0, 10000)}"
    
  3. Use these details in the MLClient from azure.ai.ml to get a handle to the required Azure Machine Learning workspace. For more details on how to configure credentials and connect to a workspace, see the configuration notebook.

    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential
    
    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id,
        resource_group,
        workspace_name,
    )
    
  4. Create a ManagedOnlineEndpoint object to configure the endpoint. The following example configures the name and authentication mode of the endpoint.

    from azure.ai.ml.entities import ManagedOnlineEndpoint
    
    endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode="key")
    
  5. Create a ManagedOnlineDeployment object to configure the deployment. The following example configures a deployment named blue to the endpoint defined in the previous step and defines a local model inline.

    from azure.ai.ml.entities import ManagedOnlineDeployment, Model
    
    model_name = "densenet-onnx-model"
    model_version = 1
    
    deployment = ManagedOnlineDeployment(
        name="blue",
        endpoint_name=endpoint_name,
        model=Model(
            name=model_name, 
            version=model_version,
            path="./models",
            type="triton_model"
        ),
        instance_type="Standard_NC6s_v3",
        instance_count=1,
    )
    

Deploy to Azure

APPLIES TO: Python SDK azure-ai-ml v2 (current)

  1. To create a new endpoint using the ManagedOnlineEndpoint object, use the following command:

    # begin_create_or_update returns a poller; .result() waits for the operation
    # to finish and returns the created endpoint object
    endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    
  2. To create the deployment using the ManagedOnlineDeployment object, use the following command:

    # Wait for the deployment to finish before updating traffic
    ml_client.online_deployments.begin_create_or_update(deployment).result()
    
  3. When the deployment is created, the endpoint routes 0% of traffic to it by default. Update the traffic so that 100% of it goes to the blue deployment (an optional verification sketch follows this list):

    endpoint.traffic = {"blue": 100}
    # Submit the updated endpoint and wait for the traffic change to apply
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    
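
The following optional check isn't part of the example notebook; it's a minimal sketch that fetches the endpoint and prints its provisioning state and traffic split, so you can confirm the deployment is live before testing it.

# Optional: confirm the endpoint is provisioned and traffic is routed to "blue"
endpoint = ml_client.online_endpoints.get(name=endpoint_name)
print(f"Provisioning state: {endpoint.provisioning_state}")
print(f"Traffic: {endpoint.traffic}")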

Test the endpoint

APPLIES TO: Python SDK azure-ai-ml v2 (current)

  1. To get the endpoint scoring uri, use the following command:

    endpoint = ml_client.online_endpoints.get(endpoint_name)
    scoring_uri = endpoint.scoring_uri
    
  2. To get an authentication key, use the following command:

    keys = ml_client.online_endpoints.list_keys(endpoint_name)
    auth_key = keys.primary_key

  3. The following scoring code uses the Triton Inference Server Client to submit the image of a peacock to the endpoint. This script is available in the companion notebook to this example - Deploy a model to online endpoints using Triton.

    # Test the blue deployment with some sample data
    import requests
    import gevent.ssl
    import numpy as np
    import tritonclient.http as tritonhttpclient
    from pathlib import Path
    import prepost
    
    img_uri = "http://aka.ms/peacock-pic"
    
    # We remove the scheme from the url
    url = scoring_uri[8:]
    
    # Initialize client handler
    triton_client = tritonhttpclient.InferenceServerClient(
        url=url,
        ssl=True,
        ssl_context_factory=gevent.ssl._create_default_https_context,
    )
    
    # Create headers
    headers = {}
    headers["Authorization"] = f"Bearer {auth_key}"
    
    # Check status of triton server
    health_ctx = triton_client.is_server_ready(headers=headers)
    print("Is server ready - {}".format(health_ctx))
    
    # Check status of model
    # Use a separate variable for the Triton model repository folder name so the
    # registered model name ("densenet-onnx-model") stays available for later steps
    triton_model_name = "model_1"
    status_ctx = triton_client.is_model_ready(triton_model_name, "1", headers)
    print("Is model ready - {}".format(status_ctx))
    
    if Path(img_uri).exists():
        img_content = open(img_uri, "rb").read()
    else:
        agent = f"Python Requests/{requests.__version__} (https://github.com/Azure/azureml-examples)"
        img_content = requests.get(img_uri, headers={"User-Agent": agent}).content
    
    img_data = prepost.preprocess(img_content)
    
    # Populate inputs and outputs
    input = tritonhttpclient.InferInput("data_0", img_data.shape, "FP32")
    input.set_data_from_numpy(img_data)
    inputs = [input]
    output = tritonhttpclient.InferRequestedOutput("fc6_1")
    outputs = [output]
    
    result = triton_client.infer(triton_model_name, inputs, outputs=outputs, headers=headers)
    max_label = np.argmax(result.as_numpy("fc6_1"))
    label_name = prepost.postprocess(max_label)
    print(label_name)
    
  4. The response from the script is similar to the following text:

    Is server ready - True
    Is model ready - True
    /azureml-examples/sdk/endpoints/online/triton/single-model/densenet_labels.txt
    84 : PEACOCK
    

Delete the endpoint and model

APPLIES TO: Python SDK azure-ai-ml v2 (current)

  1. Delete the endpoint. Deleting the endpoint also deletes any child deployments; however, it doesn't archive the associated environments or models.

    ml_client.online_endpoints.begin_delete(name=endpoint_name)
    
  2. Archive the model with the following code.

    ml_client.models.archive(name=model_name, version=model_version)
    

Next steps

To learn more, review these articles: