Deploy and score a machine learning model by using an online endpoint

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you learn to deploy your model to an online endpoint for use in real-time inferencing. You begin by deploying a model on your local machine to debug any errors. Then, you deploy and test the model in Azure, view the deployment logs, and monitor the service-level agreement (SLA). By the end of this article, you'll have a scalable HTTPS/REST endpoint that you can use for real-time inference.

Online endpoints are endpoints that are used for real-time inferencing. There are two types of online endpoints: managed online endpoints and Kubernetes online endpoints. For more information on endpoints and differences between managed online endpoints and Kubernetes online endpoints, see What are Azure Machine Learning endpoints?

Managed online endpoints help to deploy your machine learning models in a turnkey manner. Managed online endpoints work with powerful CPU and GPU machines in Azure in a scalable, fully managed way. Managed online endpoints take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure.

The main example in this article uses managed online endpoints for deployment. To use Kubernetes online endpoints instead, see the notes that appear inline with the managed online endpoint discussion.

Prerequisites

Before following the steps in this article, make sure you have the following prerequisites:

  • An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

  • An Azure Machine Learning workspace and a compute instance. If you don't have these resources and want to create them, use the steps in the Quickstart: Create workspace resources article.

  • Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the owner or contributor role for the Azure Machine Learning workspace, or a custom role that allows Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more information, see Manage access to an Azure Machine Learning workspace.

  • Ensure that you have enough virtual machine (VM) quota allocated for deployment. Azure Machine Learning reserves 20% of your compute resources for performing upgrades on some VM SKUs. For example, if you request 10 instances in a deployment, you must have quota for 12 instances' worth of cores for the VM SKU. Failure to account for the extra compute resources results in an error. Some VM SKUs are exempt from the extra quota reservation. For more information on quota allocation, see virtual machine quota allocation for deployment.

  • Alternatively, you could use quota from Azure Machine Learning's shared quota pool for a limited time. Azure Machine Learning provides a shared quota pool from which users across various regions can access quota to perform testing, depending upon availability. When you use the studio to deploy Llama-2, Phi, Nemotron, Mistral, Dolly, and Deci-DeciLM models from the model catalog to a managed online endpoint, Azure Machine Learning allows you to access this shared quota pool for a short time so that you can perform testing. For more information on the shared quota pool, see Azure Machine Learning shared quota.

Prepare your system

If you have Git installed on your local machine, you can follow the instructions to clone the examples repository. Otherwise, follow the instructions to download files from the examples repository.

Clone the examples repository

To follow along with this article, first clone the examples repository (azureml-examples) and then change into the azureml-examples/cli/endpoints/online/model-1 directory.

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli/endpoints/online/model-1

Tip

Use --depth 1 to clone only the latest commit of the repository, which reduces the time needed to complete the operation.

Download files from the examples repository

If you cloned the examples repo, your local machine already has copies of the files for this example, and you can skip to the next section. If you didn't clone the repo, you can download it to your local machine.

  1. Go to https://github.com/Azure/azureml-examples/.
  2. Go to the <> Code button on the page, and then select Download ZIP from the Local tab.
  3. Locate the folder /cli/endpoints/online/model-1/model and the file /cli/endpoints/online/model-1/onlinescoring/score.py.

Define the endpoint

To define an online endpoint, specify the endpoint name and authentication mode. For more information on managed online endpoints, see Online endpoints.

Configure an endpoint

When you deploy to Azure from the studio, you'll create an endpoint and a deployment to add to it. At that time, you'll be prompted to provide names for the endpoint and deployment.
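If you prefer to define the endpoint in code, here's a minimal sketch using the Python SDK (azure-ai-ml v2). The subscription, resource group, workspace, and endpoint names are placeholders; endpoint names must be unique within an Azure region.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

# Connect to the workspace. Replace the placeholders with your own values.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# "my-endpoint" is a placeholder name; it must be unique in the Azure region.
endpoint = ManagedOnlineEndpoint(
    name="my-endpoint",
    description="Endpoint for the scikit-learn regression example",
    auth_mode="key",
)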

Define the deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. For this example, you deploy a scikit-learn model that does regression and use a scoring script, score.py, to execute the model on a given input request.

To learn about the key attributes of a deployment, see Online deployments.

Configure a deployment

Your deployment configuration uses the location of the model that you wish to deploy.

When you deploy to Azure, you'll create an endpoint and a deployment to add to it. At that time, you'll be prompted to provide names for the endpoint and deployment.
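For reference, here's a sketch of a deployment configuration with the Python SDK, using local files. The paths assume your working directory is azureml-examples/cli/endpoints/online/model-1, and the base image is the one commonly used with this example in the azureml-examples repo.

from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    ManagedOnlineDeployment,
    Model,
)

# Paths are relative to azureml-examples/cli/endpoints/online/model-1.
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",  # the placeholder endpoint from the earlier sketch
    model=Model(path="model"),  # local folder that holds the model file
    environment=Environment(
        conda_file="environment/conda.yaml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    ),
    code_configuration=CodeConfiguration(
        code="onlinescoring", scoring_script="score.py"
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)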

Understand the scoring script

Tip

The format of the scoring script for online endpoints is the same format that's used in the preceding version of the CLI and in the Python SDK.

The scoring script must have an init() function and a run() function.

This example uses the score.py file:

import os
import logging
import json
import numpy
import joblib


def init():
    """
    This function is called when the container is initialized/started, typically after create/update of the deployment.
    You can write the logic here to perform init operations like caching the model in memory
    """
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # Please provide your model's folder name if there is one
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to perform the actual scoring/prediction.
    In the example we extract the data from the json input and call the scikit-learn model's predict()
    method and return the result back
    """
    logging.info("model 1: request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()

The init() function is called when the container is initialized or started. Initialization typically occurs shortly after the deployment is created or updated. The init function is the place to write logic for global initialization operations like caching the model in memory (as shown in this score.py file).

The run() function is called every time the endpoint is invoked, and it does the actual scoring and prediction. In this score.py file, the run() function extracts data from a JSON input, calls the scikit-learn model's predict() method, and then returns the prediction result.
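For illustration, here's the shape of the payload that run() expects. The two rows of ten values mirror the sample request that ships with this example; the assumption is that the example regression model takes ten numeric features per observation.

import json

# Each inner list is one observation for the model to score.
sample_request = {
    "data": [
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    ]
}
raw_data = json.dumps(sample_request)  # the string the endpoint passes to run()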

Deploy and debug locally by using a local endpoint

We highly recommend that you test-run your endpoint locally to validate and debug your code and configuration before you deploy to Azure. The Azure CLI and the Python SDK support local endpoints and deployments; Azure Machine Learning studio and ARM templates don't.

To deploy locally, Docker Engine must be installed and running. Docker Engine typically starts when the computer starts. If it doesn't, you can troubleshoot Docker Engine.

Tip

You can use the Azure Machine Learning inference HTTP server Python package to debug your scoring script locally without Docker Engine. Debugging with the inference server helps you validate the scoring script before you deploy to local endpoints, so you can debug without being affected by the deployment container configuration.

For more information on debugging online endpoints locally before deploying to Azure, see Online endpoint debugging.

Deploy the model locally

First create an endpoint. Optionally, for a local endpoint, you can skip this step and directly create the deployment (next step), which will, in turn, create the required metadata. Deploying models locally is useful for development and testing purposes.

The studio doesn't support local endpoints. See the Azure CLI or Python tabs for steps to test the endpoint locally.
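For reference, the Python SDK creates a local endpoint by passing local=True. This sketch reuses the ml_client and endpoint objects defined earlier and requires Docker Engine to be running.

# local=True targets your local Docker Engine instead of Azure.
ml_client.online_endpoints.begin_create_or_update(endpoint, local=True)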

Now, create a deployment named blue under the endpoint.

The studio doesn't support local endpoints. See the Azure CLI or Python tabs for steps to test the endpoint locally.
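For reference, a matching Python SDK sketch that creates the blue deployment locally, reusing the blue_deployment object defined earlier:

# Builds a local Docker image and starts a container that serves the model.
ml_client.online_deployments.begin_create_or_update(
    deployment=blue_deployment, local=True
)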

Tip

Use Visual Studio Code to test and debug your endpoints locally. For more information, see debug online endpoints locally in Visual Studio Code.

Verify that the local deployment succeeded

Check the deployment status to see whether the model was deployed without error:

The studio doesn't support local endpoints. See the Azure CLI or Python tabs for steps to test the endpoint locally.
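For reference, a Python SDK sketch that fetches the local endpoint and prints its state:

# Returns the local endpoint, including provisioning_state and scoring_uri.
local_endpoint = ml_client.online_endpoints.get(name="my-endpoint", local=True)
print(local_endpoint.provisioning_state)
print(local_endpoint.scoring_uri)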

The following list contains the possible values for provisioning_state:

  • Creating: The resource is being created.
  • Updating: The resource is being updated.
  • Deleting: The resource is being deleted.
  • Succeeded: The create/update operation succeeded.
  • Failed: The create/update/delete operation failed.

Invoke the local endpoint to score data by using your model

The studio doesn't support local endpoints. See the Azure CLI or Python tabs for steps to test the endpoint locally.
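For reference, a Python SDK sketch that invokes the local endpoint. The sample-request.json file is the sample input that ships in the model-1 example directory.

ml_client.online_endpoints.invoke(
    endpoint_name="my-endpoint",
    request_file="sample-request.json",
    local=True,
)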

Review the logs for output from the invoke operation

In the example score.py file, the run() method logs some output to the console.

The studio doesn't support local endpoints. See the Azure CLI or Python tabs for steps to test the endpoint locally.
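For reference, a Python SDK sketch that retrieves the local deployment's logs, where the logging.info messages from score.py appear:

print(
    ml_client.online_deployments.get_logs(
        name="blue", endpoint_name="my-endpoint", local=True, lines=50
    )
)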

Deploy your online endpoint to Azure

Next, deploy your online endpoint to Azure. As a best practice for production, we recommend that you register the model and environment that you'll use in your deployment.

Register your model and environment

We recommend that you register your model and environment before deployment to Azure so that you can specify their registered names and versions during deployment. Registering your assets allows you to reuse them without the need to upload them every time you create deployments, thereby increasing reproducibility and traceability.

Note

Unlike deployment to Azure, local deployment doesn't support using registered models and environments. Rather, local deployment uses local model files and uses environments with local files only. For deployment to Azure, you can use either local or registered assets (models and environments). In this section of the article, the deployment to Azure uses registered assets, but you have the option of using local assets instead. For an example of a deployment configuration that uploads local files to use for local deployment, see Configure a deployment.

Register the model

A model registration is a logical entity in the workspace that can contain a single model file or a directory of multiple files. As a best practice for production, you should register the model and environment. Before creating the endpoint and deployment in this article, you should register the model folder that contains the model.

To register the example model, follow these steps:

  1. Go to the Azure Machine Learning studio.

  2. In the left navigation bar, select the Models page.

  3. Select Register, and then choose From local files.

  4. Select Unspecified type for the Model type.

  5. Select Browse, and choose Browse folder.

    A screenshot of the browse folder option.

  6. Select the \azureml-examples\cli\endpoints\online\model-1\model folder from the local copy of the repo you cloned or downloaded earlier. When prompted, select Upload and wait for the upload to complete.

  7. Select Next after the folder upload is completed.

  8. Enter a friendly Name for the model. The steps in this article assume the model is named model-1.

  9. Select Next, and then Register to complete registration.

For more information on working with registered models, see Register and work with models.
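The studio steps above register the model through the UI. For reference, here's a sketch of the equivalent registration with the Python SDK; the path assumes your working directory is azureml-examples/cli/endpoints/online/model-1.

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model

model = Model(
    path="model",  # local folder containing sklearn_regression_model.pkl
    type=AssetTypes.CUSTOM_MODEL,
    name="model-1",
    description="Scikit-learn regression model for the online endpoint example",
)
ml_client.models.create_or_update(model)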

Create and register the environment

  1. In the left navigation bar, select the Environments page.

  2. Select Create.

  3. On the "Settings" page, provide a name, such as my-env for the environment.

  4. For "Select environment source" choose Use existing docker image with optional conda source.

    A screenshot showing how to create a custom environment.

  5. Select Next to go to the "Customize" page.

  6. Copy the contents of the \azureml-examples\cli\endpoints\online\model-1\environment\conda.yaml file from the local copy of the repo you cloned or downloaded earlier.

  7. Paste the contents into the text box.

    A screenshot showing how to customize the environment, using a conda file.

  8. Select Next until you get to the "Review" page.

  9. Select Create.

For more information on creating an environment in the studio, see Create an environment.
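For reference, a sketch of the equivalent environment registration with the Python SDK. The conda file path assumes the model-1 example directory, and the base image is the one commonly used with this example.

from azure.ai.ml.entities import Environment

env = Environment(
    name="my-env",
    conda_file="environment/conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
ml_client.environments.create_or_update(env)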

Configure a deployment that uses registered assets

Your deployment configuration uses the registered model that you wish to deploy and your registered environment.

When you deploy from the studio, you'll create an endpoint and a deployment to add to it. At that time, you'll be prompted to provide names for the endpoint and deployment.
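For reference, a Python SDK sketch of a deployment configuration that references the registered assets by name and version. Version 1 is an assumption; use the versions that the studio shows for your assets.

from azure.ai.ml.entities import CodeConfiguration, ManagedOnlineDeployment

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",  # placeholder endpoint name
    model="azureml:model-1:1",  # registered model as name:version
    environment="azureml:my-env:1",  # registered environment as name:version
    code_configuration=CodeConfiguration(
        code="onlinescoring", scoring_script="score.py"
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)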

Use different CPU and GPU instance types and images

When using the studio to deploy to Azure, you'll be prompted to specify the compute properties (instance type and instance count) and environment to use for your deployment.

For supported general-purpose and GPU instance types, see Managed online endpoints supported VM SKUs. For more information on environments, see Manage software environments in Azure Machine Learning studio.

Next, deploy your online endpoint to Azure.

Deploy to Azure

Create a managed online endpoint and deployment

Use the studio to create a managed online endpoint directly in your browser. When you create a managed online endpoint in the studio, you must define an initial deployment. You can't create an empty managed online endpoint.

One way to create a managed online endpoint in the studio is from the Models page. This method also provides an easy way to add a model to an existing managed online deployment. To deploy the model named model-1 that you registered previously in the Register your model and environment section:

  1. Go to the Azure Machine Learning studio.

  2. In the left navigation bar, select the Models page.

  3. Select the model named model-1 by checking the circle next to its name.

  4. Select Deploy > Real-time endpoint.

    A screenshot of creating a managed online endpoint from the Models UI.

    This action opens up a window where you can specify details about your endpoint.

    A screenshot of a managed online endpoint create wizard.

  5. Enter an Endpoint name that's unique in the Azure region. For more information on the naming rules, see endpoint limits.

  6. Keep the default selection: Managed for the compute type.

  7. Keep the default selection: key-based authentication for the authentication type. For more information on authenticating, see Authenticate clients for online endpoints.

  8. Select Next until you get to the "Deployment" page. Here, toggle Application Insights diagnostics to Enabled so that you can later view graphs of your endpoint's activities in the studio and analyze metrics and logs using Application Insights.

  9. Select Next to go to the "Code + environment" page. Here, select the following options:

    • Select a scoring script for inferencing: Browse and select the \azureml-examples\cli\endpoints\online\model-1\onlinescoring\score.py file from the repo you cloned or downloaded earlier.
    • Select environment section: Select Custom environments and then select the my-env:1 environment that you created earlier.

    A screenshot showing selection of a custom environment for deployment.

  10. Select Next, accepting defaults, until you're prompted to create the deployment.

  11. Review your deployment settings and select the Create button.

Alternatively, you can create a managed online endpoint from the Endpoints page in the studio.

  1. Go to the Azure Machine Learning studio.

  2. In the left navigation bar, select the Endpoints page.

  3. Select + Create.

    A screenshot for creating managed online endpoint from the Endpoints tab.

This action opens up a window for you to select your model and specify details about your endpoint and deployment. Enter settings for your endpoint and deployment as described previously, then Create the deployment.
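If you prefer code to the studio wizard, here's a Python SDK sketch that creates the endpoint and deployment in Azure, reusing the endpoint and blue_deployment objects defined earlier. Without local=True, the operations target Azure.

# .result() blocks until each long-running operation completes.
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

# Route 100% of the endpoint's traffic to the blue deployment.
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()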

To debug errors in your deployment, see Troubleshooting online endpoint deployments.

Check the status of the endpoint

View managed online endpoints

You can view all your managed online endpoints in the Endpoints page. Go to the endpoint's Details page to find critical information including the endpoint URI, status, testing tools, activity monitors, deployment logs, and sample consumption code:

  1. In the left navigation bar, select Endpoints. Here, you can see a list of all the endpoints in the workspace.

  2. (Optional) Create a Filter on Compute type to show only Managed compute types.

  3. Select an endpoint name to view the endpoint's Details page.

    Screenshot of managed endpoint details view.

Check the status of the online deployment

Check the logs to see whether the model was deployed without error.

To view log output, select the Logs tab from the endpoint's page. If you have multiple deployments in your endpoint, use the dropdown to select the deployment whose log you want to see.

A screenshot of observing deployment logs in the studio.

By default, logs are pulled from the inference server. To see logs from the storage initializer container, use the Azure CLI or Python SDK (see each tab for details). Logs from the storage initializer container provide information on whether code and model data were successfully downloaded to the container. For more information on deployment logs, see Get container logs.
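For reference, a Python SDK sketch that pulls both kinds of logs; the container_type parameter selects the storage initializer container:

# Inference server logs (the default).
print(
    ml_client.online_deployments.get_logs(
        name="blue", endpoint_name="my-endpoint", lines=100
    )
)

# Storage initializer logs: whether code and model data downloaded successfully.
print(
    ml_client.online_deployments.get_logs(
        name="blue",
        endpoint_name="my-endpoint",
        lines=100,
        container_type="storage-initializer",
    )
)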

Invoke the endpoint to score data by using your model

Use the Test tab in the endpoint's details page to test your managed online deployment. Enter sample input and view the results.

  1. Select the Test tab in the endpoint's detail page.

  2. Use the dropdown to select the deployment you want to test.

  3. Enter the sample input.

  4. Select Test.

    A screenshot of testing a deployment by providing sample data, directly in your browser.
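Outside the studio, you can score against the deployed endpoint with a Python SDK sketch like the following, using the sample request file from the example directory:

response = ml_client.online_endpoints.invoke(
    endpoint_name="my-endpoint",
    deployment_name="blue",  # target a specific deployment (optional)
    request_file="sample-request.json",
)
print(response)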

(Optional) Update the deployment

Currently, the studio allows you to make updates only to the instance count of a deployment. Use the following instructions to scale an individual deployment up or down by adjusting the number of instances:

  1. Open the endpoint's Details page and find the card for the deployment you want to update.
  2. Select the edit icon (pencil icon) next to the deployment's name.
  3. Update the instance count associated with the deployment. You can choose between Default and Target Utilization for the "Deployment scale type".
    • If you select Default, you can also specify a numerical value for the Instance count.
    • If you select Target Utilization, you can specify values to use for parameters when autoscaling the deployment.
  4. Select Update to finish updating the instance counts for your deployment.
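For reference, a Python SDK sketch of the same scale operation: fetch the deployment, change instance_count, and apply the update.

deployment = ml_client.online_deployments.get(
    name="blue", endpoint_name="my-endpoint"
)
deployment.instance_count = 2  # illustrative target instance count
ml_client.online_deployments.begin_create_or_update(deployment).result()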

Note

The update to the deployment in this section is an example of an in-place rolling update.

  • For a managed online endpoint, the deployment is updated to the new configuration with 20% nodes at a time. That is, if the deployment has 10 nodes, 2 nodes at a time are updated.
  • For a Kubernetes online endpoint, the system iteratively creates a new deployment instance with the new configuration and deletes the old one.
  • For production usage, you should consider blue-green deployment, which offers a safer alternative for updating a web service.

(Optional) Configure autoscaling

Autoscale automatically runs the right amount of resources to handle the load on your application. Managed online endpoints support autoscaling through integration with the Azure monitor autoscale feature. To configure autoscaling, see How to autoscale online endpoints.

(Optional) Monitor SLA by using Azure Monitor

To view metrics and set alerts based on your SLA, complete the steps that are described in Monitor online endpoints.

(Optional) Integrate with Log Analytics

The get-logs command for CLI or the get_logs method for SDK provides only the last few hundred lines of logs from an automatically selected instance. However, Log Analytics provides a way to durably store and analyze logs. For more information on using logging, see Monitor online endpoints.

Delete the endpoint and the deployment

If you aren't going to use the endpoint and deployment, you should delete them. By deleting the endpoint, you also delete all its underlying deployments.

  1. Go to the Azure Machine Learning studio.
  2. In the left navigation bar, select the Endpoints page.
  3. Select an endpoint by checking the circle next to the endpoint name.
  4. Select Delete.

Alternatively, you can delete a managed online endpoint directly by selecting the Delete icon in the endpoint details page.
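For reference, a Python SDK sketch of the delete operation:

# Deleting the endpoint also deletes all of its deployments.
ml_client.online_endpoints.begin_delete(name="my-endpoint").result()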