APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)
After you train machine learning models or pipelines, or find suitable models from the model catalog, you need to deploy them to production for others to use for inference. Inference is the process of applying new input data to a machine learning model or pipeline to generate outputs. While these outputs are typically called "predictions," inference can generate outputs for other machine learning tasks, such as classification and clustering. In Azure Machine Learning, you perform inference by using endpoints.
Endpoints and deployments
An endpoint is a stable and durable URL that can be used to request or invoke a model. You provide the required inputs to the endpoint and receive the outputs. Azure Machine Learning supports standard deployments, online endpoints, and batch endpoints. An endpoint provides:
- A stable and durable URL (such as endpoint-name.region.inference.studio.ml.azure.cn)
- An authentication mechanism
- An authorization mechanism
A deployment is a set of resources and compute required to host the model or component that performs the actual inference. An endpoint contains a deployment. For online and batch endpoints, one endpoint can contain several deployments. The deployments can host independent assets and consume different resources based on the needs of the assets. An endpoint also has a routing mechanism that can direct requests to any of its deployments.
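To make the relationship concrete, the following sketch uses the Python SDK (azure-ai-ml v2) to create an online endpoint and then add a deployment under it. All names, the model path, and the VM size are hypothetical placeholders, and the model is assumed to be in MLflow format so that no scoring script is needed.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model

# Connect to the workspace (placeholder identifiers).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE>",
)

# The endpoint: a stable URL plus an authentication mechanism. It hosts deployments.
endpoint = ManagedOnlineEndpoint(name="car-classifier", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# A deployment: the model plus the compute that performs the actual inference.
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="car-classifier",
    model=Model(path="./model"),  # assumed MLflow model folder (no scoring script required)
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```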
Some types of endpoints in Azure Machine Learning consume dedicated resources on their deployments. For these endpoints to run, you must have compute quota on your Azure subscription. However, certain models support a serverless deployment, which allows them to consume no quota from your subscription. For serverless deployments, you're billed based on usage.
Intuition
Suppose you're working on an application that predicts the type and color of a car from a photo. For this application, a user with certain credentials makes an HTTP request to a URL and provides a picture of a car as part of the request. In return, the user receives a response that includes the type and color of the car as string values. In this scenario, the URL serves as an endpoint.
Now suppose that a data scientist, Alice, is implementing the application. Alice has extensive TensorFlow experience and decides to implement the model using a Keras sequential classifier with a ResNet architecture from the TensorFlow Hub. After testing the model, Alice is satisfied with its results and decides to use the model to solve the car prediction problem. The model is large and requires 8 GB of memory with 4 cores to run. In this scenario, Alice's model and the resources—such as the code and the compute—that are required to run the model make up a deployment under the endpoint.
After a few months, the organization discovers that the application performs poorly on images with poor lighting conditions. Bob, another data scientist, has expertise in data augmentation techniques that help models build robustness for this factor. However, Bob prefers using PyTorch to implement the model and trains a new model with PyTorch. Bob wants to test this model in production gradually until the organization is ready to retire the old model. The new model also performs better when deployed to GPU, so the deployment needs to include a GPU. In this scenario, Bob's model and the resources—such as the code and the compute—that are required to run the model make up another deployment under the same endpoint.
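In endpoint terms, Bob's gradual rollout is a traffic split: both deployments live under the same endpoint, and the endpoint's routing mechanism decides how much traffic each one receives. A minimal sketch, continuing the hypothetical names from the scenario and assuming both deployments already exist:

```python
# Send 90% of requests to Alice's TensorFlow deployment and 10% to Bob's PyTorch one.
endpoint = ml_client.online_endpoints.get("car-classifier")
endpoint.traffic = {"alice-tf": 90, "bob-pt": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```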
Endpoints: online and batch
Online endpoints are designed for real-time inference. When you invoke an online endpoint, the results are returned in the endpoint's response. Serverless API endpoints don't consume quota from your subscription; instead, they're billed on a pay-as-you-go basis.
Batch endpoints are designed for long-running batch inference. When you invoke a batch endpoint, you generate a batch job that performs the actual work.
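The difference is visible at invocation time. In the sketch below (Python SDK, with hypothetical endpoint names, request file, and input path), the online call returns predictions in the response, while the batch call returns a job that performs the work asynchronously:

```python
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Online endpoint: synchronous; the response contains the predictions.
response = ml_client.online_endpoints.invoke(
    endpoint_name="car-classifier",
    request_file="sample-request.json",
)

# Batch endpoint: asynchronous; invoking it creates a batch job.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="car-classifier-batch",
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/car-images/",
    ),
)
ml_client.jobs.stream(job.name)  # optionally wait for the job to finish
```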
When to use online and batch endpoints
Online endpoints:
Use online endpoints to operationalize models for real-time inference in synchronous low-latency requests. We recommend using them when:
- You have low-latency requirements.
- Your model can answer the request in a relatively short amount of time.
- Your model's inputs fit in the HTTP payload of the request.
- You need to scale to handle a large number of requests.
Batch endpoints:
Use batch endpoints to operationalize models or pipelines for long-running asynchronous inference. We recommend using them when:
- You have expensive models or pipelines that require a longer time to run.
- You want to operationalize machine learning pipelines and reuse components.
- You need to perform inference over large amounts of data that are distributed in multiple files.
- You don't have low latency requirements.
- Your model's inputs are stored in a storage account or in an Azure Machine Learning data asset.
- You can take advantage of parallelization.
Comparison of online and batch endpoints
Both online and batch endpoints are based on the idea of endpoints, so you can transition easily from one to the other. Online and batch endpoints can also manage multiple deployments for the same endpoint.
Endpoints
The following table shows a summary of the different features available to online and batch endpoints.
| Feature | Online endpoints | Batch endpoints |
| --- | --- | --- |
| Stable invocation URL | Yes | Yes |
| Support for multiple deployments | Yes | Yes |
| Deployment routing | Traffic split | Switch to default |
| Mirror traffic for safe rollout | Yes | No |
| Swagger support | Yes | No |
| Authentication | Key and Microsoft Entra ID (preview) | Microsoft Entra ID |
| Private network support (legacy) | Yes | Yes |
| Managed network isolation | Yes | Yes (requires additional configuration) |
| Customer-managed keys | Yes | Yes |
| Cost basis | None | None |
Deployments
The following table shows a summary of the different features available to online and batch endpoints at the deployment level. These concepts apply to each deployment under the endpoint.
| Feature | Online endpoints | Batch endpoints |
| --- | --- | --- |
| Deployment types | Models | Models and pipeline components |
| MLflow model deployment | Yes | Yes |
| Custom model deployment | Yes, with scoring script | Yes, with scoring script |
| Inference server² | Azure Machine Learning Inferencing Server, Triton, Custom (using BYOC) | Batch inference |
| Compute resource consumed | Instances or granular resources | Cluster instances |
| Compute type | Managed compute and Kubernetes | Managed compute and Kubernetes |
| Low-priority compute | No | Yes |
| Scaling compute to zero | No | Yes |
| Autoscaling compute³ | Yes, based on resource use | Yes, based on job count |
| Overcapacity management | Throttling | Queuing |
| Cost basis⁴ | Per deployment: compute instances running | Per job: compute instances consumed by the job (capped at the maximum number of instances in the cluster) |
| Local testing of deployments | Yes | No |
² Inference server refers to the serving technology that accepts requests, processes them, and returns responses. The inference server also dictates the format of the inputs and the expected outputs.
³ Autoscaling is the ability to dynamically scale the deployment's allocated resources up or down based on its load. Online and batch deployments use different autoscaling strategies: online deployments scale up and down based on resource utilization (such as CPU, memory, and requests), while batch endpoints scale up or down based on the number of jobs created.
⁴ Both online and batch deployments charge for the resources consumed. In online deployments, resources are provisioned at deployment time. In batch deployments, resources aren't consumed at deployment time but when the job runs; hence, there's no cost associated with the batch deployment itself. Likewise, queued jobs don't consume resources.
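As an illustration of the last three table rows, a batch deployment's scale-to-zero and job-count autoscaling behavior follows from the settings of the compute cluster it targets. A sketch with a hypothetical cluster name and VM size:

```python
from azure.ai.ml.entities import AmlCompute

# A cluster that scales to zero when idle (an idle batch deployment then costs nothing)
# and grows up to four instances as batch jobs are queued.
cluster = AmlCompute(
    name="batch-cluster",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120,  # seconds of idleness before scaling down
)
ml_client.compute.begin_create_or_update(cluster).result()
```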
Developer interfaces
Endpoints are designed to help organizations operationalize production-level workloads in Azure Machine Learning. Endpoints are robust and scalable resources, and they provide the best capabilities to implement MLOps workflows.
You can create and manage batch and online endpoints with several developer tools:
- Azure CLI and Python SDK
- Azure Resource Manager/REST API
- Azure Machine Learning studio web portal
- Azure portal (IT/Admin)
- Support for CI/CD and MLOps pipelines using the Azure CLI and REST/ARM interfaces