Deploy language models in batch endpoints

2025-07-31

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

Batch Endpoints can be used to deploy expensive models, like language models, over text data. In this tutorial, you learn how to deploy a model that can perform text summarization of long sequences of text using a model from HuggingFace. It also shows how to do inference optimization using HuggingFace optimum and accelerate libraries.

About this sample

The model we are going to work with was built using the popular library transformers. It was introduced in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation. This model has the following constraints which are important to keep in mind for deployment:

It can work with sequences up to 1,024 tokens.
It's trained for summarization of text in English.
We're going to use Torch as a backend.
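
The 1,024-token limit means that longer inputs must be either truncated or split before they reach the model. The scoring script later in this article truncates. As an alternative, the following is a minimal sketch of splitting a list of token IDs into overlapping windows so that each window can be summarized separately. The helper name `split_into_windows` and the overlap value are illustrative assumptions and aren't part of the sample repository.

```python
from typing import List


def split_into_windows(
    token_ids: List[int], max_tokens: int = 1024, overlap: int = 64
) -> List[List[int]]:
    """Split token IDs into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens so that a sentence cut at a
    window boundary still appears whole in one of the two windows.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    step = max_tokens - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_tokens])
        # Stop once the current window already reaches the end of the input.
        if start + max_tokens >= len(token_ids):
            break
    return windows


# Hypothetical input: 2,500 token IDs standing in for a long document.
windows = split_into_windows(list(range(2500)))
print([len(w) for w in windows])
```

If the tokenizer adds special tokens to each sequence, a slightly smaller `max_tokens` leaves room for them.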

The example in this article is based on code samples contained in the azureml-examples repository. To run the commands locally without having to copy or paste YAML and other files, use the following commands to clone the repository and go to the folder for your coding language:

Azure CLI
Python

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/cli

git clone https://github.com/Azure/azureml-examples --depth 1
cd azureml-examples/sdk/python

The files for this example are in:

cd endpoints/batch/deploy-models/huggingface-text-summarization

Follow along in Jupyter Notebooks

You can follow along with this sample in a Jupyter Notebook. In the cloned repository, open the notebook: text-summarization-batch.ipynb.

Prerequisites

An Azure subscription. If you don't have an Azure subscription, create a Trial before you begin.
An Azure Machine Learning workspace. To create a workspace, see Manage Azure Machine Learning workspaces.
The following permissions in the Azure Machine Learning workspace:
- For creating or managing batch endpoints and deployments: Use an Owner, Contributor, or custom role that has been assigned the Microsoft.MachineLearningServices/workspaces/batchEndpoints/* permissions.
- For creating Azure Resource Manager deployments in the workspace resource group: Use an Owner, Contributor, or custom role that has been assigned the Microsoft.Resources/deployments/write permission in the resource group where the workspace is deployed.
The Azure Machine Learning CLI or the Azure Machine Learning SDK for Python:
- Azure CLI
- Python
Run the following command to install the Azure CLI and the ml extension for Azure Machine Learning:
```
az extension add -n ml
```
Pipeline component deployments for batch endpoints are introduced in version 2.7 of the ml extension for the Azure CLI. Use the az extension update --name ml command to get the latest version.
Run the following command to install the Azure Machine Learning SDK for Python:
```
pip install azure-ai-ml
```
The ModelBatchDeployment and PipelineComponentBatchDeployment classes are introduced in version 1.7.0 of the SDK. Use the pip install -U azure-ai-ml command to get the latest version.

Connect to your workspace

The workspace is the top-level resource for Azure Machine Learning. It provides a centralized place to work with all artifacts you create when you use Azure Machine Learning. In this section, you connect to the workspace where you perform your deployment tasks.

Azure CLI
Python

In the following command, enter your subscription ID, workspace name, resource group name, and location:

az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>

Import the required libraries:

from azure.ai.ml import MLClient, Input, load_component
from azure.ai.ml.entities import BatchEndpoint, BatchDeployment, ModelBatchDeployment, ModelBatchDeploymentSettings, PipelineComponentBatchDeployment, Model, AmlCompute, Data, BatchRetrySettings, CodeConfiguration, Environment
from azure.ai.ml.constants import AssetTypes, BatchDeploymentOutputAction
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

Configure the workspace details and get a handle to the workspace:

In the following command, enter your subscription ID, resource group name, and workspace name:

subscription_id = "<subscription>"
resource_group = "<resource-group>"
workspace = "<workspace>"

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

Registering the model

Due to the size of the model, it isn't included in this repository. Instead, you can download a copy from the HuggingFace model hub. You need the packages transformers and torch installed in the environment you're using.

%pip install transformers torch

Use the following code to download the model to a folder model:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

model_local_path = 'model'
summarizer.save_pretrained(model_local_path)

We can now register this model in the Azure Machine Learning registry:

Azure CLI
Python

MODEL_NAME='bart-text-summarization'
az ml model create --name $MODEL_NAME --path "model"

model_name = 'bart-text-summarization'
model = ml_client.models.create_or_update(
    Model(name=model_name, path='model', type=AssetTypes.CUSTOM_MODEL)
)

Creating the endpoint

We're going to create a batch endpoint named text-summarization-batch and deploy the HuggingFace model to it to run text summarization on text files in English.

Decide on the name of the endpoint. The name of the endpoint ends up in the URI associated with your endpoint. Because of that, batch endpoint names need to be unique within an Azure region. For example, there can be only one batch endpoint with the name mybatchendpoint in westus2.
- Azure CLI
- Python
In this case, let's place the name of the endpoint in a variable so we can easily reference it later.
```
ENDPOINT_NAME="text-summarization-batch"
```
In this case, let's place the name of the endpoint in a variable so we can easily reference it later.
```
endpoint_name="text-summarization-batch"
```
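Because the name has to be unique within the region, a fixed name can collide with an endpoint that already exists. One common workaround is to append a short random suffix. The following is a minimal sketch of that idea. The helper `unique_endpoint_name` and the suffix length are illustrative assumptions.

```python
import random
import string


def unique_endpoint_name(
    prefix: str = "text-summarization-batch", suffix_length: int = 5
) -> str:
    """Append a random lowercase alphanumeric suffix to the prefix."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(alphabet, k=suffix_length))
    return f"{prefix}-{suffix}"


endpoint_name = unique_endpoint_name()
print(endpoint_name)
```

If you use a generated name, use the same value wherever the rest of this article refers to the endpoint name.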

Configure your batch endpoint

Azure CLI
Python

The following YAML file defines a batch endpoint:

endpoint.yml

 $schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
 name: text-summarization-batch
 description: A batch endpoint for summarizing text using a HuggingFace transformer model.
 auth_mode: aad_token

endpoint = BatchEndpoint(
    name=endpoint_name,
    description="A batch endpoint for summarizing text using a HuggingFace transformer model.",
)

Create the endpoint:

Azure CLI
Python

az ml batch-endpoint create --file endpoint.yml --name $ENDPOINT_NAME

ml_client.batch_endpoints.begin_create_or_update(endpoint)

Creating the deployment

Let's create the deployment that hosts the model:

We need to create a scoring script that can read the CSV files provided by the batch deployment and return the summaries produced by the model. The following script performs these actions:
- Defines an init function that detects the hardware configuration (CPU vs GPU) and loads the model accordingly. Both the model and the tokenizer are loaded in global variables. We don't use a pipeline object from HuggingFace, to account for the limitation in the sequence lengths of the model we're using.
- Notice that we perform model optimizations to improve performance using the optimum and accelerate libraries. If the model or hardware doesn't support them, the deployment runs without such optimizations.
- Defines a run function that is executed for each mini-batch the batch deployment provides.
- The run function reads the entire batch using the datasets library. The text we need to summarize is in the column text.
- The run function iterates over each row of text and runs the prediction. Since this is a very expensive model, running the prediction over entire files results in an out-of-memory exception. Notice that the model isn't executed with the pipeline object from transformers. This is done to account for long sequences of text and the limitation of 1,024 tokens in the underlying model.
- It returns the summary of the provided text.
code/batch_driver.py

import os
import time
import torch
import subprocess
import mlflow
from pprint import pprint
from transformers import AutoTokenizer, BartForConditionalGeneration
from optimum.bettertransformer import BetterTransformer
from datasets import load_dataset


def init():
    global model
    global tokenizer
    global device

    cuda_available = torch.cuda.is_available()
    device = "cuda" if cuda_available else "cpu"

    if cuda_available:
        print(f"[INFO] CUDA version: {torch.version.cuda}")
        print(f"[INFO] ID of current CUDA device: {torch.cuda.current_device()}")
        print("[INFO] nvidia-smi output:")
        pprint(
            subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE).stdout.decode(
                "utf-8"
            )
        )
    else:
        print(
            "[WARN] CUDA acceleration is not available. This model takes hours to run on medium size data."
        )

    # AZUREML_MODEL_DIR is an environment variable created during deployment
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")

    # load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, truncation=True, max_length=1024
    )

    # Load the model
    try:
        model = BartForConditionalGeneration.from_pretrained(
            model_path, device_map="auto"
        )
    except Exception as e:
        print(
            f"[ERROR] Error happened when loading the model on GPU or the default device. Error: {e}"
        )
        print("[INFO] Trying on CPU.")
        model = BartForConditionalGeneration.from_pretrained(model_path)
        device = "cpu"

    # Optimize the model
    if device != "cpu":
        try:
            model = BetterTransformer.transform(model, keep_original_model=False)
            print("[INFO] BetterTransformer loaded.")
        except Exception as e:
            print(
                f"[ERROR] Error when converting to BetterTransformer. An unoptimized version of the model will be used.\n\t> {e}"
            )

    mlflow.log_param("device", device)
    mlflow.log_param("model", type(model).__name__)


def run(mini_batch):
    resultList = []

    print(f"[INFO] Reading new mini-batch of {len(mini_batch)} file(s).")
    ds = load_dataset("csv", data_files={"score": mini_batch})

    start_time = time.perf_counter()
    for idx, text in enumerate(ds["score"]["text"]):
        # perform inference
        inputs = tokenizer.batch_encode_plus(
            [text], truncation=True, padding=True, max_length=1024, return_tensors="pt"
        )
        input_ids = inputs["input_ids"].to(device)
        summary_ids = model.generate(
            input_ids, max_length=130, min_length=30, do_sample=False
        )
        summaries = tokenizer.batch_decode(
            summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )

        # Get results:
        resultList.append(summaries[0])
        # idx is zero-based, so idx + 1 rows are done; the 0.1 s offset avoids division by zero
        rps = (idx + 1) / (time.perf_counter() - start_time + 0.1)
        print("Rows per second:", rps)

    mlflow.log_metric("rows_per_second", rps)
    return resultList

Tip

Although files are provided in mini-batches by the deployment, this scoring script processes one row at a time. This is a common pattern when dealing with expensive models (like transformers), because trying to load the entire batch and send it to the model at once can result in high-memory pressure on the batch executor (OOM exceptions).
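
The init/run contract described above can be exercised locally before deploying. The following is a minimal sketch that keeps the same structure as the scoring script but swaps the BART model for a trivial stand-in and reads the CSV with the standard library instead of the datasets library. The function `fake_summarize` and the temporary sample file are hypothetical and exist only for this illustration.

```python
import csv
import os
import tempfile

summarize = None  # set by init(), mirroring the global model in the real script


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for the model: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def init():
    # The real script loads the tokenizer and the model here, once per worker.
    global summarize
    summarize = fake_summarize


def run(mini_batch):
    # mini_batch is a list of file paths; return one result per input row.
    results = []
    for file_path in mini_batch:
        with open(file_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):  # rows are read and scored one at a time
                results.append(summarize(row["text"]))
    return results


# Local smoke test with a temporary CSV file that has a `text` column.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerow(["The bill amends the tax code. It also adds reporting rules."])
        writer.writerow(["This act creates a grant program. Funding lasts five years."])
    init()
    local_results = run([path])

print(local_results)
```

The sketch returns one summary per row, which is the shape the append_row output action expects from the run function.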

We need to indicate over which environment we're going to run the deployment. In our case, our model runs on Torch and it requires the libraries transformers, accelerate, and optimum from HuggingFace. Azure Machine Learning already has an environment with Torch and GPU support available. We're just going to add a couple of dependencies in a conda.yaml file.

environment/torch200-conda.yaml
```
name: huggingface-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip
  - pip:
    - torch==2.0
    - transformers
    - accelerate
    - optimum
    - datasets
    - mlflow
    - azureml-mlflow
    - azureml-core
    - azureml-dataset-runtime[fuse]
```
We can use the conda file mentioned before as follows:
- Azure CLI
- Python
The environment definition is included in the deployment file.

deployment.yml
```
compute: azureml:gpu-cluster
environment:
  name: torch200-transformers-gpu
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest
  conda_file: environment/torch200-conda.yaml
```
Let's get a reference to the environment:
```
environment = Environment(
    name="torch200-transformers-gpu",
    conda_file="environment/torch200-conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest",
)
```
Important

The environment torch200-transformers-gpu we've created requires a CUDA 11.8 compatible hardware device to run Torch 2.0 and Ubuntu 22.04. If your GPU device doesn't support this version of CUDA, you can check the alternative torch113-conda.yaml conda environment (also available on the repository), which runs Torch 1.3 over Ubuntu 18.04 with CUDA 10.1. However, acceleration using the optimum and accelerate libraries isn't supported on this configuration.
Each deployment runs on compute clusters. Batch deployments support both Azure Machine Learning Compute clusters (AmlCompute) and Kubernetes clusters. In this example, our model can benefit from GPU acceleration, which is why we use a GPU cluster.
- Azure CLI
- Python
```
az ml compute create -n gpu-cluster --type amlcompute --size STANDARD_NV6 --min-instances 0 --max-instances 2
```
```
compute_name = "gpu-cluster"
compute_cluster = AmlCompute(
    name=compute_name,
    description="GPU cluster compute",
    size="Standard_NV6",
    min_instances=0,
    max_instances=2,
)
ml_client.begin_create_or_update(compute_cluster)
```
Note

You aren't charged for compute at this point, because the cluster remains at zero nodes until a batch endpoint is invoked and a batch scoring job is submitted. Learn more about how to manage and optimize cost for AmlCompute.

Now, let's create the deployment.

Azure CLI
Python

To create a new deployment under the created endpoint, create a YAML configuration like the following. You can check the full batch endpoint YAML schema for extra properties.

deployment.yml

$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
endpoint_name: text-summarization-batch
name: text-summarization-optimum
description: A text summarization deployment implemented with HuggingFace and BART architecture with GPU optimization using Optimum.
type: model
model: azureml:bart-text-summarization@latest
compute: azureml:gpu-cluster
environment:
  name: torch200-transformers-gpu
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04:latest
  conda_file: environment/torch200-conda.yaml
code_configuration:
  code: code
  scoring_script: batch_driver.py
resources:
  instance_count: 2
settings:
  max_concurrency_per_instance: 1
  mini_batch_size: 1
  output_action: append_row
  output_file_name: predictions.csv
  retry_settings:
    max_retries: 1
    timeout: 3000
  error_threshold: -1
  logging_level: info

Then, create the deployment with the following command:

az ml batch-deployment create --file deployment.yml --endpoint-name $ENDPOINT_NAME --set-default

To create a new deployment with the indicated environment and scoring script, use the following code:

deployment = BatchDeployment(
    name="text-summarization-hfbart",
    description="A text summarization deployment implemented with HuggingFace and BART architecture",
    endpoint_name=endpoint.name,
    model=model,
    environment=environment,
    code_configuration=CodeConfiguration(
        code="code",
        scoring_script="batch_driver.py",
    ),
    compute=compute_name,
    instance_count=2,
    max_concurrency_per_instance=1,
    mini_batch_size=1,
    output_action=BatchDeploymentOutputAction.APPEND_ROW,
    output_file_name="predictions.csv",
    retry_settings=BatchRetrySettings(max_retries=3, timeout=3000),
    logging_level="info",
)

Then, create the deployment with the following command:

ml_client.batch_deployments.begin_create_or_update(deployment)

Important

You'll notice in this deployment a high value in timeout in the parameter retry_settings. The reason involves the nature of the model we're running. This is a very expensive model and inference on a single row can take up to 60 seconds. The timeout parameter controls how much time the Batch Deployment should wait for the scoring script to finish processing each mini-batch. Since our model runs predictions row by row, processing a long file can take time. Also notice that the number of files per batch is set to 1 (mini_batch_size=1). This is again related to the nature of the work we're doing. Processing one file at a time per batch is expensive enough to justify it. You'll notice this being a pattern in NLP processing.
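
The relationship between the timeout, the per-row cost, and the size of a mini-batch is simple arithmetic. The following sketch works it through with the figures quoted above (a 3,000-second timeout and up to 60 seconds per row). The function names, the 40-row file, and the safety factor are illustrative assumptions.

```python
import math


def max_rows_per_mini_batch(timeout_seconds: float, seconds_per_row: float) -> int:
    """Largest number of rows a mini-batch can hold and still finish in time."""
    return math.floor(timeout_seconds / seconds_per_row)


def required_timeout(
    rows_per_file: int,
    files_per_mini_batch: int,
    seconds_per_row: float,
    safety_factor: float = 1.25,
) -> int:
    """Timeout in seconds for one mini-batch, padded by a safety factor."""
    total_rows = rows_per_file * files_per_mini_batch
    return math.ceil(total_rows * seconds_per_row * safety_factor)


# With the worst case of 60 seconds per row, a 3,000-second timeout covers 50 rows.
print(max_rows_per_mini_batch(timeout_seconds=3000, seconds_per_row=60))

# A hypothetical 40-row file with mini_batch_size=1 and 25% headroom needs 3,000 seconds.
print(required_timeout(rows_per_file=40, files_per_mini_batch=1, seconds_per_row=60))
```

If your files hold more rows than the first number, either raise the timeout or split the files before you submit the job.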

Although you can invoke a specific deployment inside of an endpoint, you usually want to invoke the endpoint itself and let the endpoint decide which deployment to use. That deployment is called the default deployment. This lets you change the default deployment, and hence the model that serves requests, without changing the contract with the user invoking the endpoint. Use the following instruction to update the default deployment:
- Azure CLI
- Python
```
DEPLOYMENT_NAME="text-summarization-optimum"
az ml batch-endpoint update --name $ENDPOINT_NAME --set defaults.deployment_name=$DEPLOYMENT_NAME
```
```
endpoint.defaults.deployment_name = deployment.name
ml_client.batch_endpoints.begin_create_or_update(endpoint)
```
At this point, our batch endpoint is ready to be used.

Testing out the deployment

For testing our endpoint, we're going to use a sample of the dataset BillSum: A Corpus for Automatic Summarization of US Legislation. This sample is included in the repository in the folder data. Notice that the format of the data is CSV and the content to be summarized is under the column text, as expected by the model.
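
Because the scoring script reads the column text from every file, a file without that column fails only after the job has started. A quick local check before you invoke the endpoint can catch this earlier. The following is a minimal sketch that uses only the standard library. The helper name `files_missing_column` is an assumption made for this illustration.

```python
import csv
from pathlib import Path
from typing import List


def files_missing_column(folder: str, column: str = "text") -> List[str]:
    """Return the names of CSV files in `folder` whose header lacks `column`."""
    missing = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            # Only the header row is read; an empty file yields an empty header.
            header = next(csv.reader(f), [])
        if column not in header:
            missing.append(csv_path.name)
    return missing
```

For example, calling `files_missing_column("data")` from the sample folder returns an empty list when every CSV file has a text column.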

Let's invoke the endpoint:
- Azure CLI
- Python
```
JOB_NAME=$(az ml batch-endpoint invoke --name $ENDPOINT_NAME --input data --input-type uri_folder --query name -o tsv)
```
Note

The utility jq might not be installed on every installation. You can get instructions in this link.
Tip

What's the difference between the inputs and input parameter when you invoke an endpoint?

In general, you can use a dictionary inputs = {} parameter with the invoke method to provide an arbitrary number of required inputs to a batch endpoint that contains a model deployment or a pipeline deployment.

For a model deployment, you can use the input parameter as a shorter way to specify the input data location for the deployment. This approach works because a model deployment always takes only one data input.
```
input = Input(type=AssetTypes.URI_FOLDER, path="data")
job = ml_client.batch_endpoints.invoke(
   endpoint_name=endpoint.name,
   input=input,
)
```
Tip

Notice that by indicating a local path as an input, the data is uploaded to the default storage account of the Azure Machine Learning workspace.
A batch job is started as soon as the command returns. You can monitor the status of the job until it finishes:
- Azure CLI
- Python
```
az ml job show -n $JOB_NAME --web
```
```
ml_client.jobs.get(job.name)
```
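The Python call above returns the job's state at a single point in time. To wait until the job finishes, you can poll it in a loop. The following is a minimal sketch of such a loop, written against a plain callable so it doesn't depend on the SDK. The helper `wait_for_job`, the polling interval, and the set of terminal state names are assumptions to adapt to your environment.

```python
import time
from typing import Callable

# Assumed names of the states after which a job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    get_status: Callable[[], str], poll_seconds: float = 30.0, max_polls: int = 240
) -> str:
    """Call get_status() until it returns a terminal state or max_polls is reached."""
    status = get_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
        status = get_status()
    return status


# Demonstration with a fake status source instead of a real job.
fake_states = iter(["Queued", "Running", "Completed"])
print(wait_for_job(lambda: next(fake_states), poll_seconds=0))
```

With the client created earlier, the callable could be `lambda: ml_client.jobs.get(job.name).status`.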

Once the job is finished, we can download the predictions:

Azure CLI
Python

To download the predictions, use the following command:

az ml job download --name $JOB_NAME --output-name score --download-path .

ml_client.jobs.download(name=job.name, output_name='score', download_path='./')

Considerations when deploying models that process text

As mentioned in some of the notes throughout this tutorial, processing text can have some peculiarities that require specific configuration for batch deployments. Take the following considerations into account when designing the batch deployment:

Some NLP models may be very expensive in terms of memory and compute time. If this is the case, consider decreasing the number of files included in each mini-batch. In the previous example, the number was taken to the minimum, 1 file per batch. While this may not be your case, take into consideration how many files your model can score at a time. Keep in mind that the relationship between the size of the input and the memory footprint of your model may not be linear for deep learning models.
If your model can't even handle one file at a time (like in this example), consider reading the input data in rows/chunks. Implement batching at the row level if you need to achieve higher throughput or hardware utilization.
Set the timeout value of your deployment according to how expensive your model is and how much data you expect to process. Remember that the timeout indicates the time the batch deployment waits for your scoring script to run for a given batch. If your batch has many files or files with many rows, this affects the right value of this parameter.
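
Row-level batching, as suggested in the list above, sits between scoring one row at a time and scoring a whole file at once. The following is a minimal sketch of the grouping logic only. The names `batched_rows` and `summarize_in_batches`, the batch size, and the uppercase stand-in for the model are illustrative assumptions.

```python
from typing import Callable, Iterable, Iterator, List


def batched_rows(rows: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size rows without loading every row first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, group
        yield batch


def summarize_in_batches(
    rows: Iterable[str],
    summarize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Apply summarize_batch to each group and keep results in input order."""
    results: List[str] = []
    for batch in batched_rows(rows, batch_size):
        results.extend(summarize_batch(batch))
    return results


# Stand-in for a model call that scores several rows at once.
print(summarize_in_batches(["a", "b", "c"], lambda b: [s.upper() for s in b], 2))
```

In a real scoring script, `summarize_batch` would tokenize the group with padding and call the model once per group. The batch size that fits in memory depends on the model, the sequence lengths, and the GPU.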

Considerations for MLflow models that process text

The same considerations mentioned earlier apply to MLflow models. However, since you aren't required to provide a scoring script for your MLflow model deployment, some of the recommendations mentioned might require a different approach.

MLflow models in Batch Endpoints support reading tabular data as input data, which might contain long sequences of text. See File types support for details about which file types are supported.
Batch deployments call your MLflow model's predict function with the content of an entire file as a Pandas dataframe. If your input data contains many rows, chances are that running a complex model (like the one presented in this tutorial) results in an out-of-memory exception. If this is your case, you can consider:
- Customize how your model runs predictions and implement batching. To learn how to customize MLflow model's inference, see Log custom models.
- Author a scoring script and load your model using mlflow.<flavor>.load_model(). See Using MLflow models with a scoring script for details.