Train a model by using a custom Docker image

11/22/2024

This article describes how to use a custom Docker image to train models with Azure Machine Learning. Example scripts show how to classify images by creating a convolutional neural network.

Azure Machine Learning provides a default Docker base image. You can also use Azure Machine Learning environments to specify a different base image, such as a maintained Azure Machine Learning base image or your own custom image. Custom base images allow you to closely manage your dependencies and maintain tighter control over component versions when you run training jobs.

Prerequisites

To run the example code, your configuration must include one of the following environments:

- Azure Machine Learning compute instance with a dedicated notebook server preloaded with the Machine Learning SDK and Samples repository. This configuration requires no downloads or other installation. To prepare this environment, see Create resources to get started.
- Jupyter Notebook server. The following resources provide instructions to help you prepare this environment:
  - Create a workspace configuration file.
  - Install the Azure Machine Learning SDK.
  - Create an Azure container registry or other Docker registry available on the internet.

Set up a training experiment

The first task is to set up your training experiment by initializing a Machine Learning workspace, defining your environment, and configuring a compute target.

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It gives you a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a Workspace object.

As needed, create a Workspace object from the config.json file that you created as a prerequisite.

from azureml.core import Workspace

ws = Workspace.from_config()
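
Workspace.from_config() reads a config.json file that identifies the workspace through three keys: subscription_id, resource_group, and workspace_name. The following sketch writes and checks such a file by using only the Python standard library. The helper names write_workspace_config and missing_config_keys, the .azureml subfolder location, and the values passed in are illustrative assumptions, not part of the SDK.

```python
import json
from pathlib import Path

# Keys that identify a workspace in config.json
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write <folder>/.azureml/config.json and return its path (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    target = Path(folder) / ".azureml" / "config.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=4))
    return target


def missing_config_keys(path):
    """Return the required keys that are absent or empty in a config file."""
    config = json.loads(Path(path).read_text())
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

Running missing_config_keys on your own file before you call Workspace.from_config() can make a misconfigured workspace file easier to diagnose.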

Define your environment

Create an Environment object.

from azureml.core import Environment

fastai_env = Environment("fastai2")

The specified base image in the following code supports the fast.ai library, which allows for distributed deep-learning capabilities. For more information, see the fast.ai Docker Hub repository.

When you use your custom Docker image, you might already have your Python environment properly set up. In that case, set the user_managed_dependencies flag to True to use your custom image's built-in Python environment. By default, Azure Machine Learning builds a Conda environment with dependencies that you specified. The service runs the script in that environment instead of using any Python libraries that you installed on the base image.

fastai_env.docker.base_image = "fastdotai/fastai2:latest"
fastai_env.python.user_managed_dependencies = True
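
Because user_managed_dependencies tells the service to use the image's own Python environment, a package that's missing from the image surfaces only when the script runs. One possible safeguard is a short preflight check at the top of the training script. The following is a minimal sketch. The function names are hypothetical, and the modules that you check depend on what your own script imports.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level modules that can't be found in the current Python environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def require_modules(module_names):
    """Raise early, with a clear message, if the image lacks a required package."""
    missing = missing_modules(module_names)
    if missing:
        raise ImportError("Custom image is missing: " + ", ".join(missing))
```

For example, a script that depends on fast.ai might call require_modules(["fastai", "torch"]) before it starts any training work.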

Use a private container registry (optional)

To use an image from a private container registry that isn't in your workspace, use docker.base_image_registry to specify the address of the registry and a username and password:

# Set the container registry information
fastai_env.docker.base_image_registry.address = "myregistry.azurecr.cn"
fastai_env.docker.base_image_registry.username = "username"
fastai_env.docker.base_image_registry.password = "password"
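
To keep registry credentials out of source code, you might read them from environment variables instead of typing them as literals. The following sketch assumes variables named ACR_ADDRESS, ACR_USERNAME, and ACR_PASSWORD. These names and the registry_credentials helper are hypothetical, so adapt them to however your team stores secrets.

```python
import os


def registry_credentials(prefix="ACR"):
    """Read registry settings from environment variables such as ACR_ADDRESS."""
    names = ("ADDRESS", "USERNAME", "PASSWORD")
    values = {name.lower(): os.environ.get(f"{prefix}_{name}") for name in names}
    missing = [f"{prefix}_{name}" for name in names if not values[name.lower()]]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return values


# Possible usage with the environment object defined earlier:
# creds = registry_credentials()
# fastai_env.docker.base_image_registry.address = creds["address"]
# fastai_env.docker.base_image_registry.username = creds["username"]
# fastai_env.docker.base_image_registry.password = creds["password"]
```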

Use a custom Dockerfile (optional)

It's also possible to use a custom Dockerfile. Use this approach if you need to install non-Python packages as dependencies. Remember to set the base image to None.

# Specify Docker steps as a string
dockerfile = r"""
FROM mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210615.v1
RUN echo "Hello from custom container!"
"""

# Set the base image to None, because the image is defined by Dockerfile
fastai_env.docker.base_image = None
fastai_env.docker.base_dockerfile = dockerfile

# Alternatively, load the string from a file
fastai_env.docker.base_image = None
fastai_env.docker.base_dockerfile = "./Dockerfile"
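
When the goal is to install non-Python packages, you can also generate the Dockerfile string in code. The following sketch builds a Dockerfile that installs apt packages on top of a base image. The build_dockerfile helper is hypothetical, and the package names are examples only.

```python
def build_dockerfile(base_image, apt_packages):
    """Return Dockerfile text that installs the given apt packages on top of base_image."""
    lines = [f"FROM {base_image}"]
    if apt_packages:
        package_list = " ".join(sorted(apt_packages))
        # Install in a single layer and remove the apt cache to keep the image small
        lines.append(
            "RUN apt-get update && "
            f"apt-get install -y --no-install-recommends {package_list} && "
            "rm -rf /var/lib/apt/lists/*"
        )
    return "\n".join(lines) + "\n"


dockerfile = build_dockerfile(
    "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210615.v1",
    ["libgl1", "ffmpeg"],  # example packages only
)
```

You can assign the resulting string to base_dockerfile in the same way as the hand-written Dockerfile string.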

Important

Azure Machine Learning only supports Docker images that provide the following software:

- Ubuntu 18.04 or greater
- Conda 4.7.# or greater
- Python 3.7+
- A POSIX-compliant shell available at /bin/sh (required in any container image used for training)

For more information about creating and managing Azure Machine Learning environments, see Create and use software environments.
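
To check a custom image against the software requirements listed earlier, you could run a small script inside the container. The following sketch compares version strings with the stated minimums. The function names are hypothetical, and the sketch assumes that you pass in the output of conda --version and the contents of /etc/os-release.

```python
import os
import re
import sys


def parse_version(text):
    """Extract the first version number in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"No version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version_text, minimum):
    """Return True if the version in version_text is at least the minimum tuple."""
    return parse_version(version_text) >= minimum


def check_image(conda_version_text, os_release_text):
    """Return a dict that maps each requirement to True or False."""
    is_ubuntu = re.search(r'^ID="?ubuntu"?\s*$', os_release_text, re.MULTILINE) is not None
    version_id = re.search(r'^VERSION_ID="?([\d.]+)"?', os_release_text, re.MULTILINE)
    ubuntu_ok = (
        is_ubuntu
        and version_id is not None
        and meets_minimum(version_id.group(1), (18, 4))
    )
    return {
        "ubuntu>=18.04": ubuntu_ok,
        "conda>=4.7": meets_minimum(conda_version_text, (4, 7)),
        "python>=3.7": sys.version_info >= (3, 7),
        "/bin/sh": os.access("/bin/sh", os.X_OK),
    }


# Possible usage inside the container:
# import subprocess
# conda_text = subprocess.run(["conda", "--version"], capture_output=True, text=True).stdout
# print(check_image(conda_text, open("/etc/os-release").read()))
```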

Create or attach a compute target

You need to create a compute target for training your model. In this tutorial, you create AmlCompute as your training compute resource.

Creation of AmlCompute takes a few minutes. If the AmlCompute resource is already in your workspace, this code skips the creation process.

As with other Azure services, there are limits on certain resources (for example, AmlCompute) associated with the Azure Machine Learning service. For more information, see Default limits and how to request a higher quota.

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your cluster
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    # Create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current AmlCompute
print(compute_target.get_status().serialize())

Important

Use CPU SKUs for any image build on compute.

Configure your training job

For this tutorial, use the training script train.py on GitHub. In practice, you can take any custom training script and run it, as is, with Azure Machine Learning.

Create a ScriptRunConfig resource to configure your job for running on the desired compute target.

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory='fastai-example',
                      script='train.py',
                      compute_target=compute_target,
                      environment=fastai_env)
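
ScriptRunConfig also accepts an arguments list, which is passed to the script on the command line. The following sketch shows how a training script might parse such arguments with argparse. The --epochs and --learning-rate flags are hypothetical and aren't taken from the tutorial's train.py.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed on the command line (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Train an image classifier.")
    parser.add_argument("--epochs", type=int, default=1,
                        help="Number of passes over the training data")
    parser.add_argument("--learning-rate", type=float, default=1e-3,
                        help="Optimizer learning rate")
    return parser.parse_args(argv)


# Simulate the command line that the service would build from the arguments list
args = parse_args(["--epochs", "3", "--learning-rate", "0.01"])
```

On the submitting side, the matching setting would be arguments=['--epochs', 3, '--learning-rate', 0.01] in the ScriptRunConfig call.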

Submit your training job

When you submit a training run by using a ScriptRunConfig object, the submit method returns an object of type ScriptRun. The returned ScriptRun object gives you programmatic access to information about the training run.

from azureml.core import Experiment

run = Experiment(ws, 'Tutorial-fastai').submit(src)
run.wait_for_completion(show_output=True)

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, exclude it by using an .amlignore or .gitignore file, or don't include it in the source directory. Instead, access your data by using a datastore.
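
One way to keep files out of the uploaded copy is an .amlignore file in the source directory, which uses the same pattern syntax as .gitignore. The following sketch writes such a file. The write_amlignore helper and the example patterns are illustrative assumptions.

```python
from pathlib import Path


def write_amlignore(source_directory, patterns):
    """Write an .amlignore file that lists patterns to exclude from the upload."""
    ignore_file = Path(source_directory) / ".amlignore"
    ignore_file.write_text("\n".join(patterns) + "\n")
    return ignore_file


# Possible usage with the tutorial's source directory:
# write_amlignore("fastai-example", ["data/", "*.pem", ".env"])
```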