Introduction to Kubernetes compute target in Azure Machine Learning

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

With Azure Machine Learning CLI and Python SDK v2, Azure Machine Learning introduced a new compute target: the Kubernetes compute target. You can enable an existing Azure Kubernetes Service (AKS) cluster or Azure Arc-enabled Kubernetes (Arc Kubernetes) cluster as a Kubernetes compute target in Azure Machine Learning, and use it to train or deploy models.

Diagram illustrating how Azure Machine Learning connects to Kubernetes.

In this article, you learn about:

  • How it works
  • Usage scenarios
  • Recommended best practices
  • KubernetesCompute and legacy AksCompute

How it works

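Azure Machine Learning Kubernetes compute supports two kinds of Kubernetes clusters:

  • AKS cluster in Azure. With your self-managed AKS cluster in Azure, you gain the security and controls needed to meet compliance requirements, and the flexibility to manage your teams' ML workloads.
  • Arc Kubernetes cluster. With an Azure Arc-enabled Kubernetes cluster, you can train or deploy models on Kubernetes infrastructure that runs outside of Azure.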

After you deploy a cluster extension on AKS, the Kubernetes cluster is supported in Azure Machine Learning for running training or inference workloads. To enable and use an existing Kubernetes cluster for Azure Machine Learning workloads, follow these steps:

  1. Prepare an Azure Kubernetes Service cluster.
  2. Deploy the Azure Machine Learning extension.
  3. Attach Kubernetes cluster to your Azure Machine Learning workspace.
  4. Use the Kubernetes compute target from CLI v2, SDK v2, and the Studio UI.
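Steps 2 and 3 are typically performed with the Azure CLI. The following is a minimal sketch that only assembles and prints the two commands; it doesn't run them. The flag names reflect the `az k8s-extension create` and `az ml compute attach` commands as commonly documented, so verify them against the current CLI reference. All resource names, the extension name `azureml-ext`, and the helper function names are hypothetical placeholders.

```python
import shlex


def extension_create_args(cluster, resource_group, extension_name="azureml-ext"):
    """Step 2: arguments for deploying the Azure Machine Learning extension to AKS."""
    return [
        "az", "k8s-extension", "create",
        "--name", extension_name,  # hypothetical extension instance name
        "--extension-type", "Microsoft.AzureML.Kubernetes",
        "--cluster-type", "managedClusters",  # AKS cluster type
        "--cluster-name", cluster,
        "--resource-group", resource_group,
        "--scope", "cluster",
        # Training only; enabling inference needs additional settings.
        "--config", "enableTraining=True",
    ]


def compute_attach_args(subscription_id, cluster, cluster_rg,
                        workspace, workspace_rg, compute_name,
                        namespace="default"):
    """Step 3: arguments for attaching the cluster to a workspace."""
    resource_id = (
        f"/subscriptions/{subscription_id}/resourceGroups/{cluster_rg}"
        f"/providers/Microsoft.ContainerService/managedClusters/{cluster}"
    )
    return [
        "az", "ml", "compute", "attach",
        "--resource-group", workspace_rg,
        "--workspace-name", workspace,
        "--type", "Kubernetes",
        "--name", compute_name,
        "--resource-id", resource_id,
        "--identity-type", "SystemAssigned",
        "--namespace", namespace,
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own resources.
    print(shlex.join(extension_create_args("my-aks", "my-rg")))
    print(shlex.join(compute_attach_args(
        "<subscription-id>", "my-aks", "my-rg",
        "my-workspace", "my-ws-rg", "k8s-compute")))
```

Building the commands as argument lists keeps quoting explicit, and the same lists can be passed to `subprocess.run` once the values have been reviewed.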

IT-operations team. The IT-operations team is responsible for the first three steps: prepare an AKS cluster, deploy the Azure Machine Learning cluster extension, and attach the Kubernetes cluster to the Azure Machine Learning workspace. In addition to these essential compute setup steps, the IT-operations team also uses familiar tools such as Azure CLI or kubectl to take care of the following tasks for the data-science team:

  • Configure networking and security, such as an outbound proxy server connection or Azure Firewall configuration, inference router (azureml-fe) setup, SSL/TLS termination, and virtual network configuration.
  • Create and manage instance types for different ML workload scenarios to use compute resources efficiently.
  • Troubleshoot workload issues related to the Kubernetes cluster.

Data-science team. Once the IT-operations team finishes compute setup and creates the compute target(s), the data-science team can discover the list of available compute targets and instance types in the Azure Machine Learning workspace. These compute resources can be used for training or inference workloads. Data scientists specify the compute target name and instance type name using their preferred tools or APIs, such as Azure Machine Learning CLI v2, Python SDK v2, or the Studio UI.
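As an illustration of where these two names go, the sketch below builds the body of a command job as a plain dictionary. The `compute` and `resources.instance_type` field names follow the CLI v2 command job YAML schema as commonly documented; the compute name `k8s-compute`, the instance type `gpu-small`, the environment reference, and the helper function are hypothetical.

```python
import json


def command_job_spec(compute_name, instance_type, command, environment):
    """Return a command job body that targets a Kubernetes compute target."""
    return {
        "command": command,
        "environment": environment,
        # The compute target name chosen when the cluster was attached.
        "compute": f"azureml:{compute_name}",
        # The instance type created by the IT-operations team.
        "resources": {"instance_type": instance_type},
    }


if __name__ == "__main__":
    spec = command_job_spec(
        compute_name="k8s-compute",      # hypothetical compute target
        instance_type="gpu-small",       # hypothetical instance type
        command="python train.py",
        environment="azureml:my-training-env@latest",  # hypothetical
    )
    print(json.dumps(spec, indent=2))
```

The data-science team needs only these two names; node selection and resource limits remain encapsulated in the instance type.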

Recommended best practices

Separation of responsibilities between the IT-operations team and data-science team. Managing your own compute and infrastructure for ML workloads is a complex task. It's best handled by the IT-operations team so that the data-science team can focus on ML models, which improves organizational efficiency.

Create and manage instance types for different ML workload scenarios. Each ML workload uses different amounts of compute resources such as CPU/GPU and memory. Azure Machine Learning implements an instance type as a Kubernetes custom resource definition (CRD) with nodeSelector and resource request/limit properties. With a carefully curated list of instance types, the IT-operations team can target ML workloads to specific node(s) and manage compute resource utilization efficiently.
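To make the CRD shape concrete, here is a minimal sketch that builds an instance type manifest and prints it as JSON, which `kubectl apply -f` accepts. The `apiVersion` value and the `InstanceType` kind follow the published instance type examples as best recalled, so treat them as assumptions to verify. The instance type name, node label, and resource values are hypothetical.

```python
import json


def instance_type_manifest(name, node_selector, requests, limits):
    """Build an instance type custom resource as a dictionary."""
    return {
        "apiVersion": "amlarc.azureml.com/v1alpha1",  # assumed API version
        "kind": "InstanceType",
        "metadata": {"name": name},
        "spec": {
            # Schedule workloads only on nodes carrying these labels.
            "nodeSelector": node_selector,
            # Standard Kubernetes resource requests and limits.
            "resources": {"requests": requests, "limits": limits},
        },
    }


if __name__ == "__main__":
    manifest = instance_type_manifest(
        name="gpu-small",                          # hypothetical name
        node_selector={"accelerator": "nvidia"},   # hypothetical label
        requests={"cpu": "700m", "memory": "1500Mi"},
        limits={"cpu": "1", "memory": "2Gi", "nvidia.com/gpu": 1},
    )
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and applying it with kubectl would create the instance type, assuming the Azure Machine Learning extension has installed the CRD on the cluster.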

Multiple Azure Machine Learning workspaces share the same Kubernetes cluster. You can attach a Kubernetes cluster multiple times to the same Azure Machine Learning workspace or to different workspaces, creating multiple compute targets in one workspace or across multiple workspaces. Since many customers organize data science projects around Azure Machine Learning workspaces, multiple data science projects can share the same Kubernetes cluster. Sharing a cluster significantly reduces ML infrastructure management overhead and IT cost.

Team/project workload isolation using Kubernetes namespaces. When you attach a Kubernetes cluster to an Azure Machine Learning workspace, you can specify a Kubernetes namespace for the compute target. All workloads run by the compute target are placed under the specified namespace.
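The two practices above can be combined: one cluster, attached once per team, each attachment with its own namespace. The sketch below renders one `az ml compute attach` command per attachment without running anything. The workspace names, resource groups, compute names, and namespaces are hypothetical, and the flag names should be checked against the current CLI reference.

```python
import shlex
from dataclasses import dataclass


@dataclass
class Attachment:
    """One compute target created from a shared cluster."""
    workspace: str
    workspace_rg: str
    compute_name: str
    namespace: str


def attach_commands(cluster_resource_id, attachments):
    """Render one attach command per (workspace, namespace) pairing."""
    commands = []
    for a in attachments:
        commands.append(shlex.join([
            "az", "ml", "compute", "attach",
            "--resource-group", a.workspace_rg,
            "--workspace-name", a.workspace,
            "--type", "Kubernetes",
            "--name", a.compute_name,
            "--resource-id", cluster_resource_id,  # same cluster every time
            "--namespace", a.namespace,            # per-team isolation
        ]))
    return commands


if __name__ == "__main__":
    shared_cluster = (
        "/subscriptions/<subscription-id>/resourceGroups/<rg>"
        "/providers/Microsoft.ContainerService/managedClusters/<cluster>"
    )
    plan = [
        Attachment("ws-vision", "rg-vision", "k8s-vision", "team-vision"),
        Attachment("ws-nlp", "rg-nlp", "k8s-nlp", "team-nlp"),
    ]
    for line in attach_commands(shared_cluster, plan):
        print(line)
```

Keeping the plan as data makes it easy to review which team's workloads land in which namespace before anything is attached.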

KubernetesCompute and legacy AksCompute

With Azure Machine Learning CLI/Python SDK v1, you can deploy models on AKS using the AksCompute target. Both the KubernetesCompute target and the AksCompute target support AKS integration, but they support it differently. The following table shows their key differences:

| Capabilities | AKS integration with AksCompute (legacy) | AKS integration with KubernetesCompute |
|---|---|---|
| CLI/SDK v1 | Yes | No |
| CLI/SDK v2 | No | Yes |
| Training | No | Yes |
| Real-time inference | Yes | Yes |
| Batch inference | No | Yes |
| Real-time inference new features | No new features development | Active roadmap |

Given these key differences and the overall evolution of Azure Machine Learning toward SDK/CLI v2, Azure Machine Learning recommends that you use the Kubernetes compute target if you decide to use AKS for model deployment.

Other resources

Examples

All Azure Machine Learning examples can be found in https://github.com/Azure/azureml-examples.git.

To run any Azure Machine Learning example on Kubernetes, you only need to change the compute target name in the example to the name of your Kubernetes compute target.
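For the CLI v2 examples, this change usually amounts to editing one line of the job YAML. The following is a minimal sketch, assuming the example file has a top-level `compute:` key; indented keys, such as per-step compute settings in pipeline jobs, are deliberately left untouched. The helper name and the compute names are hypothetical.

```python
import re


def retarget_compute(job_yaml, compute_name):
    """Point the top-level `compute:` key at a Kubernetes compute target."""
    pattern = re.compile(r"^(compute:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids backslash escapes in the new value.
    return pattern.sub(
        lambda m: f"{m.group(1)}azureml:{compute_name}", job_yaml)


if __name__ == "__main__":
    original = (
        "command: python train.py\n"
        "compute: azureml:cpu-cluster\n"
    )
    print(retarget_compute(original, "k8s-compute"))
```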

Next steps