Create an Azure Machine Learning compute cluster

Learn how to create and manage a compute cluster in your Azure Machine Learning workspace.

You can use an Azure Machine Learning compute cluster to distribute a training or batch inference process across a cluster of CPU or GPU compute nodes in the cloud. For more information on the VM sizes that include GPUs, see GPU-optimized virtual machine sizes.

In this article, learn how to:

  • Create a compute cluster
  • Lower your compute cluster cost
  • Set up a managed identity for the cluster

Prerequisites

What is a compute cluster?

An Azure Machine Learning compute cluster is a managed compute infrastructure that lets you easily create single-node or multi-node compute. The compute is created within your workspace region as a resource that can be shared with other users in your workspace. The compute scales up automatically when a job is submitted, and can be put in an Azure Virtual Network. The compute executes in a containerized environment and packages your model dependencies in a Docker container.

Compute clusters can run jobs securely in a virtual network environment, without requiring enterprises to open up SSH ports.

Limitations

  • Do not create multiple, simultaneous attachments to the same compute from your workspace, for example by attaching one compute cluster to a workspace under two different names. Each new attachment breaks the previously existing attachments.

    If you want to re-attach a compute target, for example to change cluster configuration settings, you must first remove the existing attachment.

  • Some of the scenarios listed in this document are marked as preview. Preview functionality is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

  • Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see Manage and request quotas for Azure resources.

  • Azure allows you to place locks on resources so that they cannot be deleted or are read-only. Do not apply resource locks to the resource group that contains your workspace. Applying a lock to that resource group prevents scaling operations for Azure ML compute clusters. For more information on locking resources, see Lock resources to prevent unexpected changes.

Tip

Clusters can generally scale up to 100 nodes as long as you have enough quota for the number of cores required. By default, clusters are set up with inter-node communication enabled, for example to support MPI jobs. However, you can scale your clusters to thousands of nodes by raising a support ticket and requesting that your subscription, workspace, or a specific cluster be added to the allowlist for disabling inter-node communication.
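
The quota arithmetic behind this tip can be sketched as follows. This is a simplified illustration rather than a service API: the function name and the `node_cap` parameter are hypothetical, and the 2-core figure assumes a VM size such as STANDARD_D2_V2 with 2 vCPUs per node.

```python
def max_nodes_within_quota(core_quota: int, cores_per_node: int, node_cap: int = 100) -> int:
    """Return the largest node count that fits both the core quota and the node cap.

    node_cap models the general 100-node limit mentioned in the tip (hypothetical parameter).
    """
    if core_quota < 0 or cores_per_node <= 0:
        raise ValueError("core_quota must be >= 0 and cores_per_node must be > 0")
    # Each node consumes cores_per_node cores from the quota.
    return min(core_quota // cores_per_node, node_cap)


# A quota of 24 cores with 2-core nodes allows 12 nodes.
print(max_nodes_within_quota(24, 2))    # 12
# With a large quota, the general node cap becomes the limiting factor.
print(max_nodes_within_quota(1000, 2))  # 100
```

If the result is lower than the `max_nodes` you intend to set, the quota would likely need to be raised before the cluster can scale that far.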

Create

Time estimate: Approximately 5 minutes.

Azure Machine Learning Compute can be reused across runs. The compute can be shared with other users in the workspace and is retained between runs, automatically scaling nodes up or down based on the number of runs submitted and the max_nodes set on your cluster. The min_nodes setting controls the minimum number of nodes available.

The dedicated cores per region per VM family quota and the total regional quota, which apply to compute cluster creation, are unified and shared with the Azure Machine Learning training compute instance quota.

Important

To avoid charges when no jobs are running, set the minimum nodes to 0. This setting allows Azure Machine Learning to de-allocate the nodes when they aren't in use. Any value larger than 0 keeps that number of nodes running, even if they are not in use.

With the minimum set to 0, the compute autoscales down to zero nodes when it isn't used. Dedicated VMs are created to run your jobs as needed.
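
The effect of min_nodes and max_nodes on scaling can be pictured with a small model. This is only a sketch of the clamping behavior described above, not the service's actual scheduler: the function name is hypothetical, and it ignores timing and scheduling details.

```python
def target_node_count(nodes_requested: int, min_nodes: int, max_nodes: int) -> int:
    """Clamp the number of nodes requested by queued runs to [min_nodes, max_nodes]."""
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    return max(min_nodes, min(nodes_requested, max_nodes))


# No work and min_nodes=0: the cluster can scale down to zero nodes (no idle cost).
print(target_node_count(0, min_nodes=0, max_nodes=4))   # 0
# More work than the cluster allows: capped at max_nodes.
print(target_node_count(10, min_nodes=0, max_nodes=4))  # 4
# No work but min_nodes=2: two nodes stay allocated and keep incurring charges.
print(target_node_count(0, min_nodes=2, max_nodes=4))   # 2
```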

To create a persistent Azure Machine Learning Compute resource in Python, specify the vm_size and max_nodes properties. Azure Machine Learning then uses smart defaults for the other properties.

  • vm_size: The VM family of the nodes created by Azure Machine Learning Compute.
  • max_nodes: The maximum number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute.
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Load the workspace from a local config.json (assumes one is available)
ws = Workspace.from_config()

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

# Block until provisioning finishes, streaming status output
cpu_cluster.wait_for_completion(show_output=True)

You can also configure several advanced properties when you create Azure Machine Learning Compute. The properties allow you to create a persistent cluster of fixed size, or to create it within an existing Azure Virtual Network in your subscription. See the AmlCompute class for details.
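
As a rough sketch of how such advanced settings might be assembled, the helper below builds a plain dictionary of keyword arguments. The helper itself is hypothetical. The keyword names (min_nodes, vnet_resourcegroup_name, vnet_name, subnet_name) are assumed to match the parameters of AmlCompute.provisioning_configuration and should be checked against the AmlCompute class reference before use.

```python
from typing import Any, Dict, Optional


def advanced_cluster_settings(
    vm_size: str,
    max_nodes: int,
    fixed_size: bool = False,
    vnet_resourcegroup_name: Optional[str] = None,
    vnet_name: Optional[str] = None,
    subnet_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a cluster configuration (hypothetical helper)."""
    settings: Dict[str, Any] = {"vm_size": vm_size, "max_nodes": max_nodes}
    # A fixed-size cluster keeps min_nodes equal to max_nodes, so it never scales down.
    settings["min_nodes"] = max_nodes if fixed_size else 0

    vnet_parts = (vnet_resourcegroup_name, vnet_name, subnet_name)
    if any(vnet_parts):
        # In this sketch, the virtual network is only set when all three parts are given.
        if not all(vnet_parts):
            raise ValueError("provide the resource group, network, and subnet together")
        settings.update(
            vnet_resourcegroup_name=vnet_resourcegroup_name,
            vnet_name=vnet_name,
            subnet_name=subnet_name,
        )
    return settings


# Placeholder names below are illustrative only.
settings = advanced_cluster_settings(
    "STANDARD_D2_V2", 4, fixed_size=True,
    vnet_resourcegroup_name="<resource_group>", vnet_name="<vnet>", subnet_name="<subnet>",
)
print(settings)
```

With the azureml-core SDK installed, the resulting dictionary could then be passed as `AmlCompute.provisioning_configuration(**settings)`. Note that a fixed-size cluster keeps all of its nodes running and therefore incurs charges even when idle.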

Set up managed identity

Azure Machine Learning compute clusters also support managed identities to authenticate access to Azure resources without including credentials in your code. There are two types of managed identities:

  • A system-assigned managed identity is enabled directly on the Azure Machine Learning compute cluster. The life cycle of a system-assigned identity is directly tied to the compute cluster. If the compute cluster is deleted, Azure automatically cleans up the credentials and the identity in Azure AD.
  • A user-assigned managed identity is a standalone Azure resource provided through the Azure Managed Identity service. You can assign a user-assigned managed identity to multiple resources, and it persists for as long as you want.
  • Configure managed identity in your provisioning configuration:

    • System-assigned managed identity:

      # configure cluster with a system-assigned managed identity
      compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                              max_nodes=5,
                                                              identity_type="SystemAssigned",
                                                              )
      
    • User-assigned managed identity:

      # configure cluster with a user-assigned managed identity
      compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                              max_nodes=5,
                                                              identity_type="UserAssigned",
                                                              identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])
      
      cpu_cluster_name = "cpu-cluster"
      cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
      
  • Add managed identity to an existing compute cluster:

    • System-assigned managed identity:

      # add a system-assigned managed identity
      cpu_cluster.add_identity(identity_type="SystemAssigned")
      
    • User-assigned managed identity:

      # add a user-assigned managed identity
      cpu_cluster.add_identity(identity_type="UserAssigned",
                               identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])
      

Note

Azure Machine Learning compute clusters support either one system-assigned identity or multiple user-assigned identities, but not both at the same time.
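
The resource ID format used in the snippets above, together with the rule in the note, can be sketched in plain Python. Both helper functions are hypothetical and exist only for illustration. They do not call Azure, and the check mirrors the note rather than any validation the SDK is known to perform.

```python
from typing import List


def user_assigned_identity_id(subscription_id: str, resource_group: str, identity_name: str) -> str:
    """Build a user-assigned identity resource ID in the format shown above."""
    return (
        f"/subscriptions/{subscription_id}/resourcegroups/{resource_group}"
        f"/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity_name}"
    )


def check_identity_settings(identity_type: str, identity_ids: List[str]) -> None:
    """Raise ValueError if the settings mix system-assigned and user-assigned identities."""
    if identity_type == "SystemAssigned":
        if identity_ids:
            raise ValueError("a system-assigned identity cannot be combined with identity IDs")
    elif identity_type == "UserAssigned":
        if not identity_ids:
            raise ValueError("a user-assigned identity needs at least one identity ID")
    else:
        raise ValueError(f"unknown identity type: {identity_type!r}")


# Placeholder values are illustrative only.
identity = user_assigned_identity_id("<subscription_id>", "<resource_group>", "<user_assigned_identity>")
check_identity_settings("UserAssigned", [identity])
print(identity)
```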

Managed identity usage

The default managed identity is the system-assigned managed identity or the first user-assigned managed identity.

During a run there are two applications of an identity:

  1. The system uses an identity to set up the user's storage mounts, container registry, and datastores.

    • In this case, the system uses the default managed identity.
  2. The user applies an identity to access resources from within the code for a submitted run.

    • In this case, provide the client_id corresponding to the managed identity you want to use to retrieve a credential.
    • Alternatively, get the user-assigned identity's client ID through the DEFAULT_IDENTITY_CLIENT_ID environment variable.

    For example, to retrieve a token for a datastore with the default managed identity:

    import os
    from azure.identity import ManagedIdentityCredential

    client_id = os.environ.get('DEFAULT_IDENTITY_CLIENT_ID')
    credential = ManagedIdentityCredential(client_id=client_id)
    token = credential.get_token('https://storage.azure.com/')
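
The choice between an explicitly provided client ID and the DEFAULT_IDENTITY_CLIENT_ID environment variable can be sketched as a small fallback. The function name and the `environ` parameter are hypothetical. Only the environment variable name comes from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_client_id(
    explicit_client_id: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer an explicitly chosen client ID, else fall back to the default identity's ID."""
    if explicit_client_id:
        return explicit_client_id
    # Returns None when the variable is not set, for example outside a submitted run.
    return environ.get("DEFAULT_IDENTITY_CLIENT_ID")


# Simulated environment of a submitted run (the value is a made-up placeholder).
fake_env = {"DEFAULT_IDENTITY_CLIENT_ID": "default-client-id"}
print(resolve_client_id(environ=fake_env))                      # default-client-id
print(resolve_client_id("other-client-id", environ=fake_env))   # other-client-id
```

The resolved value could then be passed as the `client_id` argument when constructing the credential.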
    

Troubleshooting

Some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service or create a new workspace through the portal or the SDK to unblock yourself immediately.

If your Azure Machine Learning compute cluster appears stuck at resizing (0 -> 0) for the node state, Azure resource locks may be the cause.

Azure allows you to place locks on resources so that they cannot be deleted or are read-only. Locking a resource can lead to unexpected results. Some operations that don't seem to modify the resource actually require actions that are blocked by the lock.

For example, applying a delete lock to the resource group for your workspace prevents scaling operations for Azure ML compute clusters.

For more information on locking resources, see Lock resources to prevent unexpected changes.

Next steps

Use your compute cluster to: