What are compute targets in Azure Machine Learning?

A compute target is a designated compute resource or environment where you run your training script or host your service deployment. This location might be your local machine or a cloud-based compute resource. Using compute targets makes it easy for you to later change your compute environment without having to change your code.

In a typical model development lifecycle, you might:

  1. Start by developing and experimenting on a small amount of data. At this stage, use your local environment, such as a local computer or cloud-based virtual machine (VM), as your compute target.
  2. Scale up to larger data, or do distributed training by using one of these training compute targets.
  3. After your model is ready, deploy it to a web hosting environment or IoT device with one of these deployment compute targets.
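For example, with the v1 Python SDK (azureml-core), moving between these stages is mostly a matter of pointing the same script at a different compute target. A minimal sketch, assuming a workspace config file and a hypothetical cluster named "cpu-cluster":

```python
from azureml.core import Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads config.json for an existing workspace

# The same training script runs at every stage; only compute_target changes.
config = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    compute_target="cpu-cluster",  # hypothetical cluster; use "local" while experimenting
)

run = Experiment(ws, "train-example").submit(config)
run.wait_for_completion(show_output=True)
```

Swapping "local" for the cluster name later requires no change to train.py itself.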

The compute resources you use for your compute targets are attached to a workspace. Compute resources other than the local machine are shared by users of the workspace.
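As a quick check of what a workspace already has attached, the v1 SDK exposes the attached targets as a dictionary; a minimal sketch:

```python
from azureml.core import Workspace

ws = Workspace.from_config()

# List the compute targets already attached to (and shared through) the workspace.
for name, target in ws.compute_targets.items():
    print(name, type(target).__name__)
```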

Training compute targets

Azure Machine Learning has varying support across different compute targets. A typical model development lifecycle starts with development or experimentation on a small amount of data. At this stage, use a local environment like your local computer or a cloud-based VM. As you scale up your training on larger datasets or perform distributed training, use Azure Machine Learning compute to create a single- or multi-node cluster that autoscales each time you submit a run. You can also attach your own compute resource, although support for different scenarios might vary.

Compute targets can be reused from one training job to the next. For example, once you attach a remote VM to your workspace, you can reuse it for multiple jobs. For machine learning pipelines, use the appropriate pipeline step for each compute target.
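A hedged sketch of per-step compute targets in a v1 SDK pipeline, assuming two hypothetical clusters named "cpu-cluster" and "gpu-cluster" and scripts prep.py and train.py:

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Each pipeline step can run on a different, reusable compute target.
prep = PythonScriptStep(
    name="prep-data",
    script_name="prep.py",
    source_directory="./src",
    compute_target="cpu-cluster",   # hypothetical CPU cluster
)
train = PythonScriptStep(
    name="train-model",
    script_name="train.py",
    source_directory="./src",
    compute_target="gpu-cluster",   # hypothetical GPU cluster
)
train.run_after(prep)

pipeline = Pipeline(workspace=ws, steps=[prep, train])
Experiment(ws, "pipeline-example").submit(pipeline)
```

The table below summarizes which training targets each authoring experience supports.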

| Training targets | Automated ML | ML pipelines | Azure Machine Learning designer |
| --- | --- | --- | --- |
| Local computer | yes | | |
| Azure Machine Learning compute cluster | yes & hyperparameter tuning | yes | yes |
| Azure Machine Learning compute instance | yes & hyperparameter tuning | yes | |
| Remote VM | yes & hyperparameter tuning | yes | |
| Azure HDInsight | | yes | |
| Azure Batch | | yes | |

Learn more about how to submit a training run to a compute target.

Compute targets for inference

The following compute resources can be used to host your model deployment.

| Compute target | Used for | GPU support | FPGA support | Description |
| --- | --- | --- | --- | --- |
| Local web service | Testing/debugging | | | Use for limited testing and troubleshooting. Hardware acceleration depends on use of libraries in the local system. |
| Azure Machine Learning compute instance web service | Testing/debugging | | | Use for limited testing and troubleshooting. |
| Azure Container Instances | Testing or development | | | Use for low-scale CPU-based workloads that require less than 48 GB of RAM. |
| Azure Machine Learning compute clusters | (Preview) Batch inference | Yes (machine learning pipeline) | | Run batch scoring on serverless compute. Supports normal and low-priority VMs. |
| Azure Functions | (Preview) Real-time inference | | | |
| Azure IoT Edge | (Preview) IoT module | | | Deploy and serve ML models on IoT devices. |

Note

Although compute targets like local, Azure Machine Learning compute instance, and Azure Machine Learning compute clusters support GPU for training and experimentation, using GPU for inference when deployed as a web service is supported only on Azure Kubernetes Service.

Using a GPU for inference when scoring with a machine learning pipeline is supported only on Azure Machine Learning compute.
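For illustration, a GPU-backed web-service deployment therefore requests GPU cores in an Azure Kubernetes Service deployment configuration. This is only a sketch using the v1 SDK; it assumes an AKS cluster with GPU nodes is already attached to the workspace under the hypothetical name "gpu-aks":

```python
from azureml.core import Workspace
from azureml.core.compute import AksCompute
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()

# GPU inference as a web service is supported only on Azure Kubernetes Service.
aks_target = AksCompute(ws, "gpu-aks")  # existing, attached AKS cluster with GPU nodes

aks_config = AksWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=4,
    gpu_cores=1,  # one GPU per replica
)
# Pass aks_config and aks_target to Model.deploy(...), together with an
# inference configuration whose environment includes the GPU runtime.
```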

When performing inference, Azure Machine Learning creates a Docker container that hosts the model and the associated resources needed to use it. This container is then used in one of the available deployment scenarios.

Learn where and how to deploy your model to a compute target.
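To make the container-based flow concrete, here is a minimal, hedged deployment sketch with the v1 SDK. It assumes a model already registered as "my-model", a scoring script score.py, and the curated "AzureML-Minimal" environment, and it deploys to Azure Container Instances for testing:

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="my-model")  # assumes a model registered under this name

# The entry script and environment describe the Docker container that
# Azure Machine Learning builds to host the model.
inference_config = InferenceConfig(
    entry_script="score.py",
    environment=Environment.get(ws, name="AzureML-Minimal"),
)

# Azure Container Instances suits low-scale, CPU-based test deployments.
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "my-test-service", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```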

Azure Machine Learning compute (managed)

A managed compute resource is created and managed by Azure Machine Learning. This compute is optimized for machine learning workloads. Azure Machine Learning compute clusters and compute instances are the only managed computes.

You can create Azure Machine Learning compute instances or compute clusters from interfaces such as Azure Machine Learning studio and the Python SDK.

When created, these compute resources are automatically part of your workspace, unlike other kinds of compute targets.
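A minimal sketch of provisioning a managed compute cluster with the v1 SDK, assuming the workspace config file is available and the hypothetical name "cpu-cluster" is free:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# A managed cluster that scales between 0 and 4 nodes. With min_nodes=0 it
# scales down to zero when idle, so you pay only while runs are executing.
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    min_nodes=0,
    max_nodes=4,
    idle_seconds_before_scaledown=1800,
)

cluster = ComputeTarget.create(ws, name="cpu-cluster", provisioning_configuration=config)
cluster.wait_for_completion(show_output=True)
```

The following table compares the two managed compute types.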

| Capability | Compute cluster | Compute instance |
| --- | --- | --- |
| Single- or multi-node cluster | ✓ | Single node |
| Autoscales each time you submit a run | ✓ | |
| Automatic cluster management and job scheduling | ✓ | ✓ |
| Support for both CPU and GPU resources | ✓ | ✓ |

Note

When a compute cluster is idle, it autoscales to 0 nodes, so you don't pay when it's not in use. A compute instance is always on and doesn't autoscale. You should stop the compute instance when you aren't using it to avoid extra cost.
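A small sketch of stopping an idle compute instance through the v1 SDK, where "my-instance" is a placeholder for a compute instance in the workspace:

```python
from azureml.core import Workspace

ws = Workspace.from_config()

# A compute instance doesn't autoscale, so stop it when you're not using it.
instance = ws.compute_targets["my-instance"]
instance.stop(wait_for_completion=True, show_output=True)
```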

Supported VM series and sizes

When you select a node size for a managed compute resource in Azure Machine Learning, you can choose from among select VM sizes available in Azure. Azure offers a range of sizes for Linux and Windows for different workloads. To learn more, see VM types and sizes.

There are a few exceptions and limitations to choosing a VM size:

  • Some VM series aren't supported in Azure Machine Learning.
  • Some VM series are restricted. To use a restricted series, contact support and request a quota increase for the series. For information on how to contact support, see Azure support options.

See the following table to learn more about supported series and restrictions.

| Supported VM series | Restrictions |
| --- | --- |
| D | None. |
| Dv2 | None. |
| Dv3 | None. |
| DSv2 | None. |
| DSv3 | None. |
| FSv2 | None. |
| HBv2 | Requires approval. |
| HCS | Requires approval. |
| M | Requires approval. |
| NC | None. |
| NCsv2 | Requires approval. |
| NCsv3 | Requires approval. |
| NDs | Requires approval. |
| NDv2 | Requires approval. |
| NV | None. |
| NVv3 | Requires approval. |

While Azure Machine Learning supports these VM series, they might not be available in all Azure regions. To check whether VM series are available, see Products available by region.

Note

Azure Machine Learning doesn't support all VM sizes that Azure Compute supports. To list the available VM sizes, you can query them programmatically, for example through the Python SDK as sketched below.
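A minimal sketch using the v1 SDK's AmlCompute.supported_vmsizes helper (the field names in the returned entries can vary by SDK version):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute

ws = Workspace.from_config()

# VM sizes Azure Machine Learning compute supports in the workspace's region
# (pass location="eastus", for example, to query another region).
for size in AmlCompute.supported_vmsizes(workspace=ws):
    print(size["name"])  # each entry also carries vCPU, GPU, and memory details
```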

Compute isolation

Azure Machine Learning compute offers VM sizes that are isolated to a specific hardware type and dedicated to a single customer. Isolated VM sizes are best suited for workloads that require a high degree of isolation from other customers' workloads for reasons that include meeting compliance and regulatory requirements. Utilizing an isolated size guarantees that your VM will be the only one running on that specific server instance.

The current isolated VM offerings include:

  • Standard_M128ms
  • Standard_F72s_v2
  • Standard_NC24s_v3
  • Standard_NC24rs_v3*

*RDMA capable

To learn more about isolation, see Isolation in the Azure public cloud.
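Requesting an isolated size works the same way as provisioning any other managed cluster; a hedged sketch that asks for Standard_F72s_v2 nodes under a hypothetical cluster name:

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Request one of the isolated sizes listed above so the nodes share no
# physical host with other customers' workloads.
config = AmlCompute.provisioning_configuration(
    vm_size="Standard_F72s_v2",
    min_nodes=0,
    max_nodes=2,
)

isolated_cluster = ComputeTarget.create(ws, "isolated-cluster", config)
isolated_cluster.wait_for_completion(show_output=True)
```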

Unmanaged compute

An unmanaged compute target is not managed by Azure Machine Learning. You create this type of compute target outside Azure Machine Learning and then attach it to your workspace. Unmanaged compute resources can require additional steps for you to maintain or to improve performance for machine learning workloads.
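A hedged sketch of attaching an existing VM as an unmanaged compute target with the v1 SDK; the address, username, key path, and target name are placeholders:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, RemoteCompute

ws = Workspace.from_config()

# Attach an existing VM, created outside Azure Machine Learning, over SSH.
attach_config = RemoteCompute.attach_configuration(
    address="203.0.113.10",           # placeholder public IP or FQDN
    ssh_port=22,
    username="azureuser",             # placeholder account on the VM
    private_key_file="~/.ssh/id_rsa",
)

remote_target = ComputeTarget.attach(ws, name="my-remote-vm", attach_configuration=attach_config)
remote_target.wait_for_completion(show_output=True)
```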

Next steps

Learn how to submit a training run to a compute target and how to deploy your model to a compute target.