Secure an Azure Machine Learning training environment with virtual networks

In this article, you learn how to secure training environments with a virtual network in Azure Machine Learning.

This article is part three of a five-part series that walks you through securing an Azure Machine Learning workflow.

See the other articles in this series:

1. Secure the workspace > 2. Secure the training environment > 3. Secure the inferencing environment > 4. Enable studio functionality

In this article, you learn how to secure the following training compute resources in a virtual network:

  • Azure Machine Learning compute cluster
  • Azure Machine Learning compute instance
  • Virtual machine
  • HDInsight cluster

Prerequisites

  • An existing virtual network and subnet to use with your compute resources.

  • To deploy resources into a virtual network or subnet, your user account must have permissions to the following actions in Azure role-based access control (RBAC):

    • "Microsoft.Network/virtualNetworks/join/action" on the virtual network resource.
    • "Microsoft.Network/virtualNetworks/subnets/join/action" on the subnet resource.

    For more information on RBAC with networking, see the Networking built-in roles article.

Compute clusters & instances

To use either a managed Azure Machine Learning compute target or an Azure Machine Learning compute instance in a virtual network, the following network requirements must be met:

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • The subnet that's specified for the compute instance or cluster must have enough unassigned IP addresses to accommodate the number of VMs that are targeted. If the subnet doesn't have enough unassigned IP addresses, a compute cluster is only partially allocated.
  • Check whether security policies or locks on the virtual network's subscription or resource group restrict the permissions needed to manage the virtual network. If you plan to secure the virtual network by restricting traffic, leave some ports open for the compute service. For more information, see the Required ports section.
  • If you're going to put multiple compute instances or clusters in one virtual network, you might need to request a quota increase for one or more of your resources.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Machine Learning compute instance or cluster.
  • For compute instance Jupyter functionality to work, ensure that web socket communication is not disabled. Ensure that your network allows websocket connections to *.instances.azureml.net and *.instances.azureml.ms.
  • When a compute instance is deployed in a private link workspace, it can only be accessed from within the virtual network. If you are using custom DNS or a hosts file, add an entry for <instance-name>.<region>.instances.azureml.ms with the private IP address of the workspace private endpoint. For more information, see the custom DNS article.
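
The subnet sizing requirement above can be checked with a little arithmetic before you create a cluster. The following is a minimal sketch, not an official tool: the function names are hypothetical, it assumes each compute node consumes one subnet IP address, and it relies on the fact that Azure reserves five addresses in every subnet.

```python
import ipaddress

# Azure reserves five addresses in every subnet
# (network address, default gateway, two for Azure DNS, broadcast).
AZURE_RESERVED_PER_SUBNET = 5


def free_ip_count(subnet_cidr: str, addresses_in_use: int) -> int:
    """Return how many unassigned addresses remain in the subnet."""
    subnet = ipaddress.ip_network(subnet_cidr, strict=True)
    usable = subnet.num_addresses - AZURE_RESERVED_PER_SUBNET
    return max(usable - addresses_in_use, 0)


def can_fit_nodes(subnet_cidr: str, addresses_in_use: int, max_nodes: int) -> bool:
    """Assumption: every node needs exactly one IP address from the subnet."""
    return free_ip_count(subnet_cidr, addresses_in_use) >= max_nodes


if __name__ == "__main__":
    # Hypothetical subnet with 10 addresses already taken by other resources
    print(free_ip_count("10.0.0.0/24", 10))
    print(can_fit_nodes("10.0.0.0/29", 0, 4))
```

If the check fails for the cluster's maximum node count, the cluster may be only partially allocated, as described in the requirement above.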

Tip

The Machine Learning compute instance or cluster automatically allocates additional networking resources in the resource group that contains the virtual network. For each compute instance or cluster, the service allocates the following resources:

  • One network security group
  • One public IP address
  • One load balancer

For clusters, these resources are deleted (and recreated) every time the cluster scales down to 0 nodes. For an instance, the resources are held until the instance is completely deleted (stopping the instance does not remove them). These resources are limited by the subscription's resource quotas.

Required ports

If you plan on securing the virtual network by restricting network traffic to and from the public internet, you must allow inbound communications from the Azure Batch service.

The Batch service adds network security groups (NSGs) at the level of the network interfaces (NICs) that are attached to VMs. These NSGs automatically configure inbound and outbound rules to allow the following traffic:

  • Inbound TCP traffic on ports 29876 and 29877 from a Service Tag of BatchNodeManagement.

    [Image: An inbound rule that uses the BatchNodeManagement Service Tag]

  • (Optional) Inbound TCP traffic on port 22 to permit remote access. Use this port only if you want to connect by using SSH on the public IP.

  • Outbound traffic on any port to the virtual network.

  • Outbound traffic on any port to the internet.

  • For compute instances, inbound TCP traffic on port 44224 from a Service Tag of AzureMachineLearning.

Important

Exercise caution if you modify or add inbound or outbound rules in Batch-configured NSGs. If an NSG blocks communication to the compute nodes, the compute service sets the state of the compute nodes to unusable.

You don't need to specify NSGs at the subnet level, because the Azure Batch service configures its own NSGs. However, if the subnet that contains the Azure Machine Learning compute has associated NSGs or a firewall, you must also allow the traffic listed earlier.

The NSG rule configuration in the Azure portal is shown in the following images:

[Image: Inbound NSG rules for Machine Learning Compute]

[Image: The inbound NSG rules for Machine Learning Compute]
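
The inbound rules listed in this section can be written down as data and compared against what a subnet-level NSG or firewall allows. The sketch below is illustrative only: the `InboundRule` type and `missing_required_rules` function are hypothetical names, and the `allowed` set stands in for whatever inventory of (service tag, port) pairs you extract from your own NSG.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class InboundRule(NamedTuple):
    source_tag: Optional[str]  # None: this section does not name a source for SSH
    port: int
    required: bool             # False for the optional SSH rule
    instance_only: bool        # True if the rule applies only to compute instances


# Inbound TCP rules as described in the Required ports section
INBOUND_RULES = [
    InboundRule("BatchNodeManagement", 29876, True, False),
    InboundRule("BatchNodeManagement", 29877, True, False),
    InboundRule(None, 22, False, False),
    InboundRule("AzureMachineLearning", 44224, True, True),
]


def missing_required_rules(
    allowed: Set[Tuple[str, int]], is_compute_instance: bool
) -> List[InboundRule]:
    """Return the required rules that are absent from the allowed set."""
    missing = []
    for rule in INBOUND_RULES:
        if not rule.required:
            continue
        if rule.instance_only and not is_compute_instance:
            continue
        if (rule.source_tag, rule.port) not in allowed:
            missing.append(rule)
    return missing


if __name__ == "__main__":
    allowed = {("BatchNodeManagement", 29876), ("BatchNodeManagement", 29877)}
    for rule in missing_required_rules(allowed, is_compute_instance=True):
        print(f"Missing inbound rule: {rule.source_tag} on TCP port {rule.port}")
```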

Limit outbound connectivity from the virtual network

If you don't want to use the default outbound rules and you do want to limit the outbound access of your virtual network, use the following steps:

  • Deny outbound internet connection by using the NSG rules.

  • For a compute instance or a compute cluster, limit outbound traffic to the following items:

    • Azure Storage, by using the Service Tag Storage.{RegionName}, where {RegionName} is the name of an Azure region.
    • Azure Container Registry, by using the Service Tag AzureContainerRegistry.{RegionName}, where {RegionName} is the name of an Azure region.
    • Azure Machine Learning, by using the Service Tag AzureMachineLearning.
    • Azure Resource Manager, by using the Service Tag AzureResourceManager.
    • Azure Active Directory, by using the Service Tag AzureActiveDirectory.

The NSG rule configuration in the Azure portal is shown in the following image:

[Image: The outbound NSG rules for Machine Learning Compute]

Note

If you plan on using default Docker images provided by Microsoft and enabling user-managed dependencies, you must also use the following Service Tags:

  • MicrosoftContainerRegistry
  • AzureFrontDoor.FirstParty

This configuration is needed when you have code similar to the following snippets as part of your training scripts:

RunConfig training

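from azureml.core.runconfig import RunConfiguration, DEFAULT_CPU_IMAGE

# Create a new RunConfiguration object
run_config = RunConfiguration()

# Configure Docker
run_config.environment.docker.enabled = True
# For GPU, use DEFAULT_GPU_IMAGE (also available from azureml.core.runconfig)
run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
# Use the packages already present in the image instead of building a new environment
run_config.environment.python.user_managed_dependencies = True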

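Estimator training

from azureml.train.estimator import Estimator

# script_params (a dict of script arguments) and exp (an Experiment object)
# are assumed to be defined earlier in your script.
est = Estimator(source_directory='.',
                script_params=script_params,
                compute_target='local',
                entry_script='dummy_train.py',
                user_managed=True)
run = exp.submit(est)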
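
The outbound Service Tags described in this section depend on the region and on whether you use the default Docker images with user-managed dependencies. As a rough sketch (the function name and the region value are hypothetical, and the region string must match the form that Service Tags use for your cloud), the list of tags to allow could be assembled like this:

```python
from typing import List


def outbound_service_tags(region_name: str, uses_default_images: bool = False) -> List[str]:
    """Collect the Service Tags to allow when outbound internet access is denied."""
    tags = [
        f"Storage.{region_name}",
        f"AzureContainerRegistry.{region_name}",
        "AzureMachineLearning",
        "AzureResourceManager",
        "AzureActiveDirectory",
    ]
    if uses_default_images:
        # Needed for Microsoft-provided default Docker images
        # combined with user-managed dependencies
        tags += ["MicrosoftContainerRegistry", "AzureFrontDoor.FirstParty"]
    return tags


if __name__ == "__main__":
    # "ChinaEast2" is only an illustrative region name
    for tag in outbound_service_tags("ChinaEast2", uses_default_images=True):
        print(tag)
```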

Forced tunneling

If you're using forced tunneling with Azure Machine Learning compute, you must allow communication with the public internet from the subnet that contains the compute resource. This communication is used for task scheduling and accessing Azure Storage.

There are two ways that you can accomplish this:

  • Use a Virtual Network NAT. A NAT gateway provides outbound internet connectivity for one or more subnets in your virtual network. For more information, see Designing virtual networks with NAT gateway resources.

  • Add user-defined routes (UDRs) to the subnet that contains the compute resource. Establish a UDR for each IP address that's used by the Azure Batch service in the region where your resources exist. These UDRs enable the Batch service to communicate with compute nodes for task scheduling. Also add the IP address for the Azure Machine Learning service in the region where the resources exist, because it is required for access to compute instances. To get a list of IP addresses of the Batch service and Azure Machine Learning service, use one of the following methods:

    • Download the Azure IP Ranges and Service Tags file and search it for BatchNodeManagement.<region> and AzureMachineLearning.<region>, where <region> is your Azure region.

    • Use the Azure CLI to download the information. The following example downloads the IP address information and keeps only the entries for the China East 2 region:

      az network list-service-tags -l "China East 2" --query "values[?starts_with(id, 'Batch')] | [?properties.region=='chinaeast2']"
      az network list-service-tags -l "China East 2" --query "values[?starts_with(id, 'AzureMachineLearning')] | [?properties.region=='chinaeast2']"
      

      Tip

      If you are using the US-Virginia, US-Arizona, or China-East-2 regions, these commands return no IP addresses. Instead, use one of the following links to download a list of IP addresses:

    When you add the UDRs, define the route for each related Batch IP address prefix and set Next hop type to Internet. The following image shows an example of this UDR in the Azure portal:

    [Image: Example of a UDR for an address prefix]

    Important

    The IP addresses may change over time.

    In addition to any UDRs that you define, outbound traffic to Azure Storage must be allowed through your on-premises network appliance. Specifically, the URLs for this traffic are in the following forms: <account>.table.core.windows.net, <account>.queue.core.windows.net, and <account>.blob.core.windows.net.

    For more information, see Create an Azure Batch pool in a virtual network.
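
To illustrate the UDR approach described above, the following sketch reads a parsed Service Tags document and turns the address prefixes of one tag into route definitions with a next hop type of Internet. It makes several assumptions: the `values`, `name`, and `properties.addressPrefixes` fields reflect the layout of the downloadable Service Tags file as commonly published, the route dictionary keys are hypothetical rather than a specific API payload, and the sample prefixes come from reserved documentation ranges, not from real Batch addresses.

```python
from typing import Dict, List


def prefixes_for_tag(service_tags: dict, tag_name: str) -> List[str]:
    """Return the address prefixes of the named Service Tag, or [] if absent."""
    for entry in service_tags.get("values", []):
        if entry.get("name", "").lower() == tag_name.lower():
            return list(entry.get("properties", {}).get("addressPrefixes", []))
    return []


def build_udrs(prefixes: List[str], name_prefix: str = "batch-route") -> List[Dict[str, str]]:
    """Create one route per prefix, each with next hop type Internet."""
    return [
        {"name": f"{name_prefix}-{i}", "addressPrefix": p, "nextHopType": "Internet"}
        for i, p in enumerate(prefixes, start=1)
    ]


if __name__ == "__main__":
    # Made-up sample data; real files are loaded with json.load()
    sample = {
        "values": [
            {
                "name": "BatchNodeManagement.ChinaEast2",
                "properties": {
                    "region": "chinaeast2",
                    "addressPrefixes": ["192.0.2.0/27", "198.51.100.8/29"],
                },
            }
        ]
    }
    for route in build_udrs(prefixes_for_tag(sample, "BatchNodeManagement.ChinaEast2")):
        print(route)
```

Because the IP addresses may change over time, a script along these lines would need to be rerun against a fresh download whenever the published ranges are updated.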

Create a compute cluster in a virtual network

To create a Machine Learning Compute cluster, use the following steps:

  1. Sign in to Azure Machine Learning studio, and then select your subscription and workspace.

  2. Select Compute on the left.

  3. Select Training clusters from the center, and then select +.

  4. In the New Training Cluster dialog, expand the Advanced settings section.

  5. To configure this compute resource to use a virtual network, perform the following actions in the Configure virtual network section:

    1. In the Resource group drop-down list, select the resource group that contains the virtual network.
    2. In the Virtual network drop-down list, select the virtual network that contains the subnet.
    3. In the Subnet drop-down list, select the subnet to use.

    [Image: The virtual network settings for Machine Learning Compute]

You can also create a Machine Learning Compute cluster by using the Azure Machine Learning SDK. The following code creates a new Machine Learning Compute cluster in the default subnet of a virtual network named mynetwork:

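from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Load the workspace; assumes a config.json file for your workspace is available locally
ws = Workspace.from_config()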

# The Azure virtual network name, subnet, and resource group
vnet_name = 'mynetwork'
subnet_name = 'default'
vnet_resourcegroup_name = 'mygroup'

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print("Found existing cpucluster")
except ComputeTargetException:
    print("Creating new cpucluster")

    # Specify the configuration for the new cluster
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2",
                                                           min_nodes=0,
                                                           max_nodes=4,
                                                           vnet_resourcegroup_name=vnet_resourcegroup_name,
                                                           vnet_name=vnet_name,
                                                           subnet_name=subnet_name)

    # Create the cluster with the specified name and configuration
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

    # Wait for the cluster to be completed, show the output log
    cpu_cluster.wait_for_completion(show_output=True)

When the creation process finishes, you train your model by using the cluster in an experiment. For more information, see Select and use a compute target for training.

Access data in a Compute Instance notebook

If you're using notebooks on an Azure Machine Learning compute instance, you must ensure that your notebook is running on a compute resource behind the same virtual network and subnet as your data.

You must configure your compute instance to be in the same virtual network during creation, under Advanced settings > Configure virtual network. You cannot add an existing compute instance to a virtual network.

Azure Databricks

To use Azure Databricks in a virtual network with your workspace, the following requirements must be met:

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Databricks cluster.
  • In addition to the databricks-private and databricks-public subnets used by Azure Databricks, the default subnet created for the virtual network is also required.

For specific information on using Azure Databricks with a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Virtual machine or HDInsight cluster

Important

Azure Machine Learning supports only virtual machines that are running Ubuntu.

In this section, you learn how to use a virtual machine or Azure HDInsight cluster in a virtual network with your workspace.

Create the VM or HDInsight cluster

Create a VM or HDInsight cluster by using the Azure portal or the Azure CLI, and put it in an Azure virtual network. For more information, see the following articles:

Configure network ports

To allow Azure Machine Learning to communicate with the SSH port on the VM or cluster, configure a source entry for the network security group. The SSH port is usually port 22. To allow traffic from this source, do the following actions:

  1. In the Source drop-down list, select Service Tag.

  2. In the Source service tag drop-down list, select AzureMachineLearning.

    [Image: Inbound rules for doing experimentation on a VM or HDInsight cluster within a virtual network]

  3. In the Source port ranges drop-down list, select *.

  4. In the Destination drop-down list, select Any.

  5. In the Destination port ranges drop-down list, select 22.

  6. Under Protocol, select Any.

  7. Under Action, select Allow.

Keep the default outbound rules for the network security group. For more information, see the default security rules in Security groups.

If you don't want to use the default outbound rules and you do want to limit the outbound access of your virtual network, see the Limit outbound connectivity from the virtual network section.
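
The portal steps above can also be expressed as an Azure CLI call. The sketch below only builds the argument list for `az network nsg rule create` and does not run it. The rule name and priority are hypothetical defaults, the resource group and NSG names are placeholders, and the portal value Any is assumed to correspond to `*` in the CLI.

```python
import shlex
from typing import List


def ssh_rule_command(
    resource_group: str,
    nsg_name: str,
    rule_name: str = "AllowAzureMLSsh",  # hypothetical rule name
    priority: int = 1000,                # hypothetical priority; pick one free in your NSG
) -> List[str]:
    """Build the CLI arguments for an inbound SSH rule from AzureMachineLearning."""
    return [
        "az", "network", "nsg", "rule", "create",
        "--resource-group", resource_group,
        "--nsg-name", nsg_name,
        "--name", rule_name,
        "--priority", str(priority),
        "--direction", "Inbound",
        "--access", "Allow",
        "--protocol", "*",
        "--source-address-prefixes", "AzureMachineLearning",
        "--source-port-ranges", "*",
        "--destination-address-prefixes", "*",
        "--destination-port-ranges", "22",
    ]


if __name__ == "__main__":
    cmd = ssh_rule_command("mygroup", "my-nsg")
    # Print a shell-safe version of the command for review before running it
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps each argument separate, so it can be reviewed or passed to `subprocess.run` without shell quoting problems.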

Attach the VM or HDInsight cluster

Attach the VM or HDInsight cluster to your Azure Machine Learning workspace. For more information, see Set up compute targets for model training.

Next steps

This article is part three in a three-part virtual network series. See the rest of the articles to learn how to secure a virtual network: