Create and attach an Azure Kubernetes Service cluster

Azure Machine Learning can deploy trained machine learning models to Azure Kubernetes Service. However, you must first either create an Azure Kubernetes Service (AKS) cluster from your Azure ML workspace or attach an existing AKS cluster. This article covers both creating and attaching a cluster.

Prerequisites

Limitations

  • If you need a Standard Load Balancer (SLB) deployed in your cluster instead of a Basic Load Balancer (BLB), create the cluster through the AKS portal, CLI, or SDK, and then attach it to the Azure Machine Learning (AML) workspace.

  • If you have an Azure Policy that restricts the creation of public IP addresses, AKS cluster creation will fail. AKS requires a public IP for egress traffic. The egress traffic article also provides guidance on locking down egress traffic from the cluster through the public IP, except for a few fully qualified domain names. There are two ways to enable a public IP:

    • The cluster can use the public IP created by default with the BLB or SLB, or
    • The cluster can be created without a public IP, and a public IP is then configured on a firewall with a user-defined route. For more information, see Customize cluster egress with a user-defined route.

    The AML control plane does not talk to this public IP. It talks to the AKS control plane for deployments.

  • If you attach an AKS cluster that has authorized IP ranges enabled for access to the API server, enable the AML control plane IP ranges for the AKS cluster. The AML control plane is deployed across paired regions and deploys inference pods on the AKS cluster. Without access to the API server, the inference pods cannot be deployed. Use the IP ranges for both paired regions when enabling the IP ranges in an AKS cluster.

    Authorized IP ranges only work with Standard Load Balancer.

  • The compute name for the AKS cluster must be unique within your Azure ML workspace.

    • A name is required and must be between 3 and 24 characters long.
    • Valid characters are uppercase and lowercase letters, digits, and the - character.
    • The name must start with a letter.
    • The name must be unique across all existing computes within an Azure region. You will see an alert if the name you choose is not unique.
  • If you want to deploy models to GPU nodes or FPGA nodes (or any specific SKU), you must create a cluster with that specific SKU. Creating a secondary node pool in an existing cluster and deploying models to the secondary node pool is not supported.

  • When creating or attaching a cluster, you can select whether the cluster is for dev-test or production. If you want to create an AKS cluster for development, validation, and testing instead of production, set the cluster purpose to dev-test. If you do not specify the cluster purpose, a production cluster is created.

    Important

    A dev-test cluster is not suitable for production-level traffic and may increase inference times. Dev-test clusters also do not guarantee fault tolerance.

  • When creating or attaching a cluster that will be used for production, the cluster must contain at least 12 virtual CPUs. The number of virtual CPUs is the number of nodes in the cluster multiplied by the number of cores provided by the selected VM size. For example, if you use a VM size of "Standard_D3_v2", which has 4 virtual cores, you should select 3 or more nodes.

    For a dev-test cluster, we recommend at least 2 virtual CPUs.

  • The Azure Machine Learning SDK does not support scaling an AKS cluster. To scale the nodes in the cluster, use the UI for your AKS cluster in Azure Machine Learning studio. You can only change the node count, not the VM size of the cluster. For more information on scaling the nodes in an AKS cluster, see the Azure Kubernetes Service documentation.
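The naming and sizing rules above can be checked locally before a create or attach request is submitted. The following is a minimal sketch, not part of the Azure Machine Learning SDK: the helper names (is_valid_compute_name, total_vcpus, meets_vcpu_minimum) are hypothetical, and the check covers only the name format, not uniqueness within the workspace or region.

```python
import re

# Format rules listed above: 3 to 24 characters, letters, digits, and '-',
# starting with a letter. Uniqueness cannot be checked locally.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,23}")

MIN_PRODUCTION_VCPUS = 12  # production minimum stated above
MIN_DEV_TEST_VCPUS = 2     # recommended dev-test minimum stated above


def is_valid_compute_name(name: str) -> bool:
    """Return True if the name satisfies the format rules (hypothetical helper)."""
    return _NAME_PATTERN.fullmatch(name) is not None


def total_vcpus(node_count: int, cores_per_node: int) -> int:
    """Virtual CPUs = number of nodes multiplied by cores per node."""
    return node_count * cores_per_node


def meets_vcpu_minimum(node_count: int, cores_per_node: int,
                       production: bool = True) -> bool:
    """Compare the cluster's total virtual CPUs against the relevant minimum."""
    minimum = MIN_PRODUCTION_VCPUS if production else MIN_DEV_TEST_VCPUS
    return total_vcpus(node_count, cores_per_node) >= minimum
```

For the example in the list, meets_vcpu_minimum(3, 4) passes for a production cluster because three Standard_D3_v2 nodes provide 12 virtual cores, while two nodes would fall short.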

Azure Kubernetes Service version

Azure Kubernetes Service allows you to create a cluster using a variety of Kubernetes versions. For more information on available versions, see supported Kubernetes versions in Azure Kubernetes Service.

When you create an Azure Kubernetes Service cluster using one of the following methods, you cannot choose the version of the cluster that is created:

  • Azure Machine Learning studio, or the Azure Machine Learning section of the Azure portal.
  • Machine Learning extension for Azure CLI.
  • Azure Machine Learning SDK.

These methods of creating an AKS cluster use the default version of the cluster. The default version changes over time as new Kubernetes versions become available.

When attaching an existing AKS cluster, we support all currently supported AKS versions.

Note

There may be edge cases where you have an older cluster that is no longer supported. In this case, the attach operation returns an error and lists the currently supported versions.

You can attach preview versions. Preview functionality is provided without a service level agreement, and it is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. Support for using preview versions may be limited.

Available and default versions

To find the available and default AKS versions, use the Azure CLI command az aks get-versions. For example, the following command returns the versions available in the China East region:

az aks get-versions -l chinaeast -o table

The output of this command is similar to the following text:

KubernetesVersion    Upgrades
-------------------  ----------------------------------------
1.18.6(preview)      None available
1.18.4(preview)      1.18.6(preview)
1.17.9               1.18.4(preview), 1.18.6(preview)
1.17.7               1.17.9, 1.18.4(preview), 1.18.6(preview)
1.16.13              1.17.7, 1.17.9
1.16.10              1.16.13, 1.17.7, 1.17.9
1.15.12              1.16.10, 1.16.13
1.15.11              1.15.12, 1.16.10, 1.16.13

To find the default version that is used when creating a cluster through Azure Machine Learning, you can use the --query parameter to select the default version:

az aks get-versions -l chinaeast --query 'orchestrators[?default == `true`].orchestratorVersion' -o table

The output of this command is similar to the following text:

Result
--------
1.16.13

To check the available versions programmatically, use the Container Service Client - List Orchestrators REST API. To find the available versions, look at the entries where orchestratorType is Kubernetes. The associated orchestratorVersion entries contain the available versions that can be attached to your workspace.

To find the default version that is used when creating a cluster through Azure Machine Learning, find the entry where orchestratorType is Kubernetes and default is true. The associated orchestratorVersion value is the default version. The following JSON snippet shows an example entry:

...
 {
        "orchestratorType": "Kubernetes",
        "orchestratorVersion": "1.16.13",
        "default": true,
        "upgrades": [
          {
            "orchestratorType": "",
            "orchestratorVersion": "1.17.7",
            "isPreview": false
          }
        ]
      },
...
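Once the response has been parsed into Python objects, the lookup described above reduces to a simple filter. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input is assumed to be the list of orchestrator entries already extracted from the response, and the second sample entry is illustrative data, not real API output.

```python
from typing import List, Optional


def available_kubernetes_versions(orchestrators: List[dict]) -> List[str]:
    """Collect orchestratorVersion from every entry whose type is Kubernetes."""
    return [
        entry["orchestratorVersion"]
        for entry in orchestrators
        if entry.get("orchestratorType") == "Kubernetes"
    ]


def default_kubernetes_version(orchestrators: List[dict]) -> Optional[str]:
    """Return the version of the Kubernetes entry marked default, if any."""
    for entry in orchestrators:
        if (entry.get("orchestratorType") == "Kubernetes"
                and entry.get("default") is True):
            return entry.get("orchestratorVersion")
    return None


# Illustrative entries shaped like the JSON snippet above.
sample = [
    {"orchestratorType": "Kubernetes", "orchestratorVersion": "1.16.13",
     "default": True, "upgrades": []},
    {"orchestratorType": "Kubernetes", "orchestratorVersion": "1.17.7",
     "upgrades": []},
]

print(default_kubernetes_version(sample))
```

Entries without a default key are treated as non-default, which is why the lookup uses get() rather than direct indexing.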

Create a new AKS cluster

Time estimate: Approximately 10 minutes.

Creating or attaching an AKS cluster is a one-time process for your workspace. You can reuse this cluster for multiple deployments. If you delete the cluster or the resource group that contains it, you must create a new cluster the next time you need to deploy. You can have multiple AKS clusters attached to your workspace.

The following example demonstrates how to create a new AKS cluster using the SDK:

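from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

# Load the workspace (assumes a workspace config.json is available locally)
ws = Workspace.from_config()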

# Use the default configuration (you can also provide parameters to customize this).
# For example, to create a dev/test cluster, use:
# prov_config = AksCompute.provisioning_configuration(cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
prov_config = AksCompute.provisioning_configuration()

# Example configuration to use an existing virtual network
# prov_config.vnet_name = "mynetwork"
# prov_config.vnet_resourcegroup_name = "mygroup"
# prov_config.subnet_name = "default"
# prov_config.service_cidr = "10.0.0.0/16"
# prov_config.dns_service_ip = "10.0.0.10"
# prov_config.docker_bridge_cidr = "172.17.0.1/16"

aks_name = 'myaks'
# Create the cluster
aks_target = ComputeTarget.create(workspace = ws,
                                    name = aks_name,
                                    provisioning_configuration = prov_config)

# Wait for the create process to complete
aks_target.wait_for_completion(show_output = True)

For more information on the classes, methods, and parameters used in this example, see the reference documentation for AksCompute and ComputeTarget.

Attach an existing AKS cluster

Time estimate: Approximately 5 minutes.

If you already have an AKS cluster in your Azure subscription, and it is version 1.17 or lower, you can use it to deploy your image.
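A version string such as those printed by az aks get-versions can be compared against this limit before attempting to attach. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that "1.17 or lower" refers to the major.minor part of the version, so any 1.17.x patch release qualifies.

```python
from typing import Tuple

# Assumption: the limit applies to the major.minor portion of the version.
MAX_ATTACHABLE = (1, 17)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like '1.17.9' or '1.18.4(preview)'."""
    core = version.split("(")[0].strip()  # drop a trailing "(preview)" marker
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def is_attachable_version(version: str,
                          maximum: Tuple[int, int] = MAX_ATTACHABLE) -> bool:
    """Return True if the version's major.minor does not exceed the maximum."""
    return parse_major_minor(version) <= maximum
```

Tuples compare element by element, so (1, 16) and (1, 17) pass while (1, 18) does not.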

Tip

The existing AKS cluster can be in an Azure region other than that of your Azure Machine Learning workspace.

Warning

Do not create multiple, simultaneous attachments to the same AKS cluster from your workspace, for example by attaching one AKS cluster to a workspace under two different names. Each new attachment breaks the previously existing attachments.

If you want to re-attach an AKS cluster, for example to change TLS or another cluster configuration setting, you must first remove the existing attachment by using AksCompute.detach().

For more information on creating an AKS cluster using the Azure CLI or portal, see the Azure Kubernetes Service documentation.

The following example demonstrates how to attach an existing AKS cluster to your workspace:

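from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

# Load the workspace (assumes a workspace config.json is available locally)
ws = Workspace.from_config()
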
# Set the resource group that contains the AKS cluster and the cluster name
resource_group = 'myresourcegroup'
cluster_name = 'myexistingcluster'

# Attach the cluster to your workspace. If the cluster has fewer than 12 virtual CPUs, use the following instead:
# attach_config = AksCompute.attach_configuration(resource_group = resource_group,
#                                         cluster_name = cluster_name,
#                                         cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
attach_config = AksCompute.attach_configuration(resource_group = resource_group,
                                         cluster_name = cluster_name)
aks_target = ComputeTarget.attach(ws, 'myaks', attach_config)

# Wait for the attach process to complete
aks_target.wait_for_completion(show_output = True)

For more information on the classes, methods, and parameters used in this example, see the reference documentation for AksCompute and ComputeTarget.

Next steps