Create and attach an Azure Kubernetes Service cluster

Azure Machine Learning can deploy trained machine learning models to Azure Kubernetes Service. However, you must first either create an Azure Kubernetes Service (AKS) cluster from your Azure ML workspace, or attach an existing AKS cluster. This article provides information on both creating and attaching a cluster.

Prerequisites

Limitations

  • If you need a Standard Load Balancer (SLB) deployed in your cluster instead of a Basic Load Balancer (BLB), create the cluster in the AKS portal, CLI, or SDK and then attach it to the AML workspace.

  • If you have an Azure Policy that restricts the creation of public IP addresses, AKS cluster creation will fail. AKS requires a public IP for egress traffic. The egress traffic article also provides guidance on locking down egress traffic from the cluster through the public IP, except for a few fully qualified domain names. There are two ways to enable a public IP:

    • The cluster can use the public IP created by default with the BLB or SLB, or
    • The cluster can be created without a public IP, and a public IP is then configured on a firewall with a user-defined route. For more information, see Customize cluster egress with a user-defined route.

    The AML control plane does not talk to this public IP. It talks to the AKS control plane for deployments.

  • If you attach an AKS cluster that has authorized IP ranges enabled for access to the API server, enable the AML control plane IP ranges for the AKS cluster. The AML control plane is deployed across paired regions and deploys inference pods on the AKS cluster. Without access to the API server, the inference pods cannot be deployed. Use the IP ranges for both paired regions when enabling the IP ranges in an AKS cluster.

    Authorized IP ranges only work with Standard Load Balancer.

  • When attaching an AKS cluster, it must be in the same Azure subscription as your Azure Machine Learning workspace.

  • If you want to use a private AKS cluster (using Azure Private Link), you must create the cluster first, and then attach it to the workspace. For more information, see Create a private Azure Kubernetes Service cluster.

  • The compute name for the AKS cluster MUST be unique within your Azure ML workspace.

    • The name is required and must be between 3 and 24 characters long.
    • Valid characters are upper and lower case letters, digits, and the - character.
    • The name must start with a letter.
    • The name must be unique across all existing computes within an Azure region. You will see an alert if the name you choose is not unique.
  • If you want to deploy models to GPU nodes or FPGA nodes (or any specific SKU), you must create a cluster with that specific SKU. Creating a secondary node pool in an existing cluster and deploying models in the secondary node pool is not supported.

  • When creating or attaching a cluster, you can select whether to create the cluster for dev-test or production. If you want to create an AKS cluster for development, validation, and testing instead of production, set the cluster purpose to dev-test. If you do not specify the cluster purpose, a production cluster is created.

    Important

    A dev-test cluster is not suitable for production-level traffic and may increase inference times. Dev-test clusters also do not guarantee fault tolerance.

  • When creating or attaching a cluster, if the cluster will be used for production, it must contain at least 12 virtual CPUs. The number of virtual CPUs is the number of nodes in the cluster multiplied by the number of cores provided by the selected VM size. For example, if you use a VM size of "Standard_D3_v2", which has 4 virtual cores, you should select 3 or greater as the number of nodes.

    For a dev-test cluster, we recommend at least 2 virtual CPUs.

  • The Azure Machine Learning SDK does not support scaling an AKS cluster. To scale the nodes in the cluster, use the UI for your AKS cluster in the Azure Machine Learning studio. You can only change the node count, not the VM size of the cluster. For more information on scaling the nodes in an AKS cluster, see the AKS documentation.
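
The naming rules and the production vCPU minimum listed above can be checked locally before you request a cluster. The following is a minimal sketch in plain Python. The helper names `is_valid_compute_name` and `min_nodes_for_production` are hypothetical and are not part of the Azure Machine Learning SDK, and the sketch cannot check uniqueness, which only the service can verify.

```python
import math
import re

# 3 to 24 characters; letters, digits, and "-" only; must start with a letter.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,23}")

# Minimum number of virtual CPUs for a production cluster, per the limitation above.
PRODUCTION_MIN_VCPUS = 12


def is_valid_compute_name(name: str) -> bool:
    """Return True if the name satisfies the length and character rules."""
    return _NAME_PATTERN.fullmatch(name) is not None


def min_nodes_for_production(cores_per_node: int) -> int:
    """Return the smallest node count that reaches the production vCPU minimum."""
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    return math.ceil(PRODUCTION_MIN_VCPUS / cores_per_node)


print(is_valid_compute_name("myaks"))   # True
print(is_valid_compute_name("1aks"))    # False: starts with a digit
print(min_nodes_for_production(4))      # 3, matching the Standard_D3_v2 example
```

Rounding up with `math.ceil` matters for VM sizes whose core count does not divide 12 evenly.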

Azure Kubernetes Service version

Azure Kubernetes Service allows you to create a cluster using a variety of Kubernetes versions. For more information on available versions, see supported Kubernetes versions in Azure Kubernetes Service.

When creating an Azure Kubernetes Service cluster using one of the following methods, you cannot choose the version of the cluster that is created:

  • Azure Machine Learning studio, or the Azure Machine Learning section of the Azure portal.
  • Machine Learning extension for Azure CLI.
  • Azure Machine Learning SDK.

These methods of creating an AKS cluster use the default version of the cluster. The default version changes over time as new Kubernetes versions become available.

When attaching an existing AKS cluster, we support all currently supported AKS versions.

Note

There may be edge cases where you have an older cluster that is no longer supported. In this case, the attach operation returns an error and lists the currently supported versions.

You can attach preview versions. Preview functionality is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. Support for using preview versions may be limited.

Available and default versions

To find the available and default AKS versions, use the Azure CLI command az aks get-versions. For example, the following command returns the versions available in the China East region:

az aks get-versions -l chinaeast -o table

The output of this command is similar to the following text:

KubernetesVersion    Upgrades
-------------------  ----------------------------------------
1.18.6(preview)      None available
1.18.4(preview)      1.18.6(preview)
1.17.9               1.18.4(preview), 1.18.6(preview)
1.17.7               1.17.9, 1.18.4(preview), 1.18.6(preview)
1.16.13              1.17.7, 1.17.9
1.16.10              1.16.13, 1.17.7, 1.17.9
1.15.12              1.16.10, 1.16.13
1.15.11              1.15.12, 1.16.10, 1.16.13

To find the default version that is used when creating a cluster through Azure Machine Learning, you can use the --query parameter to select the default version:

az aks get-versions -l chinaeast --query 'orchestrators[?default == `true`].orchestratorVersion' -o table

The output of this command is similar to the following text:

Result
--------
1.16.13

If you'd like to programmatically check the available versions, use the Container Service Client - List Orchestrators REST API. To find the available versions, look at the entries where orchestratorType is Kubernetes. The associated orchestratorVersion entries contain the available versions that can be attached to your workspace.

To find the default version that is used when creating a cluster through Azure Machine Learning, find the entry where orchestratorType is Kubernetes and default is true. The associated orchestratorVersion value is the default version. The following JSON snippet shows an example entry:

...
 {
        "orchestratorType": "Kubernetes",
        "orchestratorVersion": "1.16.13",
        "default": true,
        "upgrades": [
          {
            "orchestratorType": "",
            "orchestratorVersion": "1.17.7",
            "isPreview": false
          }
        ]
      },
...
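
Once the response is parsed, the lookup described above is a simple filter over the orchestrator entries. The sketch below assumes you have already extracted the list of entries from the response body (for example with `json.loads`). The function names and the sample data are illustrative only, modeled on the snippet above rather than copied from a real response.

```python
from typing import List, Optional


def available_versions(orchestrators: List[dict]) -> List[str]:
    """Return the versions of all entries whose orchestratorType is Kubernetes."""
    return [
        entry["orchestratorVersion"]
        for entry in orchestrators
        if entry.get("orchestratorType") == "Kubernetes"
    ]


def find_default_version(orchestrators: List[dict]) -> Optional[str]:
    """Return the version of the Kubernetes entry marked as default, if any."""
    for entry in orchestrators:
        if entry.get("orchestratorType") == "Kubernetes" and entry.get("default") is True:
            return entry.get("orchestratorVersion")
    return None


# Illustrative entries only; a real response contains more fields and versions.
sample_entries = [
    {"orchestratorType": "Kubernetes", "orchestratorVersion": "1.15.12", "upgrades": []},
    {
        "orchestratorType": "Kubernetes",
        "orchestratorVersion": "1.16.13",
        "default": True,
        "upgrades": [
            {"orchestratorType": "", "orchestratorVersion": "1.17.7", "isPreview": False}
        ],
    },
]

print(available_versions(sample_entries))    # ['1.15.12', '1.16.13']
print(find_default_version(sample_entries))  # 1.16.13
```

Using `entry.get("default") is True` tolerates entries where the `default` field is absent or null.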

Create a new AKS cluster

Time estimate: Approximately 10 minutes.

Creating or attaching an AKS cluster is a one-time process for your workspace. You can reuse this cluster for multiple deployments. If you delete the cluster or the resource group that contains it, you must create a new cluster the next time you need to deploy. You can have multiple AKS clusters attached to your workspace.

The following example demonstrates how to create a new AKS cluster using the SDK:

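from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

# Load the workspace (assumes a config.json for your workspace is available locally)
ws = Workspace.from_config()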

# Use the default configuration (you can also provide parameters to customize this).
# For example, to create a dev/test cluster, use:
# prov_config = AksCompute.provisioning_configuration(cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
prov_config = AksCompute.provisioning_configuration()

# Example configuration to use an existing virtual network
# prov_config.vnet_name = "mynetwork"
# prov_config.vnet_resourcegroup_name = "mygroup"
# prov_config.subnet_name = "default"
# prov_config.service_cidr = "10.0.0.0/16"
# prov_config.dns_service_ip = "10.0.0.10"
# prov_config.docker_bridge_cidr = "172.17.0.1/16"

aks_name = 'myaks'
# Create the cluster
aks_target = ComputeTarget.create(workspace = ws,
                                    name = aks_name,
                                    provisioning_configuration = prov_config)

# Wait for the create process to complete
aks_target.wait_for_completion(show_output = True)

For more information on the classes, methods, and parameters used in this example, see the following reference documents:

Attach an existing AKS cluster

Time estimate: Approximately 5 minutes.

If you already have an AKS cluster in your Azure subscription, and it is version 1.17 or lower, you can use it to deploy your image.
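
When checking a version requirement such as "1.17 or lower" in a script, comparing version strings as text can mislead, because "1.9" sorts after "1.17" alphabetically. The following minimal sketch compares numeric (major, minor) pairs instead. The helper names are hypothetical, and the "(preview)" suffix handling assumes the version format shown in the az aks get-versions output earlier in this article.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Parse '1.17.9' or '1.18.4(preview)' into a numeric (major, minor) pair."""
    core = version.split("(")[0].strip()
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def is_at_most(version: str, ceiling: str = "1.17") -> bool:
    """Return True if version is at or below the ceiling, compared numerically."""
    return parse_major_minor(version) <= parse_major_minor(ceiling)


print(is_at_most("1.17.9"))           # True
print(is_at_most("1.18.4(preview)"))  # False
print(is_at_most("1.9.0"))            # True, although "1.9" > "1.17" as text
```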

Tip

The existing AKS cluster can be in an Azure region other than that of your Azure Machine Learning workspace.

Warning

Do not create multiple, simultaneous attachments to the same AKS cluster from your workspace, for example by attaching one AKS cluster to a workspace under two different names. Each new attachment breaks the previous existing attachments.

If you want to re-attach an AKS cluster, for example to change TLS or another cluster configuration setting, you must first remove the existing attachment by using AksCompute.detach().

For more information on creating an AKS cluster using the Azure CLI or portal, see the following articles:

The following example demonstrates how to attach an existing AKS cluster to your workspace:

from azureml.core.compute import AksCompute, ComputeTarget
# Set the resource group that contains the AKS cluster and the cluster name
resource_group = 'myresourcegroup'
cluster_name = 'myexistingcluster'

# Attach the cluster to your workspace. If the cluster has fewer than 12 virtual CPUs, use the following instead:
# attach_config = AksCompute.attach_configuration(resource_group = resource_group,
#                                         cluster_name = cluster_name,
#                                         cluster_purpose = AksCompute.ClusterPurpose.DEV_TEST)
attach_config = AksCompute.attach_configuration(resource_group = resource_group,
                                         cluster_name = cluster_name)
aks_target = ComputeTarget.attach(ws, 'myaks', attach_config)

# Wait for the attach process to complete
aks_target.wait_for_completion(show_output = True)

For more information on the classes, methods, and parameters used in this example, see the following reference documents:

Create or attach an AKS cluster with TLS termination

When you create or attach an AKS cluster, you can enable TLS termination with the AksCompute.provisioning_configuration() and AksCompute.attach_configuration() configuration objects. Both methods return a configuration object that has an enable_ssl method, which you can use to enable TLS.

The following example shows how to enable TLS termination with automatic TLS certificate generation and configuration, using a Microsoft certificate under the hood.

   from azureml.core.compute import AksCompute, ComputeTarget

   # Enable TLS termination when you create an AKS cluster by using the
   # enable_ssl method of the provisioning configuration object
   provisioning_config = AksCompute.provisioning_configuration()

   # Leaf domain label generates a name using the formula
   # "<leaf-domain-label>######.<azure-region>.cloudapp.azure.net"
   # where "######" is a random series of characters
   provisioning_config.enable_ssl(leaf_domain_label = "contoso")

   # Enable TLS termination when you attach an AKS cluster by using the
   # enable_ssl method of the attach configuration object.
   # resource_group and cluster_name identify the existing AKS cluster.
   attach_config = AksCompute.attach_configuration(resource_group = resource_group,
                                                   cluster_name = cluster_name)

   # The leaf domain label follows the same naming formula as above
   attach_config.enable_ssl(leaf_domain_label = "contoso")


The following example shows how to enable TLS termination with a custom certificate and custom domain name. With a custom domain and certificate, you must update your DNS record to point to the IP address of the scoring endpoint. For details, see Update your DNS.

   from azureml.core.compute import AksCompute, ComputeTarget

   # Enable TLS termination with custom certificate and custom domain when creating an AKS cluster
   
   provisioning_config.enable_ssl(ssl_cert_pem_file="cert.pem",
                                        ssl_key_pem_file="key.pem", ssl_cname="www.contoso.com")
    
   # Enable TLS termination with custom certificate and custom domain when attaching an AKS cluster

   attach_config.enable_ssl(ssl_cert_pem_file="cert.pem",
                                        ssl_key_pem_file="key.pem", ssl_cname="www.contoso.com")


Note

For more information about how to secure model deployments on an AKS cluster, see Use TLS to secure a web service through Azure Machine Learning.

Create or attach an AKS cluster to use Internal Load Balancer with private IP

When you create or attach an AKS cluster, you can configure the cluster to use an Internal Load Balancer. With an Internal Load Balancer, scoring endpoints for your deployments to AKS use a private IP within the virtual network. The following code snippets show how to configure an Internal Load Balancer for an AKS cluster.

   
   from azureml.core.compute.aks import AksUpdateConfiguration
   from azureml.core.compute import AksCompute, ComputeTarget
   
   # When you create an AKS cluster, you can specify Internal Load Balancer to be created with provisioning_config object
   provisioning_config = AksCompute.provisioning_configuration(load_balancer_type = 'InternalLoadBalancer')

   # when you attach an AKS cluster, you can update the cluster to use internal load balancer after attach
   aks_target = AksCompute(ws,"myaks")

   # Change to the name of the subnet that contains AKS
   subnet_name = "default"
   # Update AKS configuration to use an internal load balancer
   update_config = AksUpdateConfiguration(None, "InternalLoadBalancer", subnet_name)
   aks_target.update(update_config)
   # Wait for the operation to complete
   aks_target.wait_for_completion(show_output = True)
   
   

Important

Azure Machine Learning does not support TLS termination with Internal Load Balancer. An Internal Load Balancer has a private IP, and that private IP could be on another network and the certificate could be reused.

Note

For more information about how to secure the inferencing environment, see Secure an Azure Machine Learning Inferencing Environment.

Detach an AKS cluster

To detach a cluster from your workspace, use one of the following methods:

Warning

Using the Azure Machine Learning studio, SDK, or the Azure CLI extension for machine learning to detach an AKS cluster does not delete the AKS cluster. To delete the cluster, see Use the Azure CLI with AKS.

aks_target.detach()

Troubleshooting

Update the cluster

Updates to Azure Machine Learning components installed in an Azure Kubernetes Service cluster must be applied manually.

You can apply these updates by detaching the cluster from the Azure Machine Learning workspace, and then reattaching the cluster to the workspace. If TLS is enabled in the cluster, you need to supply the TLS/SSL certificate and private key when reattaching the cluster.

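from azureml.core.compute import AksCompute, ComputeTarget

# ws, clusterWorkspaceName, resourceGroup, kubernetesClusterName, and sslCname
# are assumed to be defined earlier in your script.
compute_target = ComputeTarget(workspace=ws, name=clusterWorkspaceName)
compute_target.detach()
compute_target.wait_for_completion(show_output=True)

attach_config = AksCompute.attach_configuration(resource_group=resourceGroup, cluster_name=kubernetesClusterName)

# If TLS/SSL is enabled, supply the certificate, private key, and CNAME again.
attach_config.enable_ssl(
    ssl_cert_pem_file="cert.pem",
    ssl_key_pem_file="key.pem",
    ssl_cname=sslCname)

attach_config.validate_configuration()

# Reattach under the same compute name that was detached above
compute_target = ComputeTarget.attach(workspace=ws, name=clusterWorkspaceName, attach_configuration=attach_config)
compute_target.wait_for_completion(show_output=True)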

If you no longer have the TLS/SSL certificate and private key, or you are using a certificate generated by Azure Machine Learning, you can retrieve the files before detaching the cluster by connecting to the cluster using kubectl and retrieving the secret azuremlfessl.

kubectl get secret/azuremlfessl -o yaml

Note

Kubernetes stores secrets in base-64 encoded format. You need to base-64 decode the cert.pem and key.pem components of the secret before providing them to attach_config.enable_ssl.
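
The decoding step can be sketched in Python as follows. This assumes the secret was retrieved as JSON (for example with the `-o json` output format instead of `-o yaml`) and parsed into a dictionary whose `data` map holds the `cert.pem` and `key.pem` entries. The helper names and the placeholder secret are illustrative only.

```python
import base64
from pathlib import Path
from typing import Dict, List, Sequence


def decode_tls_secret(secret: dict, keys: Sequence[str] = ("cert.pem", "key.pem")) -> Dict[str, bytes]:
    """Base-64 decode the requested entries from a secret's 'data' map."""
    data = secret.get("data", {})
    missing = [key for key in keys if key not in data]
    if missing:
        raise KeyError(f"secret is missing expected entries: {missing}")
    return {key: base64.b64decode(data[key]) for key in keys}


def write_pem_files(decoded: Dict[str, bytes], directory: str) -> List[str]:
    """Write each decoded entry to a file named after its key; return the paths."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, content in decoded.items():
        path = out_dir / name
        path.write_bytes(content)
        paths.append(str(path))
    return paths


# Placeholder secret for illustration; real values come from the cluster.
sample_secret = {
    "data": {
        "cert.pem": base64.b64encode(b"dummy-cert").decode("ascii"),
        "key.pem": base64.b64encode(b"dummy-key").decode("ascii"),
    }
}

decoded = decode_tls_secret(sample_secret)
print(decoded["cert.pem"])  # b'dummy-cert'
```

The paths returned by `write_pem_files` can then be passed as the `ssl_cert_pem_file` and `ssl_key_pem_file` arguments when reattaching the cluster.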

Webservice failures

Many webservice failures in AKS can be debugged by connecting to the cluster using kubectl. You can get the kubeconfig.json for an AKS cluster by running:

az aks get-credentials -g <rg> -n <aks cluster name>

Next steps