AKS troubleshooting

When you create or manage Azure Kubernetes Service (AKS) clusters, you might occasionally come across problems. This article details some common problems and troubleshooting steps.

In general, where do I find information about debugging Kubernetes problems?

Try the official guide to troubleshooting Kubernetes clusters. There's also a troubleshooting guide, published by an Azure engineer, for troubleshooting pods, nodes, clusters, and other features.

I'm getting a "quota exceeded" error during creation or upgrade. What should I do?

Request more cores.

I'm getting an insufficientSubnetSize error while deploying an AKS cluster with advanced networking. What should I do?

This error indicates that a subnet in use for a cluster no longer has available IPs within its CIDR for successful resource assignment. For Kubenet clusters, the requirement is sufficient IP space for each node in the cluster. For Azure CNI clusters, the requirement is sufficient IP space for each node and pod in the cluster. Read more about the design of Azure CNI to assign IPs to pods.

The following three (3) cases cause an insufficient subnet size error:

  1. AKS Scale or AKS Node pool scale

    1. If using Kubenet, when the number of free IPs in the subnet is less than the number of new nodes requested.
    2. If using Azure CNI, when the number of free IPs in the subnet is less than the number of nodes requested times (*) the node pool's --max-pod value.
  2. AKS Upgrade or AKS Node pool upgrade

    1. If using Kubenet, when the number of free IPs in the subnet is less than the number of buffer nodes needed to upgrade.
    2. If using Azure CNI, when the number of free IPs in the subnet is less than the number of buffer nodes needed to upgrade times (*) the node pool's --max-pod value.
  3. AKS create or AKS Node pool add

    1. If using Kubenet, when the number of free IPs in the subnet is less than the number of nodes requested for the node pool.
    2. If using Azure CNI, when the number of free IPs in the subnet is less than the number of nodes requested times (*) the node pool's --max-pod value.

The following mitigation can be taken by creating new subnets. The permission to create a new subnet is required for mitigation, because an existing subnet's CIDR range can't be updated.

  1. Rebuild a new subnet with a larger CIDR range sufficient for operation goals (a CLI sketch follows these steps):
    1. Create a new subnet with a new desired non-overlapping range.
    2. Create a new node pool on the new subnet.
    3. Drain pods from the old node pool residing in the old subnet to be replaced.
    4. Delete the old subnet and old node pool.
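
A minimal Azure CLI sketch of these steps, assuming hypothetical names (myResourceGroup, myVnet, newSubnet, newpool, oldpool) and an Azure CNI cluster; adjust the address range, node counts, and --max-pods to your sizing goals:

# 1. Create a new, larger subnet in the virtual network (the range must not overlap existing subnets).
az network vnet subnet create --resource-group myResourceGroup --vnet-name myVnet \
  --name newSubnet --address-prefixes 10.1.0.0/16

# 2. Add a node pool that uses the new subnet.
az aks nodepool add --resource-group myResourceGroup --cluster-name myAKSCluster \
  --name newpool --node-count 3 \
  --vnet-subnet-id $(az network vnet subnet show --resource-group myResourceGroup \
      --vnet-name myVnet --name newSubnet --query id -o tsv)

# 3. Drain each node in the old pool so pods reschedule onto the new pool.
kubectl drain <old-node-name> --ignore-daemonsets --delete-emptydir-data   # --delete-local-data on older kubectl

# 4. Remove the old node pool, then the old subnet.
az aks nodepool delete --resource-group myResourceGroup --cluster-name myAKSCluster --name oldpool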

My pod is stuck in CrashLoopBackOff mode. What should I do?

There might be various reasons for the pod being stuck in that mode. You might look into the following (example commands follow the list):

  • The pod itself, by using kubectl describe pod <pod-name>.
  • The logs, by using kubectl logs <pod-name>.
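
For example (the pod name and namespace are placeholders), the Events section of describe and the previous container's logs usually show why the container keeps restarting:

kubectl describe pod <pod-name> --namespace <namespace>
kubectl logs <pod-name> --namespace <namespace> --previous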

For more information on how to troubleshoot pod problems, see Debug applications.

I'm receiving TCP timeouts when using kubectl or other third-party tools connecting to the API server

AKS has HA control planes that scale vertically according to the number of cores to ensure its Service Level Objectives (SLOs) and Service Level Agreements (SLAs). If you're experiencing connections timing out, check the following:

  • Are all your API commands timing out consistently or only a few? If it's only a few, your tunnelfront pod or aks-link pod, responsible for node -> control plane communication, might not be in a running state. Make sure the nodes hosting this pod aren't over-utilized or under stress (see the example check after this list). Consider moving them to their own system node pool.
  • Have you opened all required ports, FQDNs, and IPs noted in the AKS restrict egress traffic docs? Otherwise, several command calls can fail.
  • Is your current IP covered by API IP Authorized Ranges? If you're using this feature and your IP is not included in the ranges, your calls will be blocked.
  • Do you have a client or application leaking calls to the API server? Make sure to use watches instead of frequent get calls and that your third-party applications aren't leaking such calls. For example, a bug in the Istio mixer causes a new API server watch connection to be created every time a secret is read internally. Because this behavior happens at a regular interval, watch connections quickly accumulate and eventually cause the API server to become overloaded no matter the scaling pattern. See https://github.com/istio/istio/issues/19481.
  • Do you have many releases in your helm deployments? This scenario can cause both tiller to use too much memory on the nodes and a large number of configmaps, which can cause unnecessary spikes on the API server. Consider configuring --history-max at helm init and leveraging the new Helm 3.
  • Is internal traffic between nodes being blocked?
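
A quick, non-exhaustive sketch of the tunnel pod and node-pressure checks mentioned above (pod names vary by cluster version):

# Confirm the tunnelfront or aks-link pod is Running and see which node hosts it.
kubectl get pods --namespace kube-system -o wide | grep -E 'tunnelfront|aks-link'

# Check whether that node is over-utilized.
kubectl top nodes
kubectl describe node <node-name> | grep -A5 'Allocated resources'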

I'm receiving TCP timeouts, such as dial tcp <Node_IP>:10250: i/o timeout

These timeouts may be related to internal traffic between nodes being blocked. Verify that this traffic isn't being blocked, for example by network security groups on the subnet for your cluster's nodes.

I'm trying to enable Kubernetes role-based access control (Kubernetes RBAC) on an existing cluster. How can I do that?

Enabling Kubernetes role-based access control (Kubernetes RBAC) on existing clusters isn't supported at this time; it must be set when creating new clusters. Kubernetes RBAC is enabled by default when using the CLI, the portal, or an API version later than 2020-03-01.

I can't get logs by using kubectl logs or I can't connect to the API server. I'm getting "Error from server: error dialing backend: dial tcp…". What should I do?

Ensure ports 22, 9000, and 1194 are open to connect to the API server. Check whether the tunnelfront or aks-link pod is running in the kube-system namespace using the kubectl get pods --namespace kube-system command. If it isn't, force deletion of the pod and it will restart.
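
A sketch of that check and restart, with the pod name as a placeholder:

kubectl get pods --namespace kube-system | grep -E 'tunnelfront|aks-link'
kubectl delete pod <tunnelfront-or-aks-link-pod-name> --namespace kube-system --force --grace-period=0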

I'm getting "tls: client offered only unsupported versions" from my client when connecting to the AKS API. What should I do?

The minimum supported TLS version in AKS is TLS 1.2.

I'm trying to upgrade or scale and am getting a "Changing property 'imageReference' is not allowed" error. How do I fix this problem?

You might be getting this error because you've modified the tags in the agent nodes inside the AKS cluster. Modifying or deleting tags and other properties of resources in the MC_* resource group can lead to unexpected results. Altering the resources under the MC_* group in the AKS cluster breaks the service-level objective (SLO).

I'm receiving errors that my cluster is in failed state and upgrading or scaling will not work until it is fixed

This troubleshooting assistance is directed from aks-cluster-failed

This error occurs when clusters enter a failed state for multiple reasons. Follow the steps below to resolve your cluster's failed state before retrying the previously failed operation:

  1. Until the cluster is out of failed state, upgrade and scale operations won't succeed. Common root issues and resolutions include:
    • Scaling with insufficient compute (CRP) quota. To resolve, first scale your cluster back to a stable goal state within quota (see the example after this list). Then follow these steps to request a compute quota increase before trying to scale up again beyond initial quota limits.
    • Scaling a cluster with advanced networking and insufficient subnet (networking) resources. To resolve, first scale your cluster back to a stable goal state within quota. Then follow these steps to request a resource quota increase before trying to scale up again beyond initial quota limits.
  2. Once the underlying cause for the upgrade failure is resolved, your cluster should be in a succeeded state. Once a succeeded state is verified, retry the original operation.
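
For example, a scale-down like the following can bring the cluster back to a stable goal state within quota before you request an increase (resource names and counts are placeholders):

# Scale the cluster's default pool back to a node count that fits the current quota.
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 2

# Or scale a specific node pool.
az aks nodepool scale --resource-group myResourceGroup --cluster-name myAKSCluster \
  --name nodepool1 --node-count 2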

I'm receiving errors when trying to upgrade or scale that state my cluster is being upgraded or has failed upgrade

This troubleshooting assistance is directed from aks-pending-upgrade

You can't have a cluster or node pool simultaneously upgrade and scale. Instead, each operation type must complete on the target resource before the next request on that same resource. As a result, operations are limited when active upgrade or scale operations are occurring or attempted.

To help diagnose the issue, run az aks show -g myResourceGroup -n myAKSCluster -o table to retrieve detailed status on your cluster. Based on the result:

  • If the cluster is actively upgrading, wait until the operation finishes. If it succeeded, retry the previously failed operation.
  • If the cluster has failed upgrade, follow the steps outlined in the previous section.

Can I move my cluster to a different subscription or my subscription with my cluster to a new tenant?

If you've moved your AKS cluster to a different subscription, or moved the cluster's subscription to a new tenant, the cluster won't function because of missing cluster identity permissions. AKS doesn't support moving clusters across subscriptions or tenants because of this constraint.

I'm receiving errors trying to use features that require virtual machine scale sets

This troubleshooting assistance is directed from aka.ms/aks-vmss-enablement

You may receive errors that indicate your AKS cluster isn't on a virtual machine scale set, such as the following example:

AgentPool <agentpoolname> has set auto scaling as enabled but isn't on Virtual Machine Scale Sets

Features such as the cluster autoscaler or multiple node pools require virtual machine scale sets as the vm-set-type.

Follow the Before you begin steps in the appropriate doc to correctly create an AKS cluster.
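
As a sketch, creating a cluster with virtual machine scale sets and the cluster autoscaler enabled might look like the following (names and counts are placeholders):

az aks create --resource-group myResourceGroup --name myAKSCluster \
  --node-count 1 --vm-set-type VirtualMachineScaleSets \
  --enable-cluster-autoscaler --min-count 1 --max-count 3 \
  --generate-ssh-keys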

What naming restrictions are enforced for AKS resources and parameters?

This troubleshooting assistance is directed from aka.ms/aks-naming-rules

Naming restrictions are implemented by both the Azure platform and AKS. If a resource name or parameter breaks one of these restrictions, an error is returned that asks you to provide a different input. The following common naming guidelines apply:

  • Cluster names must be 1-63 characters. The only allowed characters are letters, numbers, dashes, and underscores. The first and last character must be a letter or a number.

  • The AKS Node/MC_ resource group name combines the resource group name and resource name. The autogenerated syntax of MC_resourceGroupName_resourceName_AzureRegion must be no greater than 80 characters. If needed, reduce the length of your resource group name or AKS cluster name.

  • The dnsPrefix must start and end with alphanumeric values and must be between 1-54 characters. Valid characters include alphanumeric values and hyphens (-). The dnsPrefix can't include special characters such as a period (.).

  • AKS node pool names must be all lowercase and be 1-11 characters for Linux node pools and 1-6 characters for Windows node pools. The name must start with a letter, and the only allowed characters are letters and numbers.

  • The admin-username, which sets the administrator username for Linux nodes, must start with a letter, may only contain letters, numbers, hyphens, and underscores, and has a maximum length of 64 characters.

I'm receiving errors when trying to create, update, scale, delete, or upgrade a cluster: the operation is not allowed because another operation is in progress.

This troubleshooting assistance is directed from aks-pending-operation

Cluster operations are limited when a previous operation is still in progress. To retrieve a detailed status of your cluster, use the az aks show -g myResourceGroup -n myAKSCluster -o table command. Use your own resource group and AKS cluster name as needed.

Take the appropriate action based on the output of the cluster status.

I received an error saying my service principal wasn't found or is invalid when I try to create a new cluster.

When creating an AKS cluster, a service principal or managed identity is required to create resources on your behalf. AKS can automatically create a new service principal at cluster creation time or receive an existing one. When using an automatically created one, Azure Active Directory needs to propagate it to every region so the creation succeeds. When the propagation takes too long, the cluster will fail validation during creation because it can't find an available service principal to do so.

Use the following workarounds for this issue:

  • Use an existing service principal that has already propagated across regions, and pass it into AKS at cluster creation time (see the sketch after this list).
  • If using automation scripts, add time delays between service principal creation and AKS cluster creation.
  • If using the Azure portal, return to the cluster settings during creation and retry the validation page after a few minutes.
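
A sketch of the first workaround, creating the service principal ahead of time and passing it in at cluster creation (names are placeholders; allow time for the new service principal to propagate):

# Create the service principal first and note the appId and password it returns.
az ad sp create-for-rbac --name myAKSClusterSP

# Later, pass those credentials into cluster creation.
az aks create --resource-group myResourceGroup --name myAKSCluster \
  --service-principal <appId> --client-secret <password> --generate-ssh-keys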

I'm getting "AADSTS7000215: Invalid client secret is provided." when using the AKS API. What should I do?

This issue is due to the expiration of the service principal credentials. Update the credentials for your AKS cluster.
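
A minimal sketch of resetting the secret and applying it to the cluster, assuming placeholder resource names:

# Get the service principal used by the cluster.
SP_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query servicePrincipalProfile.clientId -o tsv)

# Reset the secret (older CLI versions use --name instead of --id), then apply it to the cluster.
SP_SECRET=$(az ad sp credential reset --id "$SP_ID" --query password -o tsv)
az aks update-credentials --resource-group myResourceGroup --name myAKSCluster \
  --reset-service-principal --service-principal "$SP_ID" --client-secret "$SP_SECRET"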

I can't access my cluster API from my automation/dev machine/tooling when using API server authorized IP ranges. How do I fix this problem?

To resolve this issue, ensure --api-server-authorized-ip-ranges includes the IP(s) or IP range(s) of the automation/dev/tooling systems being used. Refer to the 'How to find my IP' section in Secure access to the API server using authorized IP address ranges.
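
For example, adding your tooling's public IP to the ranges might look like the following (the ranges shown are placeholders; pass the complete list, because the command replaces the existing value):

az aks update --resource-group myResourceGroup --name myAKSCluster \
  --api-server-authorized-ip-ranges 203.0.113.0/24,198.51.100.5/32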

I'm unable to view resources in the Kubernetes resource viewer in the Azure portal for my cluster configured with API server authorized IP ranges. How do I fix this problem?

The Kubernetes resource viewer requires --api-server-authorized-ip-ranges to include access for the local client computer or IP address range (from which the portal is being browsed). Refer to the 'How to find my IP' section in Secure access to the API server using authorized IP address ranges.

I'm receiving errors after restricting egress traffic

When restricting egress traffic from an AKS cluster, there are required and optional recommended outbound ports / network rules and FQDN / application rules for AKS. If your settings conflict with any of these rules, certain kubectl commands won't work correctly. You may also see errors when creating an AKS cluster.

Verify that your settings aren't conflicting with any of the required or optional recommended outbound ports / network rules and FQDN / application rules.

I'm receiving "429 - Too Many Requests" errors

When a Kubernetes cluster on Azure (AKS or not) scales up and down frequently or uses the cluster autoscaler (CA), those operations can result in a large number of HTTP calls that in turn exceed the assigned subscription quota, leading to failure. The errors will look like:

Service returned an error. Status=429 Code=\"OperationNotAllowed\" Message=\"The server rejected the request because too many requests have been received for this subscription.\" Details=[{\"code\":\"TooManyRequests\",\"message\":\"{\\\"operationGroup\\\":\\\"HighCostGetVMScaleSet30Min\\\",\\\"startTime\\\":\\\"2020-09-20T07:13:55.2177346+00:00\\\",\\\"endTime\\\":\\\"2020-09-20T07:28:55.2177346+00:00\\\",\\\"allowedRequestCount\\\":1800,\\\"measuredRequestCount\\\":2208}\",\"target\":\"HighCostGetVMScaleSet30Min\"}] InnerError={\"internalErrorCode\":\"TooManyRequestsReceived\"}"}

These throttling errors are described in detail here and here.

The recommendation from the AKS Engineering Team is to ensure you are running at least version 1.18.x, which contains many improvements. More details on these improvements can be found here and here.
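
A sketch of checking and moving to a newer version (the version number is a placeholder; pick one from the get-upgrades output):

# See which versions the cluster can upgrade to.
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster -o table

# Upgrade to a 1.18.x or later release from that list.
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.18.14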

Given these throttling errors are measured at the subscription level, they might still happen if:

  • There are third-party applications making GET requests (for example, monitoring applications, and so on). The recommendation is to reduce the frequency of these calls.
  • There are numerous AKS clusters / node pools using virtual machine scale sets. Try to split your clusters across different subscriptions, in particular if you expect them to be very active (for example, an active cluster autoscaler) or have multiple clients (for example, rancher, terraform, and so on).

My cluster's provisioning status changed from Ready to Failed with or without me performing an operation. What should I do?

If your cluster's provisioning status changes from Ready to Failed with or without you performing any operations, but the applications on your cluster continue to run, this issue may be resolved automatically by the service and your applications should not be affected.

If your cluster's provisioning status remains Failed or the applications on your cluster stop working, submit a support request.

My watch is stale or Azure AD Pod Identity NMI is returning status 500

If you're using Azure Firewall as in this example, you may encounter this issue because long-lived TCP connections via firewall application rules currently have a bug (to be resolved in Q1CY21) that causes Go keepalives to be terminated on the firewall. Until this issue is resolved, you can mitigate by adding a network rule (instead of an application rule) for the AKS API server IP.
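
As a sketch only (requires the Azure CLI azure-firewall extension; the firewall name, API server IP, and priority are placeholders), such a network rule might be added like this:

az network firewall network-rule create --resource-group myResourceGroup \
  --firewall-name myFirewall --collection-name aks-api-server --name allow-apiserver \
  --priority 200 --action Allow --protocols TCP \
  --source-addresses '*' --destination-addresses <AKS-API-server-IP> --destination-ports 443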

Azure Storage and AKS Troubleshooting

Failure when setting uid and GID in mountOptions for Azure Disk

Azure Disk uses the ext4,xfs filesystem by default, and mountOptions such as uid=x,gid=x can't be set at mount time. For example, if you tried to set mountOptions uid=999,gid=999, you would see an error like:

Warning  FailedMount             63s                  kubelet, aks-nodepool1-29460110-0  MountVolume.MountDevice failed for volume "pvc-d783d0e4-85a1-11e9-8a90-369885447933" : azureDisk - mountDevice:FormatAndMount failed with mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m436970985 --scope -- mount -t xfs -o dir_mode=0777,file_mode=0777,uid=1000,gid=1000,defaults /dev/disk/azure/scsi1/lun2 /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m436970985
Output: Running scope as unit run-rb21966413ab449b3a242ae9b0fbc9398.scope.
mount: wrong fs type, bad option, bad superblock on /dev/sde,
       missing codepage or helper program, or other error

You can mitigate the issue by doing one of the following options:

  • Configure the security context for a pod by setting uid in runAsUser and gid in fsGroup. For example, the following setting will set the pod to run as root, making it accessible to any file:
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 0
    fsGroup: 0

Note

By default, gid and uid are mounted as root (0). If gid or uid is set as non-root, for example 1000, Kubernetes will use chown to change all directories and files under that disk. This operation can be time consuming and may make mounting the disk very slow.

  • Use chown in initContainers to set GID and UID. For example:
initContainers:
- name: volume-mount
  image: busybox
  command: ["sh", "-c", "chown -R 100:100 /data"]
  volumeMounts:
  - name: <your data volume>
    mountPath: /data

Azure Disk detach failure leading to potential race condition issue and invalid data disk list

When an Azure Disk fails to detach, it will retry up to six times to detach the disk using exponential back-off. It will also hold a node-level lock on the data disk list for about 3 minutes. If the disk list is updated manually during that time, the disk list held by the node-level lock becomes obsolete and causes instability on the node.

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.12                  1.12.9 or later
1.13                  1.13.6 or later
1.14                  1.14.2 or later
1.15 and later        N/A

If you're using a version of Kubernetes that doesn't have the fix for this issue and your node has an obsolete disk list, you can mitigate by detaching all non-existing disks from the VM as a single bulk operation. Individually detaching non-existing disks may fail.

Large number of Azure Disks causes slow attach/detach

When the number of Azure Disk attach/detach operations targeting a single node VM is larger than 10, or larger than 3 when targeting a single virtual machine scale set pool, the operations may be slower than expected because they are done sequentially. This issue is a known limitation and there are no workarounds at this time. There is a User Voice item to support parallel attach/detach beyond this number.

Azure Disk detach failure leading to potential node VM in failed state

In some edge cases, an Azure Disk detach may partially fail and leave the node VM in a failed state.

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.12                  1.12.10 or later
1.13                  1.13.8 or later
1.14                  1.14.4 or later
1.15 and later        N/A

If you're using a version of Kubernetes that doesn't have the fix for this issue and your node is in a failed state, you can mitigate by manually updating the VM status using one of the commands below:

  • For an availability set-based cluster:

    az vm update -n <VM_NAME> -g <RESOURCE_GROUP_NAME>
    
  • For a VMSS-based cluster:

    az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <ID>
    

Azure Files and AKS Troubleshooting

Recommended Kubernetes versions when using Azure Files:

Kubernetes version    Recommended version
1.12                  1.12.6 or later
1.13                  1.13.4 or later
1.14                  1.14.0 or later

What are the default mountOptions when using Azure Files?

Recommended settings:

Kubernetes version    fileMode and dirMode value
1.12.0 - 1.12.1       0755
1.12.2 and later      0777

Mount options can be specified on the storage class object. The following example sets 0777:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=1000
  - gid=1000
  - mfsymlinks
  - nobrl
  - cache=none
parameters:
  skuName: Standard_LRS

Some other useful mountOptions settings:

  • mfsymlinks will make the Azure Files mount (cifs) support symbolic links

  • nobrl will prevent sending byte range lock requests to the server. This setting is necessary for certain applications that break with cifs style mandatory byte range locks. Most cifs servers don't yet support requesting advisory byte range locks. If not using nobrl, applications that break with cifs style mandatory byte range locks may cause error messages similar to:

    Error: SQLITE_BUSY: database is locked
    

Error "could not change permissions" when using Azure Files

When running PostgreSQL on the Azure Files plugin, you may see an error similar to:

initdb: could not change permissions of directory "/var/lib/postgresql/data": Operation not permitted
fixing permissions on existing directory /var/lib/postgresql/data

This error is caused by the Azure Files plugin using the cifs/SMB protocol. When using the cifs/SMB protocol, the file and directory permissions can't be changed after mounting.

To resolve this issue, use subPath together with the Azure Disk plugin.

Note

For the ext3/4 disk types, there is a lost+found directory after the disk is formatted.

Azure Files has high latency compared to Azure Disk when handling many small files

In some cases, such as handling many small files, you may experience higher latency when using Azure Files than when using Azure Disk.

Error when enabling "Allow access from selected network" setting on storage account

If you enable allow access from selected network on a storage account that's used for dynamic provisioning in AKS, you'll get an error when AKS creates a file share:

persistentvolume-controller (combined from similar events): Failed to provision volume with StorageClass "azurefile": failed to create share kubernetes-dynamic-pvc-xxx in account xxx: failed to create file share, err: storage: service returned error: StatusCode=403, ErrorCode=AuthorizationFailure, ErrorMessage=This request is not authorized to perform this operation.

This error occurs because the Kubernetes persistentvolume-controller isn't on the network chosen when the allow access from selected network setting is enabled.

You can mitigate the issue by using static provisioning with Azure Files, as in the sketch below.
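
A minimal static-provisioning sketch (the share and secret names are placeholders; the referenced secret must contain azurestorageaccountname and azurestorageaccountkey, and the file share must already exist):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: azurefile-static
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  azureFile:
    secretName: azure-storage-secret
    shareName: myexistingshare
    readOnly: false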

Azure Files fails to remount in Windows pod

If a Windows pod with an Azure Files mount is deleted and then scheduled to be recreated on the same node, the mount will fail. This failure happens because the New-SmbGlobalMapping command fails when the Azure Files share is already mounted on the node.

For example, you may see an error similar to:

E0118 08:15:52.041014    2112 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/azure-file/42c0ea39-1af9-11e9-8941-000d3af95268-pvc-d7e1b5f9-1af3-11e9-8941-000d3af95268\" (\"42c0ea39-1af9-11e9-8941-000d3af95268\")" failed. No retries permitted until 2019-01-18 08:15:53.0410149 +0000 GMT m=+732.446642701 (durationBeforeRetry 1s). Error: "MountVolume.SetUp failed for volume \"pvc-d7e1b5f9-1af3-11e9-8941-000d3af95268\" (UniqueName: \"kubernetes.io/azure-file/42c0ea39-1af9-11e9-8941-000d3af95268-pvc-d7e1b5f9-1af3-11e9-8941-000d3af95268\") pod \"deployment-azurefile-697f98d559-6zrlf\" (UID: \"42c0ea39-1af9-11e9-8941-000d3af95268\") : azureMount: SmbGlobalMapping failed: exit status 1, only SMB mount is supported now, output: \"New-SmbGlobalMapping : Generic failure \\r\\nAt line:1 char:190\\r\\n+ ... ser, $PWord;New-SmbGlobalMapping -RemotePath $Env:smbremotepath -Cred ...\\r\\n+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\\r\\n    + CategoryInfo          : NotSpecified: (MSFT_SmbGlobalMapping:ROOT/Microsoft/...mbGlobalMapping) [New-SmbGlobalMa \\r\\n   pping], CimException\\r\\n    + FullyQualifiedErrorId : HRESULT 0x80041001,New-SmbGlobalMapping\\r\\n \\r\\n\""

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.12                  1.12.6 or later
1.13                  1.13.4 or later
1.14 and later        N/A

Azure Files mount fails because the storage account key changed

If your storage account key has changed, you may see Azure Files mount failures.

You can mitigate the issue by manually updating the azurestorageaccountkey field in the Azure Files secret with your base64-encoded storage account key.

To encode your storage account key in base64, you can use base64. For example:

echo X+ALAAUgMhWHL7QmQ87E1kSfIqLKfgC03Guy7/xk9MyIg2w4Jzqeu60CVw2r/dm6v6E0DWHTnJUEJGVQAoPaBc== | base64
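
If you need to retrieve the current key first, a pipeline like the following can produce the encoded value (the storage account and resource group names are placeholders):

az storage account keys list --resource-group myResourceGroup --account-name mystorageaccount \
  --query "[0].value" -o tsv | base64 | tr -d '\n'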

To update your Azure secret file, use kubectl edit secret. For example:

kubectl edit secret azure-storage-account-{storage-account-name}-secret

After a few minutes, the agent node will retry the Azure Files mount with the updated storage key.

Cluster autoscaler fails to scale with error: failed to fix node group sizes

If your cluster autoscaler isn't scaling up or down and you see an error like the one below in the cluster autoscaler logs:

E1114 09:58:55.367731 1 static_autoscaler.go:239] Failed to fix node group sizes: failed to decrease aks-default-35246781-vmss: attempt to delete existing nodes

This error is caused by an upstream cluster autoscaler race condition. In such a case, the cluster autoscaler ends up with a different value than the one that is actually in the cluster. To get out of this state, disable and then re-enable the cluster autoscaler, as sketched below.
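
A sketch of that disable/re-enable cycle with the Azure CLI (names and counts are placeholders; use az aks nodepool update for a specific node pool):

az aks update --resource-group myResourceGroup --name myAKSCluster --disable-cluster-autoscaler
az aks update --resource-group myResourceGroup --name myAKSCluster \
  --enable-cluster-autoscaler --min-count 1 --max-count 5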

Slow disk attachment, GetAzureDiskLun takes 10 to 15 minutes and you receive an error

On Kubernetes versions older than 1.15.0, you may receive an error such as Error WaitForAttach Cannot find Lun for disk. The workaround for this issue is to wait approximately 15 minutes and retry.

Why do upgrades to Kubernetes 1.16 fail when using node labels with a kubernetes.io prefix?

As of Kubernetes 1.16, only a defined subset of labels with the kubernetes.io prefix can be applied by the kubelet to nodes. AKS cannot remove active labels on your behalf without consent, as that may cause downtime to impacted workloads.

As a result, to mitigate this issue you can take the following steps (a CLI sketch follows the list):

  1. Upgrade your cluster control plane to 1.16 or higher
  2. Add a new node pool on 1.16 or higher without the unsupported kubernetes.io labels
  3. Delete the older node pool
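
A sketch of those steps (versions, pool names, and counts are placeholders):

# 1. Upgrade only the control plane to 1.16 or later.
az aks upgrade --resource-group myResourceGroup --name myAKSCluster \
  --kubernetes-version 1.16.9 --control-plane-only

# 2. Add a replacement node pool on 1.16 or later without the unsupported kubernetes.io labels.
az aks nodepool add --resource-group myResourceGroup --cluster-name myAKSCluster \
  --name newpool --node-count 3 --kubernetes-version 1.16.9

# 3. Delete the older node pool that carried those labels.
az aks nodepool delete --resource-group myResourceGroup --cluster-name myAKSCluster --name oldpool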

AKS is investigating the capability to mutate active labels on a node pool to improve this mitigation.