AKS troubleshooting

When you create or manage Azure Kubernetes Service (AKS) clusters, you might occasionally encounter problems. This article details some common problems and troubleshooting steps.

In general, where do I find information about debugging Kubernetes problems?

Try the official guide to troubleshooting Kubernetes clusters. There's also a troubleshooting guide, published by an Azure engineer, for troubleshooting pods, nodes, clusters, and other features.

I'm getting a "quota exceeded" error during creation or upgrade. What should I do?

You need to request a quota increase for cores.
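As a quick check before you file the request, the following sketch lists current core usage and limits for a region with the Azure CLI; the region name is a placeholder for the region your cluster runs in:

az vm list-usage --location <region> -o table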

What is the maximum pods-per-node setting for AKS?

The maximum pods-per-node setting is 30 by default if you deploy an AKS cluster in the Azure portal. The maximum pods-per-node setting is 110 by default if you deploy an AKS cluster in the Azure CLI. (Make sure you're using the latest version of the Azure CLI.) This default setting can be changed by using the --max-pods flag in the az aks create command.
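For example, a minimal sketch of creating a cluster with a custom limit; the resource group and cluster names are placeholders, and 50 is an arbitrary illustrative value:

az aks create --resource-group myResourceGroup --name myAKSCluster --node-count 3 --max-pods 50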

I'm getting an insufficientSubnetSize error while deploying an AKS cluster with advanced networking. What should I do?

If Azure CNI (advanced networking) is used, AKS allocates IP addresses based on the configured maximum pods per node. The subnet size must therefore be greater than the product of the number of nodes and the maximum pods per node setting. The following equation outlines this:

Subnet size > number of nodes in the cluster (taking into consideration future scaling requirements) * maximum pods per node.
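As an illustrative calculation under the formula above: a cluster expected to scale to 5 nodes with the default CLI maximum of 110 pods per node needs more than 5 * 110 = 550 IP addresses, so a /22 subnet (1019 usable addresses, after the 5 that Azure reserves per subnet) leaves comfortable headroom, while a /24 (251 usable addresses) does not.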

For more information, see Plan IP addressing for your cluster.

My pod is stuck in CrashLoopBackOff mode. What should I do?

There might be various reasons for the pod being stuck in that mode. You might look into:

  • The pod itself, by using kubectl describe pod <pod-name>.
  • The logs, by using kubectl logs <pod-name> (see the example after this list).
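For example, a minimal sketch of both checks, assuming a pod named my-pod in the current namespace (the name is a placeholder); the --previous flag shows the logs of the last crashed container:

kubectl describe pod my-pod
kubectl logs my-pod --previous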

For more information on how to troubleshoot pod problems, see Debug applications.

I'm trying to enable RBAC on an existing cluster. How can I do that?

Unfortunately, enabling role-based access control (RBAC) on existing clusters isn't supported at this time. You must explicitly create new clusters. If you use the CLI, RBAC is enabled by default. If you use the AKS portal, a toggle button to enable RBAC is available in the creation workflow.

I created a cluster with RBAC enabled by using either the Azure CLI with defaults or the Azure portal, and now I see many warnings on the Kubernetes dashboard. The dashboard used to work without any warnings. What should I do?

The warnings appear because the cluster now has RBAC enabled and access to the dashboard has been disabled by default. In general, this approach is good practice because the default exposure of the dashboard to all users of the cluster can lead to security threats. If you still want to enable the dashboard, follow the steps in this blog post.

I can't connect to the dashboard. What should I do?

The easiest way to access your service outside the cluster is to run kubectl proxy, which proxies requests sent to localhost port 8001 to the Kubernetes API server. From there, the API server can proxy to your service: http://localhost:8001/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy/#!/node?namespace=default.
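A minimal sketch, assuming kubectl is already configured for the cluster; the proxy listens on localhost port 8001 by default:

kubectl proxy

With the proxy running, open the URL above in a browser on the same machine.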

If you don't see the Kubernetes dashboard, check whether the kube-proxy pod is running in the kube-system namespace. If it isn't in a running state, delete the pod and it will restart.
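A hedged example of that check and restart; the pod name in the second command is a placeholder that you would copy from the output of the first:

kubectl get pods --namespace kube-system | grep kube-proxy
kubectl delete pod kube-proxy-abc12 --namespace kube-system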

I can't get logs by using kubectl logs, or I can't connect to the API server. I'm getting "Error from server: error dialing backend: dial tcp…". What should I do?

Make sure that the default network security group isn't modified and that both port 22 and port 9000 are open for connection to the API server. Check whether the tunnelfront pod is running in the kube-system namespace by using the kubectl get pods --namespace kube-system command. If it isn't, force deletion of the pod and it will restart.
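A hedged sketch of that check and forced restart; the tunnelfront pod name is a placeholder copied from the first command's output:

kubectl get pods --namespace kube-system | grep tunnelfront
kubectl delete pod tunnelfront-abc12 --namespace kube-system --grace-period=0 --force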

I'm trying to upgrade or scale and am getting a "message: Changing property 'imageReference' is not allowed" error. How do I fix this problem?

You might be getting this error because you've modified the tags in the agent nodes inside the AKS cluster. Modifying and deleting tags and other properties of resources in the MC_* resource group can lead to unexpected results. Modifying the resources under the MC_* group in the AKS cluster breaks the service-level objective (SLO).

I'm receiving errors that my cluster is in a failed state and upgrading or scaling will not work until it is fixed

This troubleshooting assistance is directed from aks-cluster-failed

This error occurs when clusters enter a failed state for multiple reasons. Follow the steps below to resolve your cluster's failed state before retrying the previously failed operation:

  1. Until the cluster is out of the failed state, upgrade and scale operations won't succeed. Common root issues and resolutions include:
    • Scaling with insufficient compute (CRP) quota. To resolve, first scale your cluster back to a stable goal state within quota (see the example after this list). Then follow these steps to request a compute quota increase before trying to scale up again beyond the initial quota limits.
    • Scaling a cluster with advanced networking and insufficient subnet (networking) resources. To resolve, first scale your cluster back to a stable goal state within quota. Then follow these steps to request a resource quota increase before trying to scale up again beyond the initial quota limits.
  2. Once the underlying cause for the upgrade failure is resolved, your cluster should be in a succeeded state. Once a succeeded state is verified, retry the original operation.
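As an example of the first step in either resolution, the following sketch scales the cluster back to a smaller node count that fits within the current quota; the names and the count of 3 are placeholders:

az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 3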

I'm receiving errors when trying to upgrade or scale that state my cluster is currently being upgraded or has a failed upgrade

This troubleshooting assistance is directed from aks-pending-upgrade

Upgrade and scale operations on a cluster with a single node pool are mutually exclusive. You can't upgrade and scale a cluster or node pool at the same time. Instead, each operation type must complete on the target resource before the next request on that same resource. As a result, operations are limited when an active upgrade or scale operation is occurring or was attempted and subsequently failed.

To help diagnose the issue, run az aks show -g myResourceGroup -n myAKSCluster -o table to retrieve detailed status on your cluster. Based on the result:

  • If the cluster is actively upgrading, wait until the operation finishes. If it succeeded, retry the previously failed operation.
  • If the cluster has a failed upgrade, follow the steps outlined in the previous section.

Can I move my cluster to a different subscription, or my subscription with my cluster to a new tenant?

If you have moved your AKS cluster to a different subscription, or moved the subscription that owns the cluster to a new tenant, the cluster will lose functionality because it loses its role assignments and service principal rights. AKS does not support moving clusters across subscriptions or tenants because of this constraint.

What naming restrictions are enforced for AKS resources and parameters?

This troubleshooting assistance is directed from aka.ms/aks-naming-rules

Naming restrictions are implemented by both the Azure platform and AKS. If a resource name or parameter breaks one of these restrictions, an error is returned that asks you to provide a different input. The following common naming guidelines apply:

  • Cluster names must be 1-63 characters. The only allowed characters are letters, numbers, dashes, and underscores. The first and last character must be a letter or a number.
  • The AKS MC_ resource group name combines the resource group name and resource name. The auto-generated syntax of MC_resourceGroupName_resourceName_AzureRegion must be no greater than 80 characters. If needed, reduce the length of your resource group name or AKS cluster name.
  • The dnsPrefix must start and end with alphanumeric values. Valid characters include alphanumeric values and hyphens (-). The dnsPrefix can't include special characters such as a period (.).

I'm receiving errors when trying to create, update, scale, delete, or upgrade a cluster, stating that the operation is not allowed because another operation is in progress.

This troubleshooting assistance is directed from aks-pending-operation

Cluster operations are limited when a previous operation is still in progress. To retrieve a detailed status of your cluster, use the az aks show -g myResourceGroup -n myAKSCluster -o table command. Use your own resource group and AKS cluster name as needed.

Based on the output of the cluster status, either wait for the in-progress operation to finish, or follow the earlier guidance if the cluster is in a failed state.
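For scripting, a hedged variant of the same command that returns only the provisioning state (for example Succeeded, Upgrading, or Failed); --query and -o tsv are standard Azure CLI options:

az aks show -g myResourceGroup -n myAKSCluster --query provisioningState -o tsv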

I'm receiving errors that my service principal was not found when I try to create a new cluster without passing in an existing one.

When you create an AKS cluster, a service principal is required to create resources on your behalf. AKS offers the ability to have a new service principal created at cluster creation time, but this requires Azure Active Directory to fully propagate the new service principal in a reasonable time for cluster creation to succeed. When this propagation takes too long, the cluster will fail validation during creation because it can't find an available service principal to use.

Use the following workarounds for this:

  1. Use an existing service principal that has already propagated across regions, and pass it into AKS at cluster create time (see the sketch after this list).
  2. If you use automation scripts, add a time delay between service principal creation and AKS cluster creation.
  3. If you use the Azure portal, return to the cluster settings during creation and retry the validation page after a few minutes.
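A minimal sketch of the first workaround; it assumes the older --skip-assignment flag is available in your CLI version, and the appId and password placeholders come from the output of the first command:

az ad sp create-for-rbac --skip-assignment
az aks create --resource-group myResourceGroup --name myAKSCluster --service-principal <appId> --client-secret <password>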

Azure Storage and AKS Troubleshooting

Recommended Kubernetes versions for Azure Disk:

Kubernetes version    Recommended version
1.12                  1.12.9 or later
1.13                  1.13.6 or later
1.14                  1.14.2 or later

What versions of Kubernetes have Azure Disk support on the Sovereign Cloud?

Kubernetes version    Recommended version
1.12                  1.12.0 or later
1.13                  1.13.0 or later
1.14                  1.14.0 or later

WaitForAttach failed for Azure Disk: parsing "/dev/disk/azure/scsi1/lun1": invalid syntax

In Kubernetes version 1.10, MountVolume.WaitForAttach may fail with an Azure Disk remount.

On Linux, you may see an incorrect DevicePath format error. For example:

MountVolume.WaitForAttach failed for volume "pvc-f1562ecb-3e5f-11e8-ab6b-000d3af9f967" : azureDisk - Wait for attach expect device path as a lun number, instead got: /dev/disk/azure/scsi1/lun1 (strconv.Atoi: parsing "/dev/disk/azure/scsi1/lun1": invalid syntax)
  Warning  FailedMount             1m (x10 over 21m)   kubelet, k8s-agentpool-66825246-0  Unable to mount volumes for pod

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.10                  1.10.2 or later
1.11                  1.11.0 or later
1.12 and later        N/A

Failure when setting uid and gid in mountOptions for Azure Disk

Azure Disk uses the ext4,xfs filesystem by default, and mountOptions such as uid=x,gid=x can't be set at mount time. For example, if you tried to set mountOptions uid=999,gid=999, you would see an error like:

Warning  FailedMount             63s                  kubelet, aks-nodepool1-29460110-0  MountVolume.MountDevice failed for volume "pvc-d783d0e4-85a1-11e9-8a90-369885447933" : azureDisk - mountDevice:FormatAndMount failed with mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m436970985 --scope -- mount -t xfs -o dir_mode=0777,file_mode=0777,uid=1000,gid=1000,defaults /dev/disk/azure/scsi1/lun2 /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m436970985
Output: Running scope as unit run-rb21966413ab449b3a242ae9b0fbc9398.scope.
mount: wrong fs type, bad option, bad superblock on /dev/sde,
       missing codepage or helper program, or other error

You can mitigate the issue by doing one of the following:

  • Configure the security context for a pod by setting uid in runAsUser and gid in fsGroup. For example, the following setting sets the pod to run as root, making it accessible to any file:
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 0
    fsGroup: 0

Note

By default, gid and uid are mounted as root, or 0. If gid or uid is set as non-root, for example 1000, Kubernetes will use chown to change all directories and files under that disk. This operation can be time consuming and may make mounting the disk very slow.

  • Use chown in initContainers to set gid and uid. For example:
initContainers:
- name: volume-mount
  image: busybox
  command: ["sh", "-c", "chown -R 100:100 /data"]
  volumeMounts:
  - name: <your data volume>
    mountPath: /data

Error when deleting Azure Disk PersistentVolumeClaim in use by a pod

If you try to delete an Azure Disk PersistentVolumeClaim that is being used by a pod, you may see an error. For example:

$ kubectl describe pv pvc-d8eebc1d-74d3-11e8-902b-e22b71bb1c06
...
Message:         disk.DisksClient#Delete: Failure responding to request: StatusCode=409 -- Original Error: autorest/azure: Service returned an error. Status=409 Code="OperationNotAllowed" Message="Disk kubernetes-dynamic-pvc-d8eebc1d-74d3-11e8-902b-e22b71bb1c06 is attached to VM /subscriptions/{subs-id}/resourceGroups/MC_markito-aks-pvc_markito-aks-pvc_chinaeast2/providers/Microsoft.Compute/virtualMachines/aks-agentpool-25259074-0."

In Kubernetes version 1.10 and later, there is a PersistentVolumeClaim protection feature enabled by default to prevent this error. If you are using a version of Kubernetes that does not have the fix for this issue, you can mitigate the issue by deleting the pod that is using the PersistentVolumeClaim before deleting the PersistentVolumeClaim itself.
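A hedged sketch of that order of operations; the pod and claim names are placeholders:

kubectl delete pod my-pod
kubectl delete pvc my-azure-disk-pvc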

Error "Cannot find Lun for disk" when attaching a disk to a node

When attaching a disk to a node, you may see the following error:

MountVolume.WaitForAttach failed for volume "pvc-12b458f4-c23f-11e8-8d27-46799c22b7c6" : Cannot find Lun for disk kubernetes-dynamic-pvc-12b458f4-c23f-11e8-8d27-46799c22b7c6

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.10                  1.10.10 or later
1.11                  1.11.5 or later
1.12                  1.12.3 or later
1.13                  1.13.0 or later
1.14 and later        N/A

If you are using a version of Kubernetes that does not have the fix for this issue, you can mitigate the issue by waiting several minutes and retrying.

Azure Disk attach/detach failure, mount issues, or I/O errors during multiple attach/detach operations

Starting in Kubernetes version 1.9.2, when running multiple attach/detach operations in parallel, you may see the following disk issues due to a dirty VM cache:

  • Disk attach/detach failures
  • Disk I/O errors
  • Unexpected disk detachment from the VM
  • VM running into a failed state due to attaching a non-existing disk

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.10                  1.10.12 or later
1.11                  1.11.6 or later
1.12                  1.12.4 or later
1.13                  1.13.0 or later
1.14 and later        N/A

If you are using a version of Kubernetes that does not have the fix for this issue, you can mitigate the issue by trying the following:

  • If a disk is waiting to detach for a long period of time, try detaching the disk manually, as shown in the sketch below.
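A hedged example of a manual detach with the Azure CLI; the MC_ resource group, node VM, and disk names are placeholders taken from your cluster's node resource group:

az vm disk detach -g <MC_RESOURCE_GROUP_NAME> --vm-name <NODE_VM_NAME> -n <DISK_NAME>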

Azure Disk waiting to detach indefinitely

In some cases, if an Azure Disk detach operation fails on the first attempt, it will not retry the detach operation and will remain attached to the original node VM. This error can occur when moving a disk from one node to another. For example:

[Warning] AttachVolume.Attach failed for volume "pvc-7b7976d7-3a46-11e9-93d5-dee1946e6ce9" : Attach volume "kubernetes-dynamic-pvc-7b7976d7-3a46-11e9-93d5-dee1946e6ce9" to instance "/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Compute/virtualMachines/aks-agentpool-57634498-0" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status= Code="ConflictingUserInput" Message="Disk '/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-7b7976d7-3a46-11e9-93d5-dee1946e6ce9' cannot be attached as the disk is already owned by VM '/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Compute/virtualMachines/aks-agentpool-57634498-1'."

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.11                  1.11.9 or later
1.12                  1.12.7 or later
1.13                  1.13.4 or later
1.14 and later        N/A

If you are using a version of Kubernetes that does not have the fix for this issue, you can mitigate the issue by manually detaching the disk.

Azure Disk detach failure leading to potential race condition issue and invalid data disk list

When an Azure Disk fails to detach, it will retry up to six times to detach the disk using exponential back-off. It will also hold a node-level lock on the data disk list for about 3 minutes. If the disk list is updated manually during that period of time, such as by a manual attach or detach operation, the disk list held by the node-level lock becomes obsolete and causes instability on the node VM.

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.12                  1.12.9 or later
1.13                  1.13.6 or later
1.14                  1.14.2 or later
1.15 and later        N/A

If you are using a version of Kubernetes that does not have the fix for this issue and your node VM has an obsolete disk list, you can mitigate the issue by detaching all non-existing disks from the VM as a single, bulk operation. Individually detaching non-existing disks may fail.

Large number of Azure Disks causes slow attach/detach

When the number of Azure Disks attached to a node VM is larger than 10, attach and detach operations may be slow. This is a known issue, and there are no workarounds at this time.

Azure Disk detach failure leading to potential node VM in failed state

In some edge cases, an Azure Disk detach may partially fail and leave the node VM in a failed state.

This issue has been fixed in the following versions of Kubernetes:

Kubernetes version    Fixed version
1.12                  1.12.10 or later
1.13                  1.13.8 or later
1.14                  1.14.4 or later
1.15 and later        N/A

If you are using a version of Kubernetes that does not have the fix for this issue and your node VM is in a failed state, you can mitigate the issue by manually updating the VM status using one of the following:

  • For an availability set-based cluster:

    az vm update -n <VM_NAME> -g <RESOURCE_GROUP_NAME>
    
  • For a VMSS-based cluster:

    az vmss update-instances -g <RESOURCE_GROUP_NAME> --name <VMSS_NAME> --instance-id <ID>
    

Azure Files and AKS Troubleshooting

Recommended Kubernetes versions for Azure Files:

Kubernetes version    Recommended version
1.12                  1.12.6 or later
1.13                  1.13.4 or later
1.14                  1.14.0 or later

What versions of Kubernetes have Azure Files support on the Sovereign Cloud?

Kubernetes version    Recommended version
1.12                  1.12.0 or later
1.13                  1.13.0 or later
1.14                  1.14.0 or later

What are the default mountOptions when using Azure Files?

Recommended settings:

Kubernetes version    fileMode and dirMode value
1.12.0 - 1.12.1       0755
1.12.2 and later      0777

If you're using a cluster with Kubernetes version 1.8.5 or greater and dynamically creating the persistent volume with a storage class, mount options can be specified on the storage class object. The following example sets 0777:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=1000
  - gid=1000
  - mfsymlinks
  - nobrl
  - cache=none
parameters:
  skuName: Standard_LRS

Some additional useful mountOptions settings:

  • mfsymlinks will make the Azure Files mount (cifs) support symbolic links
  • nobrl will prevent sending byte range lock requests to the server. This setting is necessary for certain applications that break with cifs-style mandatory byte range locks. Most cifs servers do not yet support requesting advisory byte range locks. If you don't use nobrl, applications that break with cifs-style mandatory byte range locks may cause error messages similar to:
    Error: SQLITE_BUSY: database is locked
    

Error "could not change permissions" when using Azure Files

When running PostgreSQL on the Azure Files plugin, you may see an error similar to:

initdb: could not change permissions of directory "/var/lib/postgresql/data": Operation not permitted
fixing permissions on existing directory /var/lib/postgresql/data

This error is caused by the Azure Files plugin using the cifs/SMB protocol. When the cifs/SMB protocol is used, the file and directory permissions can't be changed after mounting.

To resolve this issue, use subPath together with the Azure Disk plugin.

Note

For the ext3/4 disk types, there is a lost+found directory after the disk is formatted.

Azure Files has high latency compared to Azure Disk when handling many small files

In some cases, such as when handling many small files, you may experience high latency when using Azure Files compared to Azure Disk.

Error when enabling the "Allow access from selected network" setting on a storage account

If you enable allow access from selected network on a storage account that is used for dynamic provisioning in AKS, you will get an error when AKS creates a file share:

persistentvolume-controller (combined from similar events): Failed to provision volume with StorageClass "azurefile": failed to create share kubernetes-dynamic-pvc-xxx in account xxx: failed to create file share, err: storage: service returned error: StatusCode=403, ErrorCode=AuthorizationFailure, ErrorMessage=This request is not authorized to perform this operation.

This error occurs because the Kubernetes persistentvolume-controller is not on the network chosen when setting allow access from selected network.

You can mitigate the issue by using static provisioning with Azure Files.

Azure Files mount fails due to storage account key changed

If your storage account key has changed, you may see Azure Files mount failures.

You can mitigate the issue by manually updating the azurestorageaccountkey field in the Azure file secret with your base64-encoded storage account key.

To encode your storage account key in base64, you can use base64. For example:

echo X+ALAAUgMhWHL7QmQ87E1kSfIqLKfgC03Guy7/xk9MyIg2w4Jzqeu60CVw2r/dm6v6E0DWHTnJUEJGVQAoPaBc== | base64

To update your Azure secret file, use kubectl edit secret. For example:

kubectl edit secret azure-storage-account-{storage-account-name}-secret

After a few minutes, the agent node will retry the Azure Files mount with the updated storage key.