Azure Kubernetes Service (AKS) node auto-repair

AKS continuously checks the health state of worker nodes and automatically repairs nodes that become unhealthy. This document describes for operators how automatic node repair behaves. In addition to AKS repairs, the Azure VM platform performs maintenance on virtual machines that experience issues. AKS and Azure VMs work together to minimize service disruptions for clusters.

Limitations

  • Windows node pools are not currently supported.

How AKS checks for unhealthy nodes

AKS uses the following rules to determine whether a node is unhealthy and needs automatic repair:

  • The node reports a status of NotReady on consecutive checks within a 10-minute timeframe.
  • The node doesn't report a status within 10 minutes.
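The two rules above can be sketched as a small predicate. This is only an illustration under stated assumptions, not AKS's actual implementation: the function name `needs_repair`, the shape of the `reports` list, and the reading of "consecutive checks" as two back-to-back NotReady reports are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# The 10-minute timeframe named in both rules.
WINDOW = timedelta(minutes=10)


def needs_repair(reports: List[Tuple[datetime, str]], now: datetime) -> bool:
    """Return True if a node matches either unhealthy rule.

    `reports` is a hypothetical list of (timestamp, status) health checks,
    where status is "Ready" or "NotReady".
    """
    ordered = sorted(reports, key=lambda report: report[0])
    recent = [status for ts, status in ordered if now - WINDOW <= ts <= now]

    # Rule 2: the node reported no status at all within the window.
    if not recent:
        return True

    # Rule 1 (assumed reading): two back-to-back checks in the window
    # both reported NotReady.
    return any(
        first == "NotReady" and second == "NotReady"
        for first, second in zip(recent, recent[1:])
    )
```

A node whose only reports are older than 10 minutes is treated the same as a node that never reported, which is how this sketch models the second rule.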

You can manually check the health state of your nodes with kubectl.

kubectl get nodes
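For scripted checks, the same information is available as JSON from `kubectl get nodes -o json`. The following is a minimal sketch that lists nodes whose Ready condition is not "True"; the helper names `not_ready_nodes` and `fetch_node_list` are hypothetical, and it assumes kubectl is installed and already configured for the cluster.

```python
import json
import subprocess
from typing import Any, Dict, List


def not_ready_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return names of nodes whose Ready condition is missing or not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A Ready status of "False" or "Unknown" both count as not ready here.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def fetch_node_list() -> Dict[str, Any]:
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

With cluster access, `not_ready_nodes(fetch_node_list())` returns the node names worth inspecting further.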

How automatic repair works

Note

AKS initiates repair operations with the user account aks-remediator.

If a node is determined to be unhealthy based on the rules above and remains unhealthy for 10 consecutive minutes, AKS reboots the node. If the node is still unhealthy after this initial repair operation, AKS engineers investigate additional remediations.

If multiple nodes are unhealthy during a health check, each node is repaired individually, and one repair finishes before the next begins.
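The one-at-a-time behavior described above can be illustrated with a short sketch. It is a model of the described flow, not AKS's implementation: `repair_sequentially` and the `reboot` and `is_healthy` callbacks are hypothetical placeholders, and `reboot` is assumed to block until the reboot has finished.

```python
from typing import Callable, Iterable, List


def repair_sequentially(
    unhealthy: Iterable[str],
    reboot: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Reboot unhealthy nodes one at a time.

    Returns the nodes that are still unhealthy after their reboot, which
    in the flow described above would need further investigation.
    """
    still_unhealthy = []
    for name in unhealthy:
        # Assumed to block, so the next repair starts only after this one ends.
        reboot(name)
        if not is_healthy(name):
            still_unhealthy.append(name)
    return still_unhealthy
```

Processing nodes in a plain loop, rather than in parallel, keeps at most one node rebooting at any moment, which limits how much cluster capacity is unavailable during repairs.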