Understand Kubernetes cluster health with Azure Monitor for containers

Azure Monitor for containers monitors and reports the health status of the managed infrastructure components and of all nodes running on any Kubernetes cluster that it supports. This experience extends beyond the cluster health status calculated and reported on the multi-cluster view: based on curated metrics, you can now see whether one or more nodes in the cluster are resource constrained, or whether a node or pod is unavailable in a way that could impact an application running in the cluster.

For information about how to enable Azure Monitor for containers, see Onboard Azure Monitor for containers.

Note

To support AKS Engine clusters, verify that the cluster meets the following requirements:

Overview

In Azure Monitor for containers, the Health (preview) feature provides proactive health monitoring of your Kubernetes cluster to help you identify and diagnose issues. It lets you view the significant issues that have been detected. The monitors that evaluate the health of your cluster run on the containerized agent in your cluster, and the health data is written to the KubeHealth table in your Log Analytics workspace.

Kubernetes cluster health is based on a number of monitoring scenarios organized by the following Kubernetes objects and abstractions:

  • Kubernetes infrastructure - provides a rollup of the Kubernetes API server, ReplicaSets, and DaemonSets running on nodes deployed in your cluster, by evaluating CPU and memory utilization and Pod availability.

    Kubernetes infrastructure health rollup view

  • Nodes - provides a rollup of the Node pools and the state of individual Nodes in each pool, by evaluating CPU and memory utilization and each Node's status as reported by Kubernetes.

    Node health rollup view

Currently, only the status of a virtual kubelet is supported. The health state for CPU and memory utilization of virtual kubelet nodes is reported as Unknown, because no signal is received from them.

All monitors are shown in a hierarchical layout in the Health Hierarchy pane. An aggregate monitor representing the Kubernetes object or abstraction (that is, Kubernetes infrastructure or Nodes) is the top-most monitor and reflects the combined health of all dependent child monitors. The key monitoring scenarios used to derive health are:

  • Evaluate CPU utilization from the node and container.
  • Evaluate memory utilization from the node and container.
  • Determine the status of Pods and Nodes from their ready state as reported by Kubernetes.
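The relationship between an aggregate monitor and its child monitors can be illustrated with a minimal sketch. It assumes a "worst state wins" rollup and an arbitrary severity ordering for Unknown; neither is stated in this article, and the actual algorithm for each monitor is described in the Health monitor configuration guide. The `Monitor` class and `rollup` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class HealthState(IntEnum):
    """Health states named in this article.

    The numeric ordering is an assumption made for this sketch only.
    """
    HEALTHY = 0
    UNKNOWN = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Monitor:
    """Hypothetical monitor node: a unit monitor has no children."""
    name: str
    state: HealthState = HealthState.UNKNOWN
    children: List["Monitor"] = field(default_factory=list)


def rollup(monitor: Monitor) -> HealthState:
    """Return the monitor's state, deriving aggregate states from children.

    Assumption: an aggregate monitor takes the worst state of its children.
    """
    if not monitor.children:
        return monitor.state  # unit monitor: state comes from its own samples
    monitor.state = max(rollup(child) for child in monitor.children)
    return monitor.state


nodes = Monitor("Nodes", children=[
    Monitor("Node CPU utilization", HealthState.HEALTHY),
    Monitor("Node memory utilization", HealthState.WARNING),
    Monitor("Node status", HealthState.HEALTHY),
])
print(rollup(nodes).name)  # WARNING
```

Under these assumptions, a single child in a Warning state is enough to move the top-most Nodes monitor to Warning, which is the behavior the hierarchy is meant to surface.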

The icons used to indicate state are as follows:

Icon                                     Meaning
Green check mark                         Success, health is OK (green)
Yellow triangle with exclamation mark    Warning (yellow)
Red button with white X                  Critical (red)
Grayed-out icon                          Unknown (gray)

Monitor configuration

To understand the behavior and configuration of each monitor that supports the Azure Monitor for containers Health feature, see Health monitor configuration guide.

Sign in to the Azure portal

Sign in to the Azure portal.

View health of an AKS or non-AKS cluster

The Azure Monitor for containers Health (preview) feature is available directly from an AKS cluster: select Insights from the left pane in the Azure portal, and then, under the Insights section, select Containers.

To view health for a non-AKS cluster (that is, an AKS Engine cluster hosted on-premises or on Azure Stack), select Azure Monitor from the left pane in the Azure portal. Under the Insights section, select Containers. On the multi-cluster page, select the non-AKS cluster from the list.

In Azure Monitor for containers, from the Cluster page, select Health.

Cluster health dashboard example

Review cluster health

When the Health page opens, Kubernetes Infrastructure is selected by default in the Health Aspect grid. The grid summarizes the current health rollup state of the Kubernetes infrastructure and the cluster nodes. Selecting either health aspect updates the results in the Health Hierarchy pane (the middle pane), which shows all child monitors in a hierarchical layout together with their current health state. To view more information about any dependent monitor, select it; a property pane automatically opens on the right side of the page.

Cluster health property pane

The property pane shows the following:

  • The Overview tab shows the current state of the selected monitor, when the monitor was last calculated, and when the last state change occurred. Additional information is shown depending on the type of monitor selected in the hierarchy.

    If you select an aggregate monitor in the Health Hierarchy pane, the Overview tab on the property pane shows a rollup of the total number of child monitors in the hierarchy, and how many aggregate monitors are in a critical, warning, or healthy state.

    Health property pane Overview tab for an aggregate monitor

    If you select a unit monitor in the Health Hierarchy pane, the Overview tab also shows, under Last state change, the previous samples calculated and reported by the containerized agent within the last four hours. This reflects how a unit monitor is calculated: it compares several consecutive values to determine its state. For example, if you select the Pod ready state unit monitor, it shows the last two samples, a number controlled by the parameter ConsecutiveSamplesForStateTransition. For more information, see the detailed description of unit monitors.

    Health property pane Overview tab

    If the time reported by Last state change is a day or more in the past, the monitor's state has simply not changed since then. However, if the last sample received for a unit monitor is more than four hours old, the containerized agent has likely stopped sending data. If the agent knows that a particular resource exists (for example, a Node) but has not received data from that resource's monitors (for example, the Node's CPU or memory utilization monitors), the health state of the monitor is set to Unknown.

  • The Config tab shows the default configuration parameter settings and their values (only for unit monitors, not aggregate monitors).

  • The Knowledge tab contains information explaining the behavior of the monitor and how it evaluates for the unhealthy condition.
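The Overview tab description above says that a unit monitor compares several consecutive samples to determine its state, with the count controlled by ConsecutiveSamplesForStateTransition. One plausible reading of that rule is sketched below: the state changes only when the most recent N samples all agree. This interpretation, the function name `next_state`, and the string state values are assumptions for illustration, not the agent's actual implementation.

```python
from typing import Sequence


def next_state(
    current_state: str,
    samples: Sequence[str],
    consecutive_samples_for_state_transition: int = 2,
) -> str:
    """Return the monitor state after considering the latest samples.

    `samples` is ordered oldest to newest. Assumption: the state only
    transitions when the last N samples all report the same state.
    """
    n = consecutive_samples_for_state_transition
    if n < 1:
        raise ValueError("at least one sample is required for a transition")
    recent = list(samples[-n:])
    if len(recent) < n:
        return current_state  # not enough samples yet to decide
    if all(sample == recent[0] for sample in recent):
        return recent[0]  # N consecutive samples agree: adopt that state
    return current_state  # samples disagree: keep the existing state


# A single failing sample does not flip the state; two in a row do.
print(next_state("Healthy", ["Healthy", "Critical"]))   # Healthy
print(next_state("Healthy", ["Critical", "Critical"]))  # Critical
```

Requiring agreement across consecutive samples keeps a single transient reading from changing the reported state, which is consistent with the two samples shown for the Pod ready state monitor.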

Monitoring data on this page does not refresh automatically. Select Refresh at the top of the page to see the most recent health state received from the cluster.
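The guidance in the Overview tab description on reading Last state change and sample age can be condensed into a small decision helper. This is only a sketch of the rules as stated in this article: the function name `interpret_monitor`, its parameters, and the returned labels are hypothetical, and the four-hour and one-day thresholds are taken from the text above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_SAMPLE_AGE = timedelta(hours=4)   # samples older than this suggest a silent agent
OLD_STATE_CHANGE_AGE = timedelta(days=1)


def interpret_monitor(
    last_sample_time: Optional[datetime],
    last_state_change: datetime,
    now: datetime,
) -> str:
    """Classify what the timestamps on a unit monitor most likely mean."""
    if last_sample_time is None:
        # The resource is known, but no data was ever received for it.
        return "unknown"
    if now - last_sample_time > STALE_SAMPLE_AGE:
        # The containerized agent has likely stopped sending data.
        return "agent_not_reporting"
    if now - last_state_change >= OLD_STATE_CHANGE_AGE:
        # Data is fresh; the state simply has not changed for a while.
        return "stable"
    return "recently_changed"


now = datetime(2020, 1, 2, 12, 0, tzinfo=timezone.utc)
print(interpret_monitor(now - timedelta(minutes=5), now - timedelta(days=3), now))  # stable
print(interpret_monitor(now - timedelta(hours=5), now - timedelta(days=3), now))    # agent_not_reporting
```

Because the page does not refresh automatically, timestamps like these should be read only after selecting Refresh.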

Next steps

View log query examples to see predefined queries and examples that you can evaluate or customize to alert on, visualize, or analyze your clusters.