Azure Monitor for containers health monitor configuration guide

Monitors are the primary element for measuring health and detecting errors in Azure Monitor for containers. This article explains how health is measured and describes the elements that make up the health model, so that you can monitor and report on the health of your Kubernetes cluster with the Health (preview) feature.

Note

The Health feature is in public preview at this time.

Monitors

A monitor measures the health of some aspect of a managed object. Each monitor has either two or three health states, and at any given time it is in exactly one of them. When a monitor is loaded by the containerized agent, it is initialized to a healthy state. The state changes only if the specified conditions for another state are detected.

The overall health of a particular object is determined from the health of each of its monitors. This hierarchy is illustrated in the Health Hierarchy pane in Azure Monitor for containers. The policy for how health is rolled up is part of the configuration of the aggregate monitors.

Types of monitors

| Monitor | Description |
|---|---|
| Unit monitor | A unit monitor measures some aspect of a resource or application. This might be checking a performance counter to determine the performance of the resource, or its availability. |
| Aggregate monitor | Aggregate monitors group multiple monitors to provide a single aggregated health state. Unit monitors are typically configured under a particular aggregate monitor. For example, a Node aggregate monitor rolls up the status of the Node CPU utilization, memory utilization, and Node status. |

Aggregate monitor health rollup policy

Each aggregate monitor defines a health rollup policy, which is the logic used to determine the health of the aggregate monitor based on the health of the monitors under it. The possible health rollup policies for an aggregate monitor are as follows:

Worst state policy

The state of the aggregate monitor matches the state of the child monitor with the worst health state. This is the most common policy used by aggregate monitors.

[Figure: Example of aggregate monitor rollup with the worst state policy]
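The worst state policy can be sketched in a few lines of Python. This is a minimal illustration rather than the agent's implementation: the `SEVERITY` ranking and the `worst_state_rollup` function are hypothetical names, and the sketch assumes only the three states Healthy, Warning, and Critical.

```python
# Hypothetical severity ranking: a higher number means a worse state.
SEVERITY = {"Healthy": 0, "Warning": 1, "Critical": 2}


def worst_state_rollup(child_states):
    """Return the worst state among the child monitors."""
    if not child_states:
        raise ValueError("an aggregate monitor needs at least one child state")
    return max(child_states, key=SEVERITY.__getitem__)


# A Node aggregate monitor with CPU, memory, and status children.
print(worst_state_rollup(["Healthy", "Warning", "Healthy"]))  # Warning
```

A single Warning or Critical child is enough to change the aggregate, which is why this policy suits monitors where every child matters.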

Percentage policy

The source object (the aggregate monitor) matches the worst state of a single member among a specified percentage of target objects in the best state. This policy is used when a certain percentage of target objects must be healthy for the source object to be considered healthy. The percentage policy sorts the monitors in descending order of severity of state, and the aggregate monitor's state is computed as the worst state of N% of them (N is set by the configuration parameter StateThresholdPercentage).

For example, suppose there are five container instances of a container image, and their individual states are Critical, Warning, Healthy, Healthy, Healthy. The status of the container CPU utilization monitor will be Critical, since the worst state of 90% of the containers is Critical when sorted in descending order of severity.
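The example above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the agent's code: the `percentage_rollup` function and the `SEVERITY` ranking are hypothetical, the N% is taken from the healthiest end of the sorted list, and the member count is rounded up (which is the reading under which 90% of five instances covers all five and yields Critical). Unknown is ranked above Critical, as the unit monitor descriptions state for the sorting order.

```python
import math

# Hypothetical severity ranking: a higher number means a worse state.
SEVERITY = {"Healthy": 0, "Warning": 1, "Critical": 2, "Unknown": 3}


def percentage_rollup(states, state_threshold_percentage):
    """Worst state among the best N% of members (count rounded up)."""
    if not states:
        raise ValueError("at least one member state is required")
    # Sort in descending order of severity: worst first, best last.
    ordered = sorted(states, key=SEVERITY.__getitem__, reverse=True)
    # Number of members that must be covered, rounded up (assumption).
    count = math.ceil(len(ordered) * state_threshold_percentage / 100)
    count = max(1, min(count, len(ordered)))
    # The best `count` members are the tail of the list; its first
    # element is the worst state among them.
    return ordered[len(ordered) - count]


instances = ["Critical", "Warning", "Healthy", "Healthy", "Healthy"]
print(percentage_rollup(instances, 90))  # Critical
print(percentage_rollup(instances, 80))  # Warning
```

With a threshold of 80, only four of the five instances need to be covered, so the single Critical instance is tolerated and the result is Warning.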

Understand the monitoring configuration

Azure Monitor for containers includes a number of key monitoring scenarios that are configured as follows.

Unit monitors

| Monitor name | Monitor type | Description | Parameter | Value |
|---|---|---|---|---|
| Node Memory Utilization | Unit monitor | This monitor evaluates the memory utilization of a node every minute, using the data reported by cadvisor. | ConsecutiveSamplesForStateTransition | 3 |
| | | | FailIfGreaterThanPercentage | 90 |
| | | | WarnIfGreaterThanPercentage | 80 |
| Node CPU Utilization | Unit monitor | This monitor checks the CPU utilization of the node every minute, using the data reported by cadvisor. | ConsecutiveSamplesForStateTransition | 3 |
| | | | FailIfGreaterThanPercentage | 90 |
| | | | WarnIfGreaterThanPercentage | 80 |
| Node Status | Unit monitor | This monitor checks node conditions reported by Kubernetes. Currently the following node conditions are checked: Disk Pressure, Memory Pressure, PID Pressure, Out of Disk, Network Unavailable, and Ready status for the node. Of these conditions, if either Out of Disk or Network Unavailable is true, the monitor changes to a Critical state. If any other condition, other than Ready status, is true, the monitor changes to a Warning state. | NodeConditionTypeForFailedState | outofdisk,networkunavailable |
| Container memory utilization | Unit monitor | This monitor reports the combined health status of the memory utilization (RSS) of the instances of the container. It compares each sample to a single threshold, and the number of consecutive samples required for a state transition is specified by the configuration parameter ConsecutiveSamplesForStateTransition. Its state is calculated as the worst state of 90% (StateThresholdPercentage) of the container instances, sorted in descending order of severity of container health state (that is, Critical, Warning, Healthy). If no record is received from a container instance, the health state of that instance is reported as Unknown, which takes precedence over the Critical state in the sorting order. Each container instance's state is calculated using the thresholds specified in the configuration. If the usage is over the critical threshold (90%), the instance is in a Critical state. If it is below the critical threshold (90%) but above the warning threshold (80%), the instance is in a Warning state. Otherwise, it is in a Healthy state. | ConsecutiveSamplesForStateTransition | 3 |
| | | | FailIfLessThanPercentage | 90 |
| | | | StateThresholdPercentage | 90 |
| | | | WarnIfGreaterThanPercentage | 80 |
| Container CPU utilization | Unit monitor | This monitor reports the combined health status of the CPU utilization of the instances of the container. It compares each sample to a single threshold, and the number of consecutive samples required for a state transition is specified by the configuration parameter ConsecutiveSamplesForStateTransition. Its state is calculated as the worst state of 90% (StateThresholdPercentage) of the container instances, sorted in descending order of severity of container health state (that is, Critical, Warning, Healthy). If no record is received from a container instance, the health state of that instance is reported as Unknown, which takes precedence over the Critical state in the sorting order. Each container instance's state is calculated using the thresholds specified in the configuration. If the usage is over the critical threshold (90%), the instance is in a Critical state. If it is below the critical threshold (90%) but above the warning threshold (80%), the instance is in a Warning state. Otherwise, it is in a Healthy state. | ConsecutiveSamplesForStateTransition | 3 |
| | | | FailIfLessThanPercentage | 90 |
| | | | StateThresholdPercentage | 90 |
| | | | WarnIfGreaterThanPercentage | 80 |
| System workload pods ready | Unit monitor | This monitor reports status based on the percentage of pods in a ready state in a given workload. Its state is set to Critical if less than 100% of the pods are in a Healthy state. | ConsecutiveSamplesForStateTransition | 2 |
| | | | FailIfLessThanPercentage | 100 |
| Kube API status | Unit monitor | This monitor reports the status of the Kube API service. The monitor is in a Critical state if the Kube API endpoint is unavailable. For this particular monitor, the state is determined by querying the 'nodes' endpoint of the kube-api server. Any response code other than OK changes the monitor to a Critical state. | No configuration properties | |
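The threshold and state-transition behavior described in the table can be sketched in Python as follows. This is an illustrative sketch, not the agent's implementation. The names `utilization_state`, `node_status_state`, and `DebouncedMonitor` are hypothetical. The greater-than comparisons follow the 90% and 80% thresholds given in the descriptions, and reading ConsecutiveSamplesForStateTransition as "the last N samples must agree before the state changes" is an assumption. A Ready condition that is false is not modeled, because the table does not describe that case.

```python
from collections import deque


def utilization_state(percent, warn_if_greater_than=80.0, fail_if_greater_than=90.0):
    """Classify one utilization sample against the warning and critical thresholds."""
    if percent > fail_if_greater_than:
        return "Critical"
    if percent > warn_if_greater_than:
        return "Warning"
    return "Healthy"


def node_status_state(conditions, failed_condition_types=("outofdisk", "networkunavailable")):
    """Classify a node from a mapping of Kubernetes condition type to bool."""
    # Collect the conditions that are true, ignoring the Ready status.
    active = {name.lower() for name, value in conditions.items()
              if value and name.lower() != "ready"}
    if active & set(failed_condition_types):
        return "Critical"
    if active:
        return "Warning"
    return "Healthy"


class DebouncedMonitor:
    """Starts Healthy and changes state only after N identical consecutive samples."""

    def __init__(self, consecutive_samples=3):
        self.state = "Healthy"
        self._recent = deque(maxlen=consecutive_samples)

    def observe(self, sample_state):
        self._recent.append(sample_state)
        window_full = len(self._recent) == self._recent.maxlen
        if window_full and len(set(self._recent)) == 1:
            self.state = sample_state
        return self.state


monitor = DebouncedMonitor(consecutive_samples=3)
for cpu_percent in (95, 96, 97):
    state = monitor.observe(utilization_state(cpu_percent))
print(state)  # Critical, after three consecutive samples above 90%
```

Requiring several agreeing samples keeps a short spike from flipping the monitor, at the cost of a delay of a few evaluation intervals before a real problem is reported.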

Aggregate monitors

| Monitor name | Description | Algorithm |
|---|---|---|
| Node | This monitor is an aggregate of all the node monitors. It matches the state of the child monitor with the worst health state: Node CPU utilization, Node memory utilization, Node Status. | Worst of |
| Node pool | This monitor reports the combined health status of all nodes in the node pool agentpool. This is a three-state monitor, whose state is based on the worst state of 80% of the nodes in the node pool, sorted in descending order of severity of node states (that is, Critical, Warning, Healthy). | Percentage |
| Nodes (parent of Node pool) | This is an aggregate monitor of all the node pools. Its state is based on the worst state of its child monitors (that is, the node pools present in the cluster). | Worst of |
| Cluster (parent of nodes/Kubernetes infrastructure) | This is the parent monitor that matches the state of the child monitor with the worst health state, that is, Kubernetes infrastructure and nodes. | Worst of |
| Kubernetes infrastructure | This monitor reports the combined health status of the managed infrastructure components of the cluster. Its status is calculated as the 'worst of' its child monitor states, that is, kube-system workloads and API server status. | Worst of |
| System workload | This monitor reports the health status of a kube-system workload. This monitor matches the state of the child monitor with the worst health state, that is, the Pods in ready state monitor and the containers in the workload. | Worst of |
| Container | This monitor reports the overall health status of a container in a given workload. This monitor matches the state of the child monitor with the worst health state, that is, the CPU utilization and Memory utilization monitors. | Worst of |
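To show how these aggregate monitors compose, the following Python sketch evaluates a small hierarchy from the leaves up. It is a simplified model under assumptions: the nested-dictionary layout, the `evaluate` function, and the sample states are all hypothetical, only the three states Healthy, Warning, and Critical are handled, and the percentage policy rounds the member count up and counts from the healthiest end.

```python
import math

# Hypothetical severity ranking: a higher number means a worse state.
SEVERITY = {"Healthy": 0, "Warning": 1, "Critical": 2}


def evaluate(monitor):
    """A unit monitor is a state string; an aggregate is a dict with children."""
    if isinstance(monitor, str):
        return monitor
    states = [evaluate(child) for child in monitor["children"]]
    if monitor["policy"] == "worst":
        return max(states, key=SEVERITY.__getitem__)
    # Percentage policy: worst state among the best N% of children.
    ordered = sorted(states, key=SEVERITY.__getitem__, reverse=True)
    count = max(1, math.ceil(len(ordered) * monitor["threshold"] / 100))
    return ordered[len(ordered) - count]


def node(cpu, memory, status):
    """Node aggregate: worst of CPU utilization, memory utilization, status."""
    return {"policy": "worst", "children": [cpu, memory, status]}


# Five nodes, one of which has Critical CPU utilization.
node_pool = {"policy": "percentage", "threshold": 80, "children": [
    node("Critical", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
]}
nodes = {"policy": "worst", "children": [node_pool]}

container = {"policy": "worst", "children": ["Warning", "Healthy"]}  # CPU, memory
system_workload = {"policy": "worst", "children": ["Healthy", container]}  # pods ready, container
infrastructure = {"policy": "worst", "children": [system_workload, "Healthy"]}  # workload, Kube API
cluster = {"policy": "worst", "children": [nodes, infrastructure]}

print(evaluate(node_pool))  # Healthy: one Critical node out of five is tolerated at 80%
print(evaluate(cluster))    # Warning: propagated from the container CPU monitor
```

In this sample the node pool absorbs a single unhealthy node, while the Warning on one container travels unchanged through every "worst of" aggregate up to the cluster.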