Create log search alerts from Container insights
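Container insights monitors the performance of container workloads that are deployed to managed or self-managed Kubernetes clusters. This article describes how to create log-based alerts for Azure Kubernetes Service (AKS) clusters in the following situations, so that you're alerted on what matters:
- CPU or memory utilization on cluster nodes exceeds a threshold.
- CPU or memory utilization on any container within a controller exceeds a threshold, as compared to the limit set on the corresponding resource.
- NotReady status node counts.
- Failed, Pending, Unknown, Running, or Succeeded pod phase counts.
- Free disk space on cluster nodes falls below a threshold.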
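To alert on high CPU or memory utilization, or on low free disk space on cluster nodes, use the provided queries to create a metric alert or a metric measurement alert. Metric alerts have lower latency than log search alerts, but log search alerts provide advanced querying and greater sophistication. Log search alert queries compare a datetime to the present by using the now() function and going back one hour. (Container insights stores all dates in Coordinated Universal Time [UTC] format.)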
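As a rough illustration of the one-hour lookback, the Python sketch below builds the same kind of window in UTC and applies a half-open check (start inclusive, end exclusive), which is how the queries in this article pair their two where clauses on TimeGenerated. The function names are hypothetical and the sketch only models the time arithmetic, not the query engine.

```python
from datetime import datetime, timedelta, timezone

def last_hour_window(now=None):
    """Return (start, end) of a one-hour lookback window in UTC."""
    end = now or datetime.now(timezone.utc)
    return end - timedelta(hours=1), end

def in_window(ts, start, end):
    """Half-open check: start <= ts < end."""
    return start <= ts < end

# A fixed "now" keeps the example deterministic.
fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = last_hour_window(fixed_now)
print(in_window(fixed_now - timedelta(minutes=30), start, end))  # True
print(in_window(fixed_now, start, end))  # False, the end is exclusive
```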
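Important
The queries in this article depend on data that Container insights collects and stores in a Log Analytics workspace. If you've modified the default data collection settings, the queries might not return the expected results. Most notably, if you disabled collection of performance data after enabling Prometheus metrics for the cluster, any query that uses the Perf table won't return results.
See Configure data collection in Container insights using data collection rule for preset configurations, including disabling performance data collection. For further data collection options, see Configure data collection in Container insights using ConfigMap.
If you aren't familiar with Azure Monitor alerts, see Overview of alerts in Microsoft Azure before you start. To learn more about alerts that use log queries, see Log search alerts in Azure Monitor. For more about metric alerts, see Metric alerts in Azure Monitor.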
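Log query measurements
Log search alerts can measure two different things, which can be used to monitor virtual machines in different scenarios. The queries in this article are labeled accordingly as either metric measurement or number of results.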
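Target resources and dimensions
You can use one rule to monitor the values of multiple instances by using dimensions. For example, use dimensions if you want to monitor CPU usage on multiple instances that run your website or app, and create an alert when CPU usage exceeds 80%.
To create resource-centric alerts at scale for a subscription or resource group, you can split by dimensions. When you want to monitor the same condition on multiple Azure resources, splitting by dimensions separates the alert into individual alerts by grouping unique combinations of numerical or string columns. Splitting on the Azure resource ID column makes the specified resource the alert target.
You might also decide not to split when you want a condition that spans multiple resources in the scope. For example, you might want to create an alert if at least five machines in the resource group scope have CPU usage over 80%.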
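As a rough illustration of the two choices, the Python sketch below evaluates the same 80% CPU condition once per machine (splitting by a dimension) and once across the whole scope (no splitting, firing only when at least five machines breach). The sample readings, function names, and data layout are all hypothetical; this is not how Azure Monitor evaluates rules internally.

```python
from collections import defaultdict

def alerts_split_by_dimension(readings, threshold=80.0):
    """One independent alert per machine whose average CPU exceeds the threshold."""
    by_machine = defaultdict(list)
    for machine, cpu in readings:
        by_machine[machine].append(cpu)
    return sorted(m for m, vals in by_machine.items()
                  if sum(vals) / len(vals) > threshold)

def alert_without_split(readings, threshold=80.0, min_machines=5):
    """A single alert for the whole scope, fired only if enough machines breach."""
    return len(alerts_split_by_dimension(readings, threshold)) >= min_machines

# Hypothetical sample: (machine, cpu_percent) readings.
readings = [("vm1", 90.0), ("vm1", 86.0), ("vm2", 40.0), ("vm3", 95.0)]
print(alerts_split_by_dimension(readings))  # ['vm1', 'vm3']
print(alert_without_split(readings))        # False
```

With splitting, each breaching machine becomes its own alert target; without it, the rule produces one alert for the scope and only when the count condition is met.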
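You might want to see a list of the alerts along with the affected computers. You can provide this view with a custom workbook that uses a custom Resource Graph query. To display the alerts, query the Azure Resource Graph data source in the workbook.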
Resource utilization
Average CPU utilization as an average of member nodes' CPU utilization every minute (metric measurement):
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SNode'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
| project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SNode'
| where CounterName == usageCounterName
| project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
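To make the shape of this calculation easier to follow, here is a minimal Python sketch of the same idea: take the maximum capacity per node per one-minute bin, convert each usage sample to a percentage of that capacity, and average the percentages per bin. The sample data and function names are hypothetical, the sketch ignores the ClusterName grouping, and it assumes capacity values are nonzero.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def floor_to_bin(ts):
    """Round a timestamp down to the start of its one-minute bin."""
    return ts.replace(second=0, microsecond=0)

def avg_usage_percent(capacity, usage):
    """capacity and usage are lists of (node, timestamp, value) samples.
    Returns {bin_start: average usage percent of the samples in that bin}."""
    # Maximum capacity per node and bin.
    limit = {}
    for node, ts, value in capacity:
        key = (node, floor_to_bin(ts))
        limit[key] = max(limit.get(key, value), value)
    # Express each usage sample as a percentage of its node's capacity.
    percents = defaultdict(list)
    for node, ts, value in usage:
        key = (node, floor_to_bin(ts))
        if key in limit:  # inner join: skip usage with no matching capacity
            percents[key[1]].append(value * 100.0 / limit[key])
    return {b: sum(v) / len(v) for b, v in percents.items()}

# Hypothetical samples in nanocores.
t = datetime(2024, 1, 1, 12, 0, 10, tzinfo=timezone.utc)
capacity = [("node-a", t, 2_000_000_000), ("node-b", t, 4_000_000_000)]
usage = [("node-a", t, 1_000_000_000),
         ("node-b", t + timedelta(seconds=20), 1_000_000_000)]
result = avg_usage_percent(capacity, usage)
print(result[floor_to_bin(t)])  # 37.5
```

Here node-a runs at 50% and node-b at 25%, so the bin average is 37.5%. An alert rule would then compare that per-bin value with its threshold.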
Average memory utilization as an average of member nodes' memory utilization every minute (metric measurement):
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SNode'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
| project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SNode'
| where CounterName == usageCounterName
| project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
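Important
The following queries use the placeholder values <your-cluster-name> and <your-controller-name> to represent your cluster and controller. Replace them with values specific to your environment when you set up alerts.
Average CPU utilization of all containers in a controller as an average of the CPU utilization of every container instance in the controller every minute (metric measurement):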
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SContainer'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
| project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SContainer'
| where CounterName == usageCounterName
| project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName
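Average memory utilization of all containers in a controller as an average of the memory utilization of every container instance in the controller every minute (metric measurement):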
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ObjectName == 'K8SContainer'
| where CounterName == capacityCounterName
| summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
| project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
Perf
| where TimeGenerated < endDateTime + trendBinSize
| where TimeGenerated >= startDateTime - trendBinSize
| where ObjectName == 'K8SContainer'
| where CounterName == usageCounterName
| project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName
Resource availability
Nodes and counts that have a status of Ready and NotReady (metric measurement):
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// 'has' matches the whole term, so a NotReady status is not counted as Ready
| summarize TotalCount = count(), ReadyCount = sumif(1, Status has 'Ready')
by ClusterName, Computer, bin(TimeGenerated, trendBinSize)
| extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project TimeGenerated,
ClusterName,
Computer,
ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc
The following query returns pod phase counts based on all phases: Failed, Pending, Unknown, Running, or Succeeded.
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| distinct ClusterName, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
| join hint.strategy=broadcast (
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| summarize PodStatus=any(PodStatus) by TimeGenerated, PodUid, ClusterName
| summarize TotalCount = count(),
PendingCount = sumif(1, PodStatus =~ 'Pending'),
RunningCount = sumif(1, PodStatus =~ 'Running'),
SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
FailedCount = sumif(1, PodStatus =~ 'Failed')
by ClusterName, bin(TimeGenerated, trendBinSize)
) on ClusterName, TimeGenerated
| extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
| project TimeGenerated,
TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)
Note
To alert on certain pod phases, such as Pending, Failed, or Unknown, modify the last line of the query. For example, to alert on FailedCount, use:
| summarize AggValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize)
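The counting logic in the pod phase query can be sketched in Python as follows: keep one status per pod per snapshot, tally the four named phases, treat everything else as Unknown, and divide by the number of snapshots in the bin to get an average pod count per snapshot. The record layout, function name, and sample data are hypothetical, and the sketch assumes all records belong to a single, non-empty time bin.

```python
from collections import Counter

KNOWN_PHASES = ("pending", "running", "succeeded", "failed")

def phase_counts(records):
    """records: (snapshot_time, pod_uid, pod_status) tuples from one time bin.
    Returns the average number of pods per snapshot in each phase."""
    snapshots = {ts for ts, _, _ in records}
    # Keep a single status per pod per snapshot.
    status = {(ts, uid): phase for ts, uid, phase in records}
    counts = Counter()
    for phase in status.values():
        key = phase.lower()  # compare phases case-insensitively
        counts[key if key in KNOWN_PHASES else "unknown"] += 1
    n = len(snapshots)
    return {name: counts[name] / n for name in KNOWN_PHASES + ("unknown",)}

# Hypothetical bin with two snapshots of two pods.
records = [(1, "a", "Running"), (1, "b", "Pending"),
           (2, "a", "Running"), (2, "b", "Running")]
averages = phase_counts(records)
print(averages["pending"], averages["running"])  # 0.5 1.5
```

Choosing which of these averages feeds the alert corresponds to choosing the column in the final summarize line of the query.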
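The following query returns cluster node disks whose used space exceeds 90%. To get the cluster ID, first run the following query and copy the value from the ClusterId property: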
InsightsMetrics
| extend Tags = todynamic(Tags)
| project ClusterId = Tags['container.azm.ms/clusterId']
| distinct tostring(ClusterId)
Then run the following query, replacing <cluster-id> with the value you copied:
let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'
| where Namespace == 'container.azm.ms/disk'
| extend Tags = todynamic(Tags)
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val
| where ClusterId =~ clusterId
| where DiskMetricName == 'used_percent'
| summarize AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggValue >= 90
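Individual container restarts (number of results) alert when the restart count of an individual system container exceeds a threshold in the last 10 minutes: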
let _threshold = 10m;
let _alertThreshold = 2;
let Timenow = (datetime(now) - _threshold);
let starttime = ago(5m);
KubePodInventory
| where TimeGenerated >= starttime
| where Namespace in ('default', 'kube-system') // the namespace filter goes here
| where ContainerRestartCount > _alertThreshold
| extend Tags = todynamic(ContainerLastStatus)
| extend startedAt = todynamic(Tags.startedAt)
| where startedAt >= Timenow
| summarize arg_max(TimeGenerated, *) by Name
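The filtering steps of the restart query can be sketched in Python as follows: keep rows whose restart count is above the alert threshold and whose last start falls inside the lookback window, then keep only the newest row per name. The row layout, function name, and sample data are hypothetical, and the sketch leaves out the namespace filter and the TimeGenerated filter from the query.

```python
from datetime import datetime, timedelta, timezone

def restart_alert_rows(rows, now, window=timedelta(minutes=10), alert_threshold=2):
    """rows: dicts with 'name', 'time_generated', 'restart_count', 'started_at'.
    Returns the newest qualifying row per name."""
    cutoff = now - window
    latest = {}
    for row in rows:
        if row["restart_count"] <= alert_threshold:
            continue  # not enough restarts
        if row["started_at"] < cutoff:
            continue  # last start is older than the window
        kept = latest.get(row["name"])
        if kept is None or row["time_generated"] > kept["time_generated"]:
            latest[row["name"]] = row  # keep the most recent record per name
    return list(latest.values())

# Hypothetical inventory rows.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
minute = timedelta(minutes=1)
rows = [
    {"name": "coredns", "time_generated": now - 2 * minute,
     "restart_count": 3, "started_at": now - 3 * minute},
    {"name": "coredns", "time_generated": now - minute,
     "restart_count": 4, "started_at": now - minute},
    {"name": "metrics", "time_generated": now - minute,
     "restart_count": 1, "started_at": now - minute},
    {"name": "proxy", "time_generated": now - minute,
     "restart_count": 5, "started_at": now - 30 * minute},
]
flagged = restart_alert_rows(rows, now)
print([(r["name"], r["restart_count"]) for r in flagged])  # [('coredns', 4)]
```

Because this is a number-of-results alert, the rule would fire whenever the returned list is non-empty.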
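Next steps
- To see predefined queries and examples that you can evaluate or customize for alerting, visualizing, or analyzing your clusters, see Log query examples.
- To learn more about Azure Monitor and how to monitor other aspects of your Kubernetes cluster, see View Kubernetes cluster performance and View Kubernetes cluster health.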