Create log search alerts from Container insights

2025-06-23

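Container insights monitors the performance of container workloads that are deployed to managed or self-managed Kubernetes clusters. This article describes how to create log-based alerts for Azure Kubernetes Service (AKS) clusters in the following situations:

When CPU or memory utilization on cluster nodes exceeds a threshold
When CPU or memory utilization on any container within a controller exceeds a threshold, as compared to the limit that's set on the corresponding resource
NotReady status node counts
Failed, Pending, Unknown, Running, or Succeeded pod-phase counts
When used disk space on cluster nodes exceeds a threshold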

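To alert on high CPU or memory utilization, or on low free disk space on cluster nodes, use the queries provided here to create a metric alert or a metric measurement alert. Metric alerts have lower latency than log search alerts, but log search alerts provide advanced querying and greater sophistication. Log search alert queries compare a datetime to the present by using now() and going back one hour with ago(1h). (Container insights stores all dates in Coordinated Universal Time [UTC] format.)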
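
The one-hour window used by the queries can be illustrated outside of KQL. The following is a minimal Python sketch, not part of Container insights; the function name in_last_hour and the sample timestamps are hypothetical. It mirrors the filter pair TimeGenerated >= ago(1h) and TimeGenerated < now(), and shows why a timezone-aware local timestamp still compares correctly against a UTC window.

```python
from datetime import datetime, timedelta, timezone


def in_last_hour(timestamp, now):
    """Return True when timestamp lies in [now - 1h, now).

    Mirrors: TimeGenerated >= ago(1h) and TimeGenerated < now().
    Both arguments are expected to be timezone-aware datetimes.
    """
    start = now - timedelta(hours=1)
    return start <= timestamp < now


# Hypothetical sample values, all timezone-aware.
now_utc = datetime(2025, 6, 23, 12, 0, tzinfo=timezone.utc)
recent = datetime(2025, 6, 23, 11, 30, tzinfo=timezone.utc)
stale = datetime(2025, 6, 23, 10, 59, tzinfo=timezone.utc)
# 19:30 at UTC+8 is the same instant as 11:30 UTC.
local = datetime(2025, 6, 23, 19, 30, tzinfo=timezone(timedelta(hours=8)))

print(in_last_hour(recent, now_utc))
print(in_last_hour(stale, now_utc))
print(in_last_hour(local, now_utc))
```

Because aware datetimes compare by instant, the local timestamp falls inside the window even though its wall-clock time looks later than the window end.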

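Important

The queries in this article depend on data that Container insights collects and stores in a Log Analytics workspace. If you've modified the default data collection settings, the queries might not return the expected results. Most notably, if you disabled collection of performance data after enabling Prometheus metrics for the cluster, any query that uses the Perf table won't return results.

See Configure data collection in Container insights using data collection rule for preset configurations, including disabling performance data collection. For further data collection options, see Configure data collection in Container insights using ConfigMap.

If you aren't familiar with Azure Monitor alerts, see Overview of alerts in Microsoft Azure before you start. To learn more about alerts that use log queries, see Log search alerts in Azure Monitor. For more about metric alerts, see Metric alerts in Azure Monitor.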

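Log query measurements

Log search alerts can measure two different things, which can be used to monitor virtual machines in different scenarios:

Result count: Counts the number of rows returned by the query. Use it to work with events such as Windows event logs, Syslog, and application exceptions.
Calculation of a value: Makes a calculation based on a numeric column and can cover any number of resources. An example is CPU percentage.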
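
To make the difference between the two measurements concrete, here is a small Python sketch. It is only an illustration under assumed data: the rows, the column name CpuPercent, and the helper names are hypothetical and don't come from any Azure API.

```python
from statistics import mean

# Hypothetical query result: one row per node with a numeric column.
rows = [
    {"Computer": "node-1", "CpuPercent": 91.0},
    {"Computer": "node-2", "CpuPercent": 45.5},
    {"Computer": "node-3", "CpuPercent": 88.5},
]


def result_count(rows, predicate):
    """Result count: how many rows match the condition."""
    return sum(1 for row in rows if predicate(row))


def calculated_value(rows, column):
    """Calculation of a value: aggregate a numeric column (here, the mean)."""
    return mean(row[column] for row in rows)


hot_nodes = result_count(rows, lambda row: row["CpuPercent"] > 80)
average_cpu = calculated_value(rows, "CpuPercent")
print(hot_nodes, average_cpu)
```

A result-count alert would compare hot_nodes to a threshold, while a value-based alert would compare average_cpu to a threshold.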

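Target resources and dimensions

You can use one rule to monitor the values of multiple instances by using dimensions. For example, you would use dimensions to monitor CPU usage on multiple instances that run your website or app, and to create an alert when CPU usage is over 80%.

To create resource-centric alerts at scale for a subscription or resource group, you can split by dimensions. When you want to monitor the same condition on multiple Azure resources, splitting by dimensions divides the alert into separate alerts by grouping unique combinations of numerical or string columns. Splitting on the Azure resource ID column makes the specified resource the alert target.

You might also decide not to split when you want a condition that spans multiple resources in the scope. For example, you might want to create an alert if at least five machines in the resource group scope have CPU usage over 80%.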
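
The contrast between splitting and not splitting can be sketched in a few lines of Python. This is an illustration only, under assumed inputs: the machine names, the CPU values, and both helper functions are hypothetical.

```python
def split_by_dimension(samples, threshold):
    """Splitting: one separate alert per resource that exceeds the threshold."""
    return sorted(name for name, cpu in samples.items() if cpu > threshold)


def unsplit_alert(samples, threshold, min_count):
    """No splitting: a single alert when enough resources exceed the threshold."""
    over = sum(1 for cpu in samples.values() if cpu > threshold)
    return over >= min_count


# Hypothetical CPU percentages per machine in one resource group.
samples = {"vm-a": 85, "vm-b": 92, "vm-c": 40, "vm-d": 81, "vm-e": 95, "vm-f": 88}

per_resource_alerts = split_by_dimension(samples, threshold=80)
group_alert = unsplit_alert(samples, threshold=80, min_count=5)
print(per_resource_alerts, group_alert)
```

With splitting, five machines each raise their own alert. Without splitting, the same data raises one alert for the whole scope because five machines are over 80%.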

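You might want to see a list of the alerts by affected computer. You can use a custom workbook that uses a custom resource graph to provide this view. Use the following query to display alerts, and use Azure Resource Graph as the data source in the workbook.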

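Create a log search alert rule

To create a log search alert rule by using the portal, see this example of a log search alert, which provides a complete walkthrough. You can follow the same process to create alert rules for AKS clusters by using queries similar to the ones in this article.

To create a query alert rule by using an Azure Resource Manager (ARM) template, see Resource Manager template samples for log search alert rules in Azure Monitor. You can follow the same process to create ARM templates for the log queries in this article.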

Resource utilization

Average CPU utilization as an average of member nodes' CPU utilization every minute (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

Average memory utilization as an average of member nodes' memory utilization every minute (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
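
Both node queries end with the same arithmetic: each usage sample is converted to a percentage of capacity, and the percentages are averaged per one-minute bin. The following Python sketch illustrates that calculation only. It is a simplification that assumes one fixed capacity per node instead of the per-bin maximum the queries compute, and all names and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def floor_to_minute(timestamp):
    """Equivalent in spirit to bin(TimeGenerated, 1m)."""
    return timestamp.replace(second=0, microsecond=0)


def avg_usage_percent(usage_samples, capacity_by_node):
    """Average UsageValue * 100.0 / LimitValue per one-minute bin.

    usage_samples: iterable of (node, timestamp, usage_value)
    capacity_by_node: {node: capacity}, assumed constant for this sketch
    """
    bins = defaultdict(list)
    for node, timestamp, usage in usage_samples:
        percent = usage * 100.0 / capacity_by_node[node]
        bins[floor_to_minute(timestamp)].append(percent)
    return {start: sum(values) / len(values) for start, values in sorted(bins.items())}


# Hypothetical data: capacities and usage in nanocores.
capacity = {"node-1": 2_000_000_000, "node-2": 4_000_000_000}
samples = [
    ("node-1", datetime(2025, 6, 23, 12, 0, 10, tzinfo=timezone.utc), 500_000_000),
    ("node-2", datetime(2025, 6, 23, 12, 0, 40, tzinfo=timezone.utc), 3_000_000_000),
    ("node-1", datetime(2025, 6, 23, 12, 1, 5, tzinfo=timezone.utc), 1_000_000_000),
]

result = avg_usage_percent(samples, capacity)
for bin_start, value in result.items():
    print(bin_start.isoformat(), value)
```

In the first bin, the two nodes run at 25% and 75%, so the cluster average is 50%. The second bin contains a single sample at 50%.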

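Important

The following queries use the placeholder values <your-cluster-name> and <your-controller-name> to represent your cluster and controller. Replace them with values specific to your environment when you set up alerts.

Average CPU utilization of all containers in a controller as an average of the CPU utilization of every container instance in the controller every minute (metric measurement):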

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

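Average memory utilization of all containers in a controller as an average of the memory utilization of every container instance in the controller every minute (metric measurement):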

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

Resource availability

Nodes and counts that have a status of Ready and NotReady (metric measurement):

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
    KubeNodeInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | summarize TotalCount = count(), ReadyCount = sumif(1, Status contains ('Ready'))
                by ClusterName, Computer,  bin(TimeGenerated, trendBinSize)
    | extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project   TimeGenerated,
            ClusterName,
            Computer,
            ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
            NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc

The following query returns pod phase counts based on all phases: Failed, Pending, Unknown, Running, or Succeeded.

let endDateTime = now(); 
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ClusterName == clusterName
    | distinct ClusterName, TimeGenerated
    | summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
    | join hint.strategy=broadcast (
        KubePodInventory
        | where TimeGenerated < endDateTime
        | where TimeGenerated >= startDateTime
        | summarize PodStatus=any(PodStatus) by TimeGenerated, PodUid, ClusterName
        | summarize TotalCount = count(),
                    PendingCount = sumif(1, PodStatus =~ 'Pending'),
                    RunningCount = sumif(1, PodStatus =~ 'Running'),
                    SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
                    FailedCount = sumif(1, PodStatus =~ 'Failed')
                by ClusterName, bin(TimeGenerated, trendBinSize)
    ) on ClusterName, TimeGenerated
    | extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
    | project TimeGenerated,
              TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
              PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
              RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
              SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
              FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
              UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)

Note

To alert on certain pod phases, such as Pending, Failed, or Unknown, modify the last line of the query. For example, to alert on FailedCount, use | summarize AggValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize).
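
The pod phase query counts four phases explicitly and derives the Unknown count as whatever remains of the total. A minimal Python sketch of that bookkeeping follows. The function name and sample phases are hypothetical, and the case-insensitive match is meant to mirror the =~ operator used in the query.

```python
def phase_counts(phases):
    """Count Pending, Running, Succeeded, and Failed; derive Unknown as the remainder."""
    known = ("pending", "running", "succeeded", "failed")
    counts = {name: 0 for name in known}
    for phase in phases:
        key = phase.lower()  # case-insensitive, like =~ in the query
        if key in counts:
            counts[key] += 1
    counts["total"] = len(phases)
    counts["unknown"] = counts["total"] - sum(counts[name] for name in known)
    return counts


# Hypothetical pod phases observed in one snapshot.
snapshot = ["Running", "Running", "Pending", "Failed", "Unknown", "running"]
counts = phase_counts(snapshot)
print(counts)
```

Any phase string that isn't one of the four named phases ends up in the unknown bucket, which is the same effect as the subtraction in the query.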

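The following query returns cluster node disks whose used space is 90% or more. To get the cluster ID, first run the following query and copy the value from the ClusterId property: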

InsightsMetrics
| extend Tags = todynamic(Tags)            
| project ClusterId = Tags['container.azm.ms/clusterId']   
| distinct tostring(ClusterId)

let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'            
| where Namespace == 'container.azm.ms/disk'            
| extend Tags = todynamic(Tags)            
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val   
| where ClusterId =~ clusterId       
| where DiskMetricName == 'used_percent'
| summarize AggValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggValue >= 90

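Individual container restarts (number of results) alerts when the individual system container restart count exceeds a threshold for the last 10 minutes: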

let _threshold = 10m; 
let _alertThreshold = 2;
let Timenow = (datetime(now) - _threshold); 
let starttime = ago(5m); 
KubePodInventory
| where TimeGenerated >= starttime
| where Namespace in ('default', 'kube-system') // the namespace filter goes here
| where ContainerRestartCount > _alertThreshold
| extend Tags = todynamic(ContainerLastStatus)
| extend startedAt = todatetime(Tags.startedAt) // cast to datetime so the comparison with Timenow is typed
| where startedAt >= Timenow
| summarize arg_max(TimeGenerated, *) by Name

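Next steps

To see predefined queries and examples that you can evaluate or customize for alerting, visualizing, or analyzing your clusters, see Log query examples.
To learn more about Azure Monitor and how to monitor other aspects of your Kubernetes cluster, see View Kubernetes cluster performance and View Kubernetes cluster health.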
