How to set up alerts for performance problems in Azure Monitor for containers

Azure Monitor for containers monitors the performance of container workloads that are deployed to Azure Container Instances or to managed Kubernetes clusters that are hosted on Azure Kubernetes Service (AKS).

This article describes how to enable alerts for the following situations:

  • When CPU or memory utilization on cluster nodes exceeds a threshold
  • When CPU or memory utilization on any container within a controller exceeds a threshold, as compared to the limit that's set on the corresponding resource
  • NotReady status node counts
  • Failed, Pending, Unknown, Running, or Succeeded pod-phase counts
  • When used disk space on cluster nodes exceeds a threshold

To alert for high CPU or memory utilization, or low free disk space on cluster nodes, use the queries that are provided to create a metric alert or a metric measurement alert. Metric alerts have lower latency than log alerts, but log alerts provide advanced querying and greater sophistication. Log alert queries compare a datetime to the present by using the now operator and going back one hour. (Azure Monitor for containers stores all dates in Coordinated Universal Time (UTC) format.)

If you're not familiar with Azure Monitor alerts, see Overview of alerts in Microsoft Azure before you start. To learn more about alerts that use log queries, see Log alerts in Azure Monitor. For more about metric alerts, see Metric alerts in Azure Monitor.

Resource utilization log search queries

The queries in this section support each alerting scenario. They're used in step 7 of the Create an alert rule section of this article.

The following query calculates average CPU utilization as an average of member nodes' CPU utilization every minute.

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuCapacityNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggregatedValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName
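
The comment in the query above marks where a cluster filter belongs when several clusters report to the same Log Analytics workspace. As a minimal sketch, assuming you filter on the ClusterName column that the query already selects, the first part of the query could look like this. The clusterName variable and its placeholder value are illustrative and are not part of the original query.

```kusto
// Illustrative variable; replace the placeholder with your cluster's name.
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated >= ago(1h) and TimeGenerated < now()
// Cluster filter: keep only nodes that belong to the named cluster.
| where ClusterName == clusterName
| distinct ClusterName, Computer
```

The same filter would apply to the memory utilization query, which has the same structure.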

The following query calculates average memory utilization as an average of member nodes' memory utilization every minute.

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryCapacityBytes';
let usageCounterName = 'memoryRssBytes';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
// cluster filter would go here if multiple clusters are reporting to the same Log Analytics workspace
| distinct ClusterName, Computer
| join hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime
  | where TimeGenerated >= startDateTime
  | where ObjectName == 'K8SNode'
  | where CounterName == capacityCounterName
  | summarize LimitValue = max(CounterValue) by Computer, CounterName, bin(TimeGenerated, trendBinSize)
  | project Computer, CapacityStartTime = TimeGenerated, CapacityEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer
| join kind=inner hint.strategy=shuffle (
  Perf
  | where TimeGenerated < endDateTime + trendBinSize
  | where TimeGenerated >= startDateTime - trendBinSize
  | where ObjectName == 'K8SNode'
  | where CounterName == usageCounterName
  | project Computer, UsageValue = CounterValue, TimeGenerated
) on Computer
| where TimeGenerated >= CapacityStartTime and TimeGenerated < CapacityEndTime
| project ClusterName, Computer, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggregatedValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize), ClusterName

Important

The following queries use the placeholder values <your-cluster-name> and <your-controller-name> to represent your cluster and controller. Replace them with values specific to your environment when you set up alerts.

The following query calculates the average CPU utilization of all containers in a controller, every minute, as an average of the CPU utilization of every container instance in the controller. The measurement is a percentage of the limit set for a container.

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'cpuLimitNanoCores';
let usageCounterName = 'cpuUsageNanoCores';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggregatedValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

The following query calculates the average memory utilization of all containers in a controller, every minute, as an average of the memory utilization of every container instance in the controller. The measurement is a percentage of the limit set for a container.

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let capacityCounterName = 'memoryLimitBytes';
let usageCounterName = 'memoryRssBytes';
let clusterName = '<your-cluster-name>';
let controllerName = '<your-controller-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| where ControllerName == controllerName
| extend InstanceName = strcat(ClusterId, '/', ContainerName),
         ContainerName = strcat(controllerName, '/', tostring(split(ContainerName, '/')[1]))
| distinct Computer, InstanceName, ContainerName
| join hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | where ObjectName == 'K8SContainer'
    | where CounterName == capacityCounterName
    | summarize LimitValue = max(CounterValue) by Computer, InstanceName, bin(TimeGenerated, trendBinSize)
    | project Computer, InstanceName, LimitStartTime = TimeGenerated, LimitEndTime = TimeGenerated + trendBinSize, LimitValue
) on Computer, InstanceName
| join kind=inner hint.strategy=shuffle (
    Perf
    | where TimeGenerated < endDateTime + trendBinSize
    | where TimeGenerated >= startDateTime - trendBinSize
    | where ObjectName == 'K8SContainer'
    | where CounterName == usageCounterName
    | project Computer, InstanceName, UsageValue = CounterValue, TimeGenerated
) on Computer, InstanceName
| where TimeGenerated >= LimitStartTime and TimeGenerated < LimitEndTime
| project Computer, ContainerName, TimeGenerated, UsagePercent = UsageValue * 100.0 / LimitValue
| summarize AggregatedValue = avg(UsagePercent) by bin(TimeGenerated, trendBinSize) , ContainerName

The following query returns all nodes, with counts of how often each had a status of Ready and NotReady.

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| distinct ClusterName, Computer, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName, Computer
| join hint.strategy=broadcast kind=inner (
    KubeNodeInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | summarize TotalCount = count(), ReadyCount = sumif(1, Status contains ('Ready'))
                by ClusterName, Computer,  bin(TimeGenerated, trendBinSize)
    | extend NotReadyCount = TotalCount - ReadyCount
) on ClusterName, Computer, TimeGenerated
| project   TimeGenerated,
            ClusterName,
            Computer,
            ReadyCount = todouble(ReadyCount) / ClusterSnapshotCount,
            NotReadyCount = todouble(NotReadyCount) / ClusterSnapshotCount
| order by ClusterName asc, Computer asc, TimeGenerated desc
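
Unlike the other queries in this section, the node status query doesn't end with an AggregatedValue column. The following is a simplified sketch, not the article's query, of how a NotReady node count could be shaped for a metric measurement alert. It assumes that such alerts expect an AggregatedValue column grouped by bin(TimeGenerated, ...), and it reuses the Ready test from the query above.

```kusto
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubeNodeInventory
| where TimeGenerated >= ago(1h) and TimeGenerated < now()
| where ClusterName == clusterName
// Count the distinct nodes that reported a status other than Ready in each one-minute bin.
| summarize AggregatedValue = dcountif(Computer, not(Status contains 'Ready'))
    by bin(TimeGenerated, trendBinSize), ClusterName
```

With this shape, a threshold of greater than 0 would flag any minute in which at least one node reported a status other than Ready.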

The following query returns pod phase counts based on all phases: Failed, Pending, Unknown, Running, or Succeeded.

let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where ClusterName == clusterName
| distinct ClusterName, TimeGenerated
| summarize ClusterSnapshotCount = count() by bin(TimeGenerated, trendBinSize), ClusterName
| join hint.strategy=broadcast (
    KubePodInventory
    | where TimeGenerated < endDateTime
    | where TimeGenerated >= startDateTime
    | distinct ClusterName, Computer, PodUid, TimeGenerated, PodStatus
    | summarize TotalCount = count(),
                PendingCount = sumif(1, PodStatus =~ 'Pending'),
                RunningCount = sumif(1, PodStatus =~ 'Running'),
                SucceededCount = sumif(1, PodStatus =~ 'Succeeded'),
                FailedCount = sumif(1, PodStatus =~ 'Failed')
             by ClusterName, bin(TimeGenerated, trendBinSize)
) on ClusterName, TimeGenerated
| extend UnknownCount = TotalCount - PendingCount - RunningCount - SucceededCount - FailedCount
| project TimeGenerated,
          TotalCount = todouble(TotalCount) / ClusterSnapshotCount,
          PendingCount = todouble(PendingCount) / ClusterSnapshotCount,
          RunningCount = todouble(RunningCount) / ClusterSnapshotCount,
          SucceededCount = todouble(SucceededCount) / ClusterSnapshotCount,
          FailedCount = todouble(FailedCount) / ClusterSnapshotCount,
          UnknownCount = todouble(UnknownCount) / ClusterSnapshotCount
| summarize AggregatedValue = avg(PendingCount) by bin(TimeGenerated, trendBinSize)

Note

To alert on certain pod phases, such as Pending, Failed, or Unknown, modify the last line of the query. For example, to alert on FailedCount use:
| summarize AggregatedValue = avg(FailedCount) by bin(TimeGenerated, trendBinSize)
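
If a single alert should cover several unhealthy phases at once, one option is to count pods in any phase other than Running or Succeeded. The following is a simplified sketch under that assumption. It isn't the article's query, and it skips the snapshot normalization that the full query performs.

```kusto
let trendBinSize = 1m;
let clusterName = '<your-cluster-name>';
KubePodInventory
| where TimeGenerated >= ago(1h) and TimeGenerated < now()
| where ClusterName == clusterName
// Collapse repeated inventory records to one row per pod, phase, and minute.
| summarize by ClusterName, PodUid, PodStatus, bin(TimeGenerated, trendBinSize)
// Count pods whose phase is anything other than Running or Succeeded.
| summarize AggregatedValue = countif(PodStatus !in~ ('Running', 'Succeeded'))
    by bin(TimeGenerated, trendBinSize), ClusterName
```

A pod that changes phase within a minute contributes one row per phase, so treat the result as an approximation.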

The following queries identify cluster node disks that are more than 90% full. The disk query requires your cluster ID. To get it, first run this query and copy the value from the ClusterId property:

InsightsMetrics
| extend Tags = todynamic(Tags)            
| project ClusterId = Tags['container.azm.ms/clusterId']   
| distinct tostring(ClusterId)

Then replace <cluster-id> in the following query with that value:

let clusterId = '<cluster-id>';
let endDateTime = now();
let startDateTime = ago(1h);
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated < endDateTime
| where TimeGenerated >= startDateTime
| where Origin == 'container.azm.ms/telegraf'            
| where Namespace == 'container.azm.ms/disk'            
| extend Tags = todynamic(Tags)            
| project TimeGenerated, ClusterId = Tags['container.azm.ms/clusterId'], Computer = tostring(Tags.hostName), Device = tostring(Tags.device), Path = tostring(Tags.path), DiskMetricName = Name, DiskMetricValue = Val   
| where ClusterId =~ clusterId       
| where DiskMetricName == 'used_percent'
| summarize AggregatedValue = max(DiskMetricValue) by bin(TimeGenerated, trendBinSize)
| where AggregatedValue >= 90
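
The alert configuration later in this article selects ClusterId under Aggregate on for the low disk alert, but the query above groups only by time bin, so its results have no ClusterId column. As a hedged sketch, assuming the alert needs that column in the query output, a variant that keeps ClusterId in the grouping could look like this. The column names come from the query above; the restructuring is illustrative.

```kusto
let clusterId = '<cluster-id>';
let trendBinSize = 1m;
InsightsMetrics
| where TimeGenerated >= ago(1h) and TimeGenerated < now()
| where Origin == 'container.azm.ms/telegraf'
| where Namespace == 'container.azm.ms/disk'
| where Name == 'used_percent'
| extend Tags = todynamic(Tags)
// Convert the dynamic tag value to a string so it can be used as a grouping key.
| extend ClusterId = tostring(Tags['container.azm.ms/clusterId'])
| where ClusterId =~ clusterId
// Highest disk usage percentage seen in the cluster during each one-minute bin.
| summarize AggregatedValue = max(Val) by bin(TimeGenerated, trendBinSize), ClusterId
```

This variant omits the final filter on 90 so that the threshold can be set in the alert rule instead.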

Create an alert rule

Follow these steps to create a log alert in Azure Monitor by using one of the log search queries that were provided earlier.

  1. Sign in to the Azure portal.

  2. Select Monitor from the pane on the left side. Under Insights, select Containers.

  3. On the Monitored Clusters tab, select a cluster from the list.

  4. In the pane on the left side under Monitoring, select Logs to open the Azure Monitor logs page. You use this page to write and execute Azure Log Analytics queries.

  5. On the Logs page, select +New alert rule.

  6. In the Condition section, select the predefined custom log condition Whenever the Custom log search is <logic undefined>. The custom log search signal type is automatically selected because we're creating an alert rule directly from the Azure Monitor logs page.

  7. Paste one of the queries provided earlier into the Search query field.

  8. Configure the alert as follows:

    1. From the Based on drop-down list, select Metric measurement. A metric measurement creates an alert for each object in the query that has a value above our specified threshold.
    2. For Condition, select Greater than, and enter 75 as an initial baseline Threshold for the CPU and memory utilization alerts. For the low disk space alert, enter 90. Or enter a different value that meets your criteria.
    3. In the Trigger Alert Based On section, select Consecutive breaches. From the drop-down list, select Greater than, and enter 2.
    4. To configure an alert for container CPU or memory utilization, under Aggregate on, select ContainerName. To configure the cluster node low disk alert, select ClusterId.
    5. In the Evaluated based on section, set the Period value to 60 minutes. The rule will run every 5 minutes and return records that were created within the last hour from the current time. Setting the time period to a wide window accounts for potential data latency. It also ensures that the query returns data, which avoids a false negative in which the alert never fires.
  9. Select Done to complete the alert rule.

  10. Enter a name in the Alert rule name field. Specify a Description that provides details about the alert, and select an appropriate severity level from the options provided.

  11. To immediately activate the alert rule, accept the default value for Enable rule upon creation.

  12. Select an existing Action Group or create a new group. This step ensures that the same actions are taken every time that an alert is triggered. Configure it based on how your IT or DevOps operations team manages incidents.

  13. Select Create alert rule to complete the alert rule. It starts running immediately.

Next steps