Troubleshooting problems in Azure Monitor metric alerts

This article discusses common problems in Azure Monitor metric alerts and how to troubleshoot them.

Azure Monitor alerts proactively notify you when important conditions are found in your monitoring data. They allow you to identify and address issues before the users of your system notice them. For more information on alerting, see Overview of alerts in Azure.

Metric alert should have fired but didn't

If you believe a metric alert should have fired but it didn't, and you can't find it in the Azure portal, try the following steps:

  1. Configuration - Review the metric alert rule configuration to make sure it's properly configured:

    • Check that the Aggregation type, Aggregation granularity (period), and Threshold value or Sensitivity are configured as expected

    • For an alert rule that uses Dynamic Thresholds, check whether advanced settings are configured, because 'Number of violations' may filter alerts and 'Ignore data before' can affect how the thresholds are calculated

      Note

      Dynamic Thresholds require at least 3 days and 30 metric samples before becoming active.

  2. Fired but no notification - Review the fired alerts list to see if you can locate the fired alert. If you can see the alert in the list, but have an issue with some of its actions or notifications, see more information here.

  3. Already active - Check if there's already a fired alert on the metric time series you expected to get an alert for. Metric alerts are stateful, meaning that once an alert is fired on a specific metric time series, additional alerts on that time series will not be fired until the issue is no longer observed. This design choice reduces noise. The alert is resolved automatically when the alert condition is not met for three consecutive evaluations.

  4. Dimensions used - If you've selected some dimension values for a metric, the alert rule monitors each individual metric time series (as defined by the combination of dimension values) for a threshold breach. To also monitor the aggregate metric time series (without any dimensions selected), configure an additional alert rule on the metric without selecting dimensions.

  5. Aggregation and time granularity - If you're visualizing the metric using metrics charts, ensure that:

    • The selected Aggregation in the metric chart is the same as the Aggregation type in your alert rule
    • The selected Time granularity is the same as the Aggregation granularity (period) in your alert rule (and not set to 'Automatic')
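
The stateful behavior described in step 3 can be pictured with a small simulation. This is only a toy model under stated assumptions: it uses a static "greater than" threshold, one already-aggregated value per evaluation, and hypothetical names (`StatefulAlert`, `simulate`). It is not how the service is implemented; it only illustrates why a second breach on an already-fired time series produces no new alert.

```python
from dataclasses import dataclass


@dataclass
class StatefulAlert:
    """Toy model of one metric time series under a stateful alert rule."""
    threshold: float
    resolve_after: int = 3    # consecutive healthy evaluations needed to resolve
    fired: bool = False
    healthy_streak: int = 0

    def evaluate(self, value: float) -> str:
        """Return 'fired', 'resolved', or 'no-change' for one evaluation."""
        if value > self.threshold:          # assumption: GreaterThan operator
            self.healthy_streak = 0
            if not self.fired:
                self.fired = True
                return "fired"
            return "no-change"              # already active: no additional alert
        if self.fired:
            self.healthy_streak += 1
            if self.healthy_streak >= self.resolve_after:
                self.fired = False
                self.healthy_streak = 0
                return "resolved"
        return "no-change"


def simulate(values, threshold):
    """Run a sequence of evaluation values through a fresh alert state."""
    alert = StatefulAlert(threshold=threshold)
    return [alert.evaluate(v) for v in values]


if __name__ == "__main__":
    for value, outcome in zip([90, 95, 10, 10, 10, 92],
                              simulate([90, 95, 10, 10, 10, 92], threshold=80)):
        print(value, outcome)
```

In this model the second breach (95) is silent because the alert is already active, and a new alert can fire only after three consecutive evaluations below the threshold have resolved the previous one.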

Metric alert fired when it shouldn't have

If you believe your metric alert shouldn't have fired but it did, the following steps might help resolve the issue.

  1. Review the fired alerts list to locate the fired alert, and click it to view its details. Review the information provided under 'Why did this alert fire?' to see the metric chart, Metric Value, and Threshold value at the time the alert was triggered.

    Note

    If you're using a Dynamic Thresholds condition type and think that the thresholds used were not correct, please provide feedback using the frown icon. This feedback informs the machine learning algorithmic research and helps improve future detections.

  2. If you've selected multiple dimension values for a metric, the alert will be triggered when any of the metric time series (as defined by the combination of dimension values) breaches the threshold. For more information about using dimensions in metric alerts, see here.

  3. Review the alert rule configuration to make sure it's properly configured:

    • Check that the Aggregation type, Aggregation granularity (period), and Threshold value or Sensitivity are configured as expected
    • For an alert rule that uses Dynamic Thresholds, check whether advanced settings are configured, because 'Number of violations' may filter alerts and 'Ignore data before' can affect how the thresholds are calculated

    Note

    Dynamic Thresholds require at least 3 days and 30 metric samples before becoming active.

  4. If you're visualizing the metric using a metrics chart, ensure that:

    • The selected Aggregation in the metric chart is the same as the Aggregation type in your alert rule
    • The selected Time granularity is the same as the Aggregation granularity (period) in your alert rule (and not set to 'Automatic')
  5. If the alert fired while there were already fired alerts that monitor the same criteria (and that aren't resolved), check whether the alert rule has been configured with the autoMitigate property set to false (this property can only be configured via REST/PowerShell/CLI, so check the script used to deploy the alert rule). In that case, the alert rule does not automatically resolve fired alerts, and does not require a fired alert to be resolved before firing again.

Can't find the metric to alert on

If you're looking to alert on a specific metric but can't see it when creating an alert rule, check the following:

Can't find the metric dimension to alert on

If you're looking to alert on specific dimension values of a metric, but cannot find these values, note the following:

  1. It might take a few minutes for the dimension values to appear under the Dimension values list
  2. The displayed dimension values are based on metric data collected in the last three days
  3. If the dimension value isn't emitted yet, click the '+' sign to add a custom value
  4. If you'd like to alert on all possible values of a dimension (including future values), check the 'Select *' checkbox

Metric alert rules still defined on a deleted resource

When you delete an Azure resource, associated metric alert rules aren't deleted automatically. To delete alert rules associated with a resource that has been deleted:

  1. Open the resource group in which the deleted resource was defined
  2. In the list displaying the resources, check the Show hidden types checkbox
  3. Filter the list by Type == microsoft.insights/metricalerts
  4. Select the relevant alert rules and select Delete
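
If you already have a resource listing exported from a script, the same filter can be applied offline. The sketch below is an illustration only and assumes a hypothetical input shape: each resource is a dict with `name`, `type`, and (for alert rules) `scopes`, and `existing_ids` is the set of resource IDs that still exist. Rules scoped to a resource group or subscription would need different handling.

```python
METRIC_ALERT_TYPE = "microsoft.insights/metricalerts"


def find_orphaned_metric_alerts(resources, existing_ids):
    """Return names of metric alert rules whose scopes all point at missing resources.

    `resources` and `existing_ids` use a hypothetical shape chosen for this sketch.
    """
    existing = {rid.lower() for rid in existing_ids}
    orphaned = []
    for res in resources:
        # Same filter as the portal step: Type == microsoft.insights/metricalerts
        if res.get("type", "").lower() != METRIC_ALERT_TYPE:
            continue
        scopes = res.get("scopes", [])
        if scopes and all(scope.lower() not in existing for scope in scopes):
            orphaned.append(res["name"])
    return orphaned


if __name__ == "__main__":
    listing = [
        {"name": "vm1", "type": "Microsoft.Compute/virtualMachines"},
        {"name": "cpu-high-vm1", "type": "Microsoft.Insights/metricAlerts",
         "scopes": ["/subscriptions/s/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm1"]},
        {"name": "cpu-high-vm2", "type": "Microsoft.Insights/metricAlerts",
         "scopes": ["/subscriptions/s/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm2"]},
    ]
    still_there = ["/subscriptions/s/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm1"]
    print(find_orphaned_metric_alerts(listing, still_there))
```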

Make metric alerts occur every time my condition is met

Metric alerts are stateful by default, and therefore additional alerts are not fired if there's already a fired alert on a given time series. If you wish to make a specific metric alert rule stateless, and get alerted on every evaluation in which the alert condition is met, create the alert rule programmatically (for example, via Resource Manager, PowerShell, REST, or CLI), and set the autoMitigate property to 'False'.

Note

Making a metric alert rule stateless prevents fired alerts from becoming resolved, so even after the condition is no longer met, the fired alerts remain in a fired state until the 30-day retention period ends.
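
As a rough illustration of where the property sits, the sketch below builds a Resource Manager resource fragment for a metric alert rule with autoMitigate disabled and prints it as JSON. The helper name, the placeholder scope, the severity, and the condition values are assumptions made for this example; the property layout follows the Microsoft.Insights/metricAlerts schema as best understood here, so compare it against a template exported from one of your own rules before relying on it.

```python
import json


def stateless_metric_alert(name, scope, metric_name, threshold):
    """Build a metric alert resource fragment with autoMitigate disabled.

    Hypothetical helper for illustration; values other than autoMitigate
    are placeholders.
    """
    return {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": name,
        "location": "global",
        "properties": {
            "severity": 3,
            "enabled": True,
            "scopes": [scope],
            "evaluationFrequency": "PT1M",
            "windowSize": "PT5M",
            "criteria": {
                "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
                "allOf": [{
                    "name": "condition1",
                    "metricName": metric_name,
                    "operator": "GreaterThan",
                    "threshold": threshold,
                    "timeAggregation": "Average",
                    "criterionType": "StaticThresholdCriterion",
                }],
            },
            # False makes the rule stateless: it fires on every evaluation
            # that meets the condition and never resolves automatically.
            "autoMitigate": False,
        },
    }


if __name__ == "__main__":
    resource = stateless_metric_alert(
        name="cpu-high-stateless",
        scope="/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
              "/providers/Microsoft.Compute/virtualMachines/<vm-name>",
        metric_name="Percentage CPU",
        threshold=80,
    )
    print(json.dumps(resource, indent=2))
```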

Export the Azure Resource Manager template of a metric alert rule via the Azure portal

Exporting the Resource Manager template of a metric alert rule helps you understand its JSON syntax and properties, and can be used to automate future deployments.

  1. Navigate to the Resource Groups section in the portal, and select the resource group containing the rule.
  2. In the Overview section, check the Show hidden types checkbox.
  3. In the Type filter, select microsoft.insights/metricalerts.
  4. Select the relevant alert rule to view its details.
  5. Under Settings, select Export template.

Metric alert rules quota too small

The allowed number of metric alert rules per subscription is subject to quota limits.

If you've reached the quota limit, the following steps may help resolve the issue:

  1. Try deleting or disabling metric alert rules that aren't used anymore.

  2. Switch to using metric alert rules that monitor multiple resources. With this capability, a single alert rule can monitor multiple resources while counting as only one alert rule against the quota. For more information about this capability and the supported resource types, see here.

  3. If you need the quota limit to be increased, open a support request, and provide the following information:

    • Subscription ID(s) for which the quota limit needs to be increased
    • Resource type for the quota increase: Metric alerts or Metric alerts (Classic)
    • Requested quota limit

Check total number of metric alert rules

To check the current usage of metric alert rules, follow the steps below.

From the Azure portal

  1. Open the Alerts screen, and click Manage alert rules
  2. Filter to the relevant subscription by using the Subscription dropdown control
  3. Make sure NOT to filter to a specific resource group, resource type, or resource
  4. In the Signal type dropdown control, select Metrics
  5. Verify that the Status dropdown control is set to Enabled
  6. The total number of metric alert rules is displayed above the alert rules list

From API

Managing alert rules using Resource Manager templates, REST API, PowerShell, or Azure CLI

If you're running into issues creating, updating, retrieving, or deleting metric alerts using Resource Manager templates, REST API, PowerShell, or the Azure command-line interface (CLI), the following steps may help resolve the issue.

Resource Manager templates

REST API

Review the REST API guide to verify that you're passing all the parameters correctly

PowerShell

Make sure that you're using the right PowerShell cmdlets for metric alerts:

Azure CLI

Make sure that you're using the right CLI commands for metric alerts:

General

  • If you're receiving a Metric not found error:

    • For a platform metric: Make sure that you're using the Metric name from the Azure Monitor supported metrics page, and not the Metric Display Name

    • For a custom metric: Make sure that the metric is already being emitted (you cannot create an alert rule on a custom metric that doesn't yet exist), and that you're providing the custom metric's namespace (see a Resource Manager template example here)

  • If you're creating metric alerts on logs, ensure appropriate dependencies are included. See sample template.

  • If you're creating an alert rule that contains multiple criteria, note the following constraints:

    • You can only select one value per dimension within each criterion
    • You cannot use "*" as a dimension value
    • When metrics that are configured in different criteria support the same dimension, a configured dimension value must be explicitly set in the same way for all of those metrics (see a Resource Manager template example here)

No permissions to create metric alert rules

To create a metric alert rule, you'll need to have the following permissions:

  • Read permission on the target resource of the alert rule
  • Write permission on the resource group in which the alert rule is created (if you're creating the alert rule from the Azure portal, the alert rule is created by default in the same resource group in which the target resource resides)
  • Read permission on any action group associated with the alert rule (if applicable)

Naming restrictions for metric alert rules

Consider the following restrictions for metric alert rule names:

  • Metric alert rule names can't be changed (renamed) once created
  • Metric alert rule names must be unique within a resource group
  • Metric alert rule names can't contain the following characters: * # & + : < > ? @ % { } \ /
  • Metric alert rule names can't end with a space or a period
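
The two restrictions that can be checked locally (forbidden characters and the trailing space or period) lend themselves to a quick pre-deployment check. The function below is a minimal sketch with a hypothetical name; it does not check uniqueness within the resource group, which requires querying existing rules.

```python
# Characters listed in the naming restrictions above.
FORBIDDEN_CHARS = set("*#&+:<>?@%{}\\/")


def rule_name_problems(name):
    """Return a list of human-readable problems with a proposed rule name."""
    problems = []
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("contains forbidden characters: " + " ".join(bad))
    if name.endswith((" ", ".")):
        problems.append("ends with a space or a period")
    return problems


if __name__ == "__main__":
    for candidate in ["cpu-high", "cpu>90%", "latency alert."]:
        print(repr(candidate), rule_name_problems(candidate) or "ok")
```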

Restrictions when using dimensions in a metric alert rule with multiple conditions

Metric alerts support alerting on multi-dimensional metrics and also support defining multiple conditions (up to 5 conditions per alert rule).

Consider the following constraints when using dimensions in an alert rule that contains multiple conditions:

  • You can only select one value per dimension within each condition.
  • You can't use the option to "Select all current and future values" (Select *).
  • When metrics that are configured in different conditions support the same dimension, a configured dimension value must be explicitly set in the same way for all of those metrics (in the relevant conditions). For example:
    • Consider a metric alert rule that is defined on a storage account and monitors two conditions:
      • Total Transactions > 5
      • Average SuccessE2ELatency > 250 ms
    • I'd like to update the first condition, and only monitor transactions where the ApiName dimension equals "GetBlob"
    • Because both the Transactions and SuccessE2ELatency metrics support an ApiName dimension, I'll need to update both conditions, and have both of them specify the ApiName dimension with a "GetBlob" value.
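
The storage account example above can be expressed as a small consistency check. This is a sketch under assumptions: conditions are modeled as dicts mapping each dimension name to a single value (which already enforces one value per dimension), `supported_dimensions` is a hand-written map of which metrics support which dimensions, and the function name is hypothetical.

```python
def dimension_problems(conditions, supported_dimensions):
    """Check the multi-condition dimension constraints on a proposed rule.

    conditions: list of {"metric": str, "dimensions": {name: value}}
    supported_dimensions: {metric: set of dimension names it supports}
    """
    problems = []
    configured = {}
    for cond in conditions:
        for dim, value in cond["dimensions"].items():
            if value == "*":
                problems.append(f"{cond['metric']}: '*' is not allowed for {dim}")
            configured.setdefault(dim, set()).add(value)

    for dim, values in sorted(configured.items()):
        if len(values) > 1:
            problems.append(f"{dim}: conflicting values {sorted(values)}")
        for cond in conditions:
            supports = dim in supported_dimensions.get(cond["metric"], set())
            if supports and dim not in cond["dimensions"]:
                problems.append(f"{cond['metric']}: {dim} must be set explicitly")
    return problems


if __name__ == "__main__":
    supported = {"Transactions": {"ApiName"}, "SuccessE2ELatency": {"ApiName"}}
    only_first_updated = [
        {"metric": "Transactions", "dimensions": {"ApiName": "GetBlob"}},
        {"metric": "SuccessE2ELatency", "dimensions": {}},
    ]
    both_updated = [
        {"metric": "Transactions", "dimensions": {"ApiName": "GetBlob"}},
        {"metric": "SuccessE2ELatency", "dimensions": {"ApiName": "GetBlob"}},
    ]
    print(dimension_problems(only_first_updated, supported))
    print(dimension_problems(both_updated, supported))
```

Updating only the first condition is flagged because SuccessE2ELatency also supports ApiName; setting "GetBlob" on both conditions clears the problem.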

Setting the alert rule's Period and Frequency

We recommend choosing an Aggregation granularity (Period) that is larger than the Frequency of evaluation, to reduce the likelihood of missing the first evaluation of added time series in the following cases:

  • Metric alert rule that monitors multiple dimensions - When a new dimension value combination is added
  • Metric alert rule that monitors multiple resources - When a new resource is added to the scope
  • Metric alert rule that monitors a metric that isn't emitted continuously (sparse metric) - When the metric is emitted after a period longer than 24 hours in which it wasn't emitted
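
This recommendation can be checked mechanically against the rule definitions you deploy. The sketch below assumes the period and frequency are written as simple ISO 8601 durations made of hours and minutes (for example "PT5M" or "PT1H"), as they commonly appear in the windowSize and evaluationFrequency template properties; the function names are hypothetical and the parser deliberately handles only that subset.

```python
import re
from datetime import timedelta

# Handles only durations of the form PT<h>H<m>M, e.g. PT1M, PT15M, PT1H, PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def parse_duration(text):
    """Parse a simple ISO 8601 hours/minutes duration into a timedelta."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return timedelta(hours=hours, minutes=minutes)


def period_larger_than_frequency(window_size, evaluation_frequency):
    """True when the aggregation period is strictly larger than the frequency."""
    return parse_duration(window_size) > parse_duration(evaluation_frequency)


if __name__ == "__main__":
    for window, frequency in [("PT5M", "PT1M"), ("PT1M", "PT1M"), ("PT1H", "PT15M")]:
        verdict = "ok" if period_larger_than_frequency(window, frequency) else "consider a larger period"
        print(window, frequency, verdict)
```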

Next steps