Metric Alerts with Dynamic Thresholds in Azure Monitor

Metric alerts with Dynamic Thresholds use advanced machine learning (ML) to learn a metric's historical behavior and to identify patterns and anomalies that indicate possible service issues. The feature supports both a simple UI and operations at scale, because alert rules can be configured through the Azure Resource Manager API in a fully automated manner.

Once an alert rule is created, it fires only when the monitored metric doesn't behave as expected, based on its tailored thresholds.

We would love to hear your feedback; please send it to the support team.

  1. Scalable Alerting - Dynamic Thresholds alert rules can create tailored thresholds for hundreds of metric series at a time, while remaining as easy to define as an alert rule on a single metric. They give you fewer alerts to create and manage. You can create them with either the Azure portal or the Azure Resource Manager API. The scalable approach is especially useful when dealing with metric dimensions or when applying a rule to multiple resources, such as all resources in a subscription. Learn more about how to configure Metric Alerts with Dynamic Thresholds using templates.

  2. Smart Metric Pattern Recognition - The ML technology automatically detects metric patterns and adapts to metric changes over time, which often include seasonality (hourly / daily / weekly). Adapting to a metric's behavior over time and alerting on deviations from its pattern relieves the burden of knowing the "right" threshold for each metric. The ML algorithm used in Dynamic Thresholds is designed to prevent noisy (low precision) or wide (low recall) thresholds that don't have an expected pattern.

  3. Intuitive Configuration - Dynamic Thresholds lets you set up metric alerts using high-level concepts, so you don't need extensive domain knowledge about the metric.

How to configure alert rules with Dynamic Thresholds?

Alerts with Dynamic Thresholds can be configured through Metric Alerts in Azure Monitor. Learn more about how to configure Metric Alerts.

How are the thresholds calculated?

Dynamic Thresholds continuously learns the data of the metric series and tries to model it using a set of algorithms and methods. It detects patterns in the data such as seasonality (hourly / daily / weekly), and it can handle noisy metrics (such as machine CPU or memory) as well as metrics with low dispersion (such as availability and error rate).

The thresholds are selected in such a way that a deviation from these thresholds indicates an anomaly in the metric behavior.

Note

Seasonal pattern detection is set to an hour, day, or week interval. This means that other patterns, such as a two-hour or half-week pattern, might not be detected.
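
To make the idea of a seasonal threshold concrete, here is a minimal Python sketch. It is not the algorithm Azure Monitor uses, which is not described here; it is a toy model that assumes a purely daily pattern and sets the band to the mean plus or minus a fixed number of standard deviations for each hour of the day. The function names and the `width` parameter are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def seasonal_bands(samples, width=3.0):
    """Build {hour_of_day: (lower, upper)} from (hour_of_day, value) samples.

    Toy model only: mean +/- width * population standard deviation per hour.
    """
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    bands = {}
    for hour, values in buckets.items():
        centre = mean(values)
        spread = pstdev(values)
        bands[hour] = (centre - width * spread, centre + width * spread)
    return bands


def is_anomaly(bands, hour, value):
    """True when the value falls outside the band learned for that hour."""
    lower, upper = bands[hour]
    return value < lower or value > upper


# Ten days of synthetic history: busy at 09:00, quiet at 03:00.
history = []
for day in range(10):
    history.append((9, 80 + (day % 3)))
    history.append((3, 10 + (day % 2)))

bands = seasonal_bands(history)
print(is_anomaly(bands, 9, 80))  # a value of 80 is normal at 09:00
print(is_anomaly(bands, 3, 80))  # the same value is anomalous at 03:00
```

The point of the sketch is that the same value can be normal at one time of day and anomalous at another, which a single static threshold cannot express.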

What does the 'Sensitivity' setting in Dynamic Thresholds mean?

Alert threshold sensitivity is a high-level concept that controls the amount of deviation from metric behavior required to trigger an alert. Unlike a static threshold, this option doesn't require domain knowledge about the metric. The options available are:

  • High - The thresholds will be tight and close to the metric series pattern. An alert rule will be triggered on the smallest deviation, resulting in more alerts.
  • Medium (default) - Less tight, more balanced thresholds that produce fewer alerts than high sensitivity.
  • Low - The thresholds will be loose, with more distance from the metric series pattern. An alert rule will trigger only on large deviations, resulting in fewer alerts.
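
The effect of the three options can be sketched as a band-width multiplier. The multipliers below are hypothetical values chosen for illustration; the actual values used by Azure Monitor are not stated in this article. Only the option names High, Medium, and Low come from the text above.

```python
# Hypothetical multipliers: a larger multiplier means a looser band.
WIDTH_BY_SENSITIVITY = {"High": 1.0, "Medium": 2.0, "Low": 3.0}


def thresholds(expected, spread, sensitivity="Medium"):
    """Return (lower, upper) around the expected value for a sensitivity."""
    width = WIDTH_BY_SENSITIVITY[sensitivity] * spread
    return expected - width, expected + width


def count_violations(values, expected, spread, sensitivity):
    """Count how many values fall outside the band."""
    lower, upper = thresholds(expected, spread, sensitivity)
    return sum(1 for v in values if v < lower or v > upper)


observed = [50, 53, 47, 56, 44, 60, 50]
counts = {
    s: count_violations(observed, expected=50, spread=2.5, sensitivity=s)
    for s in WIDTH_BY_SENSITIVITY
}
print(counts)  # {'High': 5, 'Medium': 3, 'Low': 1}
```

With the same data, high sensitivity flags the most points and low sensitivity the fewest, which matches the trade-off described in the list.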

What are the 'Operator' setting options in Dynamic Thresholds?

A single Dynamic Thresholds alert rule can create tailored thresholds, based on metric behavior, for both the upper and lower bounds. You can choose for the alert to be triggered on one of the following three conditions:

  • Greater than the upper threshold or lower than the lower threshold (default)
  • Greater than the upper threshold
  • Lower than the lower threshold
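
The three conditions can be written as a small check. The operator strings `GreaterOrLessThan`, `GreaterThan`, and `LessThan` are my understanding of the names used in the Azure Resource Manager schema and should be verified against the template reference; the function itself is only an illustrative sketch.

```python
def violates(value, lower, upper, operator="GreaterOrLessThan"):
    """Return True when the value breaches the band for the given operator."""
    if operator == "GreaterThan":
        return value > upper
    if operator == "LessThan":
        return value < lower
    if operator == "GreaterOrLessThan":
        return value > upper or value < lower
    raise ValueError(f"unknown operator: {operator}")


# A drop to 5 breaches only the lower bound of the band (10, 90).
print(violates(5, 10, 90))                  # default operator checks both bounds
print(violates(5, 10, 90, "GreaterThan"))   # upper bound only, so no violation
```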

What do the advanced settings in Dynamic Thresholds mean?

Failing Periods - Dynamic Thresholds also lets you configure "Number of violations to trigger the alert": the minimum number of deviations required within a certain time window for the system to raise an alert (by default, four deviations within a 20-minute window). You can choose what to be alerted on by changing the number of failing periods and the time window. This ability reduces alert noise generated by transient spikes. For example:

To trigger an alert when the issue is continuous for 20 minutes, that is, 4 consecutive times with a period of 5 minutes, use the following settings:

Failing periods settings for an issue that is continuous for 20 minutes, 4 consecutive times in a given period grouping of 5 minutes

To trigger an alert when a Dynamic Threshold was violated for 20 minutes out of the last 30 minutes, with a period of 5 minutes, use the following settings:

Failing periods settings for an issue that occurs for 20 minutes out of the last 30 minutes, with a period of 5 minutes
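
The two examples above amount to "at least M violations in the last N evaluation periods". A minimal sketch of that rule follows. The function and parameter names are illustrative, and returning False when the history is shorter than the window is an assumption of this sketch, not documented behavior.

```python
def should_fire(violations, evaluation_periods, min_failing_periods):
    """Decide whether to raise an alert.

    violations: list of booleans, oldest first, one per aggregation period.
    """
    window = violations[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False  # assumption: not enough history yet
    return sum(window) >= min_failing_periods


# One boolean per 5-minute period, covering the last 30 minutes.
recent = [False, True, False, True, True, True]

# Continuous for 20 minutes: 4 violations out of the last 4 periods.
print(should_fire(recent, evaluation_periods=4, min_failing_periods=4))

# 20 minutes out of the last 30: 4 violations out of the last 6 periods.
print(should_fire(recent, evaluation_periods=6, min_failing_periods=4))
```

In this data the last four periods contain only three violations, so the "continuous" setting stays quiet while the "20 out of 30 minutes" setting fires.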

Ignore data before - You can optionally define a start date from which the system should begin calculating the thresholds. A typical use case is a resource that was running in a testing mode and is now promoted to serve a production workload, so the behavior of any metric during the testing phase should be disregarded.

How do you find out why a Dynamic Thresholds alert was triggered?

You can explore triggered alert instances in the alerts view, either by clicking the link in the email or text message or by browsing to the alerts view in the Azure portal. Learn more about the alerts view.

The alerts view displays:

  • All the metric details at the moment the Dynamic Thresholds alert fired.
  • A chart of the period in which the alert was triggered, including the Dynamic Thresholds used at that point in time.
  • The ability to provide feedback on the Dynamic Thresholds alert and the alerts view experience, which could improve future detections.

Will slow behavior changes in the metric trigger an alert?

Probably not. Dynamic Thresholds are good for detecting significant deviations rather than slowly evolving issues.

How much data is used to preview and then calculate thresholds?

When an alert rule is first created, the thresholds shown in the chart are calculated from enough historical data to detect hourly or daily seasonal patterns (10 days). Once the alert rule is created, Dynamic Thresholds uses all the available historical data it needs and continuously learns and adapts based on new data to make the thresholds more accurate. This means that after this calculation, the chart will also display weekly patterns.

How much data is needed to trigger an alert?

If you have a new resource or missing metric data, Dynamic Thresholds won't trigger alerts until three days and at least 30 samples of metric data are available, to ensure accurate thresholds. For existing resources with sufficient metric data, Dynamic Thresholds can trigger alerts immediately.
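
The readiness rule can be sketched as a simple check. This sketch assumes that both conditions must hold (at least three days since the first sample and at least 30 samples); the function and constant names are illustrative.

```python
from datetime import datetime, timedelta

MIN_HISTORY = timedelta(days=3)
MIN_SAMPLES = 30


def can_trigger(sample_times, now):
    """True when there is enough metric history to evaluate the rule."""
    if len(sample_times) < MIN_SAMPLES:
        return False
    return now - min(sample_times) >= MIN_HISTORY


now = datetime(2024, 1, 10)
hourly_four_days = [now - timedelta(hours=h) for h in range(96)]
dense_one_day = [now - timedelta(minutes=30 * i) for i in range(40)]
sparse_five_days = [now - timedelta(days=d) for d in range(5)]

print(can_trigger(hourly_four_days, now))   # enough days and enough samples
print(can_trigger(dense_one_day, now))      # enough samples, too little history
print(can_trigger(sparse_five_days, now))   # enough history, too few samples
```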

Dynamic Thresholds best practices

Dynamic Thresholds can be applied to any platform or custom metric in Azure Monitor, and it has also been tuned for common application and infrastructure metrics. The following items are best practices for configuring alerts on some of these metrics using Dynamic Thresholds.

Dynamic Thresholds on virtual machine CPU percentage metrics

  1. In the Azure portal, click Monitor. The Monitor view consolidates all your monitoring settings and data in one view.

  2. Click Alerts, then click + New alert rule.

    Tip

    Most resource blades also have Alerts in their resource menu under Monitoring; you can create alerts from there as well.

  3. Click Select target, and in the context pane that loads, select the target resource that you want to alert on. Use the Subscription and 'Virtual Machines' Resource type drop-downs to find the resource you want to monitor. You can also use the search bar to find your resource.

  4. Once you have selected a target resource, click Add condition.

  5. Select 'CPU Percentage'.

  6. Optionally, refine the metric by adjusting Period and Aggregation. Using the 'Maximum' aggregation type is discouraged for this metric type because it is less representative of behavior. For the 'Maximum' aggregation type, a static threshold may be more appropriate.

  7. You will see a chart for the metric for the last 6 hours. Define the alert parameters:

    1. Condition Type - Choose the 'Dynamic' option.
    2. Sensitivity - Choose Medium or Low sensitivity to reduce alert noise.
    3. Operator - Choose 'Greater Than' unless the behavior represents the application usage.
    4. Frequency - Consider lowering the frequency based on the business impact of the alert.
    5. Failing Periods (Advanced Option) - The look-back window should be at least 15 minutes. For example, if the period is set to five minutes, the number of failing periods should be at least three.
  8. The metric chart will display the calculated thresholds based on recent data.

  9. Click Done.

  10. Fill in the Alert details, such as Alert Rule Name, Description, and Severity.

  11. Add an action group to the alert, either by selecting an existing action group or by creating a new one.

  12. Click Done to save the metric alert rule.

Note

Metric alert rules created through the portal are created in the same resource group as the target resource.
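
For rules created through the Azure Resource Manager API instead of the portal, the same recommendations can be captured in a criterion object. The sketch below builds one as a Python dictionary and enforces the 15-minute look-back recommendation from step 7. The property names (`criterionType`, `alertSensitivity`, `failingPeriods`, and so on) and the metric name `Percentage CPU` reflect my understanding of the metric alert template schema and should be checked against the current template reference. The rule name `HighCpu` and the helper function are hypothetical.

```python
import json


def cpu_criterion(period_minutes=5, evaluation_periods=4, min_failing=4):
    """Build a dynamic-threshold criterion for VM CPU (illustrative only)."""
    lookback_minutes = period_minutes * evaluation_periods
    if lookback_minutes < 15:
        raise ValueError("look-back window should be at least 15 minutes")
    if min_failing > evaluation_periods:
        raise ValueError("min_failing cannot exceed evaluation_periods")
    return {
        "criterionType": "DynamicThresholdCriterion",
        "name": "HighCpu",                 # hypothetical criterion name
        "metricName": "Percentage CPU",
        "timeAggregation": "Average",      # 'Maximum' is discouraged above
        "operator": "GreaterThan",
        "alertSensitivity": "Medium",
        "failingPeriods": {
            "numberOfEvaluationPeriods": evaluation_periods,
            "minFailingPeriodsToAlert": min_failing,
        },
    }


print(json.dumps(cpu_criterion(), indent=2))
```

Validating the look-back window in code catches a configuration such as two 5-minute periods before the template is deployed.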

Dynamic Thresholds on Application Insights HTTP request execution time

  1. In the Azure portal, click Monitor. The Monitor view consolidates all your monitoring settings and data in one view.

  2. Click Alerts, then click + New alert rule.

    Tip

    Most resource blades also have Alerts in their resource menu under Monitoring; you can create alerts from there as well.

  3. Click Select target, and in the context pane that loads, select the target resource that you want to alert on. Use the Subscription and 'Application Insights' Resource type drop-downs to find the resource you want to monitor. You can also use the search bar to find your resource.

  4. Once you have selected a target resource, click Add condition.

  5. Select 'HTTP request execution time'.

  6. Optionally, refine the metric by adjusting Period and Aggregation. Using the 'Maximum' aggregation type is discouraged for this metric type because it is less representative of behavior. For the 'Maximum' aggregation type, a static threshold may be more appropriate.

  7. You will see a chart for the metric for the last 6 hours. Define the alert parameters:

    1. Condition Type - Choose the 'Dynamic' option.
    2. Operator - Choose 'Greater Than' to reduce alerts fired when the duration improves.
    3. Frequency - Consider lowering the frequency based on the business impact of the alert.
  8. The metric chart will display the calculated thresholds based on recent data.

  9. Click Done.

  10. Fill in the Alert details, such as Alert Rule Name, Description, and Severity.

  11. Add an action group to the alert, either by selecting an existing action group or by creating a new one.

  12. Click Done to save the metric alert rule.

Note

Metric alert rules created through the portal are created in the same resource group as the target resource.

Interpreting Dynamic Threshold charts

Following is a chart showing a metric, its dynamic threshold limits, and some alerts fired when the value was outside of the allowed thresholds.


Use the following information to interpret the previous chart.

  • Blue line - The actual measured metric over time.
  • Blue shaded area - Shows the allowed range for the metric. As long as the metric values stay within this range, no alert will occur.
  • Blue dots - If you left-click part of the chart and then hover over the blue line, a blue dot appears under your cursor showing an individual aggregated metric value.
  • Pop-up with blue dot - Shows the measured metric value (the blue dot) and the upper and lower values of the allowed range.
  • Red dot with a black circle - Shows the first metric value outside the allowed range. This is the value that fires a metric alert and puts it in an active state.
  • Red dots - Indicate additional measured values outside the allowed range. They do not fire additional metric alerts, but the alert stays in the active state.
  • Red area - Shows the time when the metric value was outside the allowed range. The alert remains in the active state as long as subsequent measured values are outside the allowed range, but no new alerts are fired.
  • End of red area - When the blue line is back inside the allowed values, the red area stops and the measured value line turns blue. The status of the metric alert that fired at the time of the red dot with the black outline is set to resolved.
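
The alert lifecycle described by the chart legend can be sketched as a two-state machine. This is a simplified illustration that follows the legend literally: it evaluates each point on its own and ignores the failing periods setting, and the function name and event labels are assumptions.

```python
def alert_events(points):
    """Return a list of (index, event) pairs for a series of measurements.

    points: iterable of (value, lower, upper) tuples, oldest first.
    """
    events = []
    active = False
    for index, (value, lower, upper) in enumerate(points):
        outside = value < lower or value > upper
        if outside and not active:
            events.append((index, "fired"))     # red dot with a black circle
            active = True
        elif not outside and active:
            events.append((index, "resolved"))  # end of the red area
            active = False
        # Further out-of-range points (plain red dots) add no new events.
    return events


series = [
    (50, 40, 60),  # inside the allowed range
    (65, 40, 60),  # first value outside: alert fires
    (70, 40, 60),  # still outside: alert stays active
    (55, 40, 60),  # back inside: alert resolves
    (30, 40, 60),  # below the range: a new alert fires
]
print(alert_events(series))  # [(1, 'fired'), (3, 'resolved'), (4, 'fired')]
```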