Log alerts in Azure Monitor

This article provides details of log alerts, one of the alert types supported in Azure Alerts. Log alerts let users use Azure's analytics platform as the basis for alerting.

A log alert consists of log query rules created for Azure Monitor. To learn more about its usage, see creating log alerts in Azure.

Note

Popular log data from Azure Monitor is now also available on the metric platform in Azure Monitor. For details, see Metric Alert for Logs.

Log search alert rule - definition and types

Log search rules are created by Azure Alerts to automatically run specified log queries at regular intervals. If the results of the log query match particular criteria, an alert record is created. The rule can then automatically run one or more actions using Action Groups. Creating, modifying, and updating log alerts may require the Azure Monitoring Contributor role, along with access and query execution rights for the analytics targets in the alert rule or alert query. If the user creating the rule doesn't have access to all analytics targets in the alert rule or alert query, rule creation may fail, or the log alert rule will run with partial results.

Log search rules are defined by the following details:

  • Log Query. The query that runs each time the alert rule is evaluated. The records returned by this query are used to determine whether an alert is to be triggered. Specific analytic commands and combinations are incompatible with log alerts; for details, see Log alert queries in Azure Monitor.

  • Time Period. Specifies the time range for the query. The query returns only records that were created within this range of the current time. The time period restricts the data fetched for the log query to prevent abuse, and it overrides any time command (such as ago) used in the log query.
    For example, if the time period is set to 60 minutes and the query runs at 1:15 PM, only records created between 12:15 PM and 1:15 PM are available to the log query. If the log query uses a time command such as ago(7d), it still runs only on data between 12:15 PM and 1:15 PM, as if data existed for only the past 60 minutes, and not on the seven days of data specified in the log query.

  • Frequency. Specifies how often the query should be run. Can be any value between 5 minutes and 24 hours. Should be equal to or less than the time period. If the value is greater than the time period, you risk records being missed.
    For example, consider a time period of 30 minutes and a frequency of 60 minutes. If the query runs at 1:00 PM, it returns records between 12:30 PM and 1:00 PM. The next run is at 2:00 PM, when it returns records between 1:30 PM and 2:00 PM. Any records created between 1:00 PM and 1:30 PM are never evaluated.

  • Threshold. The results of the log search are evaluated to determine whether an alert should be created. The threshold is different for the different types of log search alert rules.
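
The interaction between time period and frequency described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Azure Monitor schedules queries; the function names and the calendar date are hypothetical.

```python
from datetime import datetime, timedelta


def evaluation_windows(first_run, period, frequency, runs):
    """Return the (start, end) window examined by each scheduled run."""
    windows = []
    for i in range(runs):
        run_time = first_run + i * frequency
        windows.append((run_time - period, run_time))
    return windows


def uncovered_gaps(windows):
    """Return the ranges between consecutive windows that no run examines."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(windows, windows[1:]):
        if next_start > prev_end:
            gaps.append((prev_end, next_start))
    return gaps


# Hypothetical date; only the clock times matter for the illustration.
first_run = datetime(2024, 1, 1, 13, 0)

# Time period of 30 minutes with a frequency of 60 minutes, as in the example.
windows = evaluation_windows(
    first_run, timedelta(minutes=30), timedelta(minutes=60), runs=2
)
gaps = uncovered_gaps(windows)
for start, end in gaps:
    print(f"never evaluated: {start:%H:%M} to {end:%H:%M}")
```

With a frequency equal to or less than the time period, `uncovered_gaps` returns an empty list, which is why the frequency should not exceed the time period.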

Log query rules for Azure Monitor can be of two types. Each type is described in detail in the sections that follow.

  • Number of results. A single alert is created when the number of records returned by the log search exceeds a specified number.
  • Metric measurement. An alert is created for each object in the results of the log search with values that exceed a specified threshold.

The differences between alert rule types are as follows.

  • Number of results alert rules always create a single alert, while Metric measurement alert rules create an alert for each object that exceeds the threshold.
  • Number of results alert rules create an alert when the threshold is exceeded a single time. Metric measurement alert rules can create an alert when the threshold is exceeded a certain number of times over a particular time interval.

Number of results alert rules

Number of results alert rules create a single alert when the number of records returned by the search query exceeds the specified threshold. This type of alert rule is ideal for working with events such as Windows event logs, Syslog, WebApp Response, and Custom logs. You may want to create an alert when a particular error event is created, or when multiple error events are created within a particular time period.

Threshold: The threshold for a Number of results alert rule is greater than or less than a particular value. If the number of records returned by the log search matches this criterion, an alert is created.

To alert on a single event, set the number of results to greater than 0 and check for the occurrence of a single event that was created since the last time the query was run. Some applications may log an occasional error that shouldn't necessarily raise an alert. For example, the application may retry the process that created the error event and then succeed the next time. In this case, you may not want to create the alert unless multiple events are created within a particular time period.

In some cases, you may want to create an alert in the absence of an event. For example, a process may log regular events to indicate that it's working properly. If it doesn't log one of these events within a particular time period, an alert should be created. In this case, you would set the threshold to less than 1.

Example of Number of Records type log alert

Consider a scenario where you want to know when your web-based app returns a response to users with code 500, that is, Internal Server Error. You would create an alert rule with the following details:

  • Query: requests | where resultCode == "500"
  • Time period: 30 minutes
  • Alert frequency: five minutes
  • Threshold value: Greater than 0

The alert would then run the query every 5 minutes, over 30 minutes of data, to look for any record where the result code was 500. If even one such record is found, it fires the alert and triggers the configured action.
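
The threshold check for a Number of results rule amounts to one comparison on the record count. The Python sketch below illustrates that logic only; the function and the condition names are hypothetical and are not part of any Azure API.

```python
import operator

# Hypothetical mapping from condition names to comparisons.
CONDITIONS = {"greater than": operator.gt, "less than": operator.lt}


def number_of_results_alert(record_count, condition, threshold):
    """Return True when the record count satisfies the rule's threshold."""
    return CONDITIONS[condition](record_count, threshold)


# Error alert from the example: fire when any 500 response is found.
error_alert = number_of_results_alert(3, "greater than", 0)

# Absence-of-event alert: fire when no expected event was logged.
missing_event_alert = number_of_results_alert(0, "less than", 1)

print(error_alert, missing_event_alert)
```

Both calls evaluate to True: the first because three matching records exceed 0, the second because zero records is less than 1.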

Metric measurement alert rules

Metric measurement alert rules create an alert for each object in a query with a value that exceeds a specified threshold and meets a specified trigger condition. Unlike Number of results alert rules, Metric measurement alert rules work when the analytics result provides a time series. They differ from Number of results alert rules in the following ways.

  • Aggregate function: Determines the calculation that is performed and potentially a numeric field to aggregate. For example, count() returns the number of records in the query, and avg(CounterValue) returns the average of the CounterValue field over the interval. The aggregate function in the query must be named AggregatedValue and must provide a numeric value.

  • Group Field: A record with an aggregated value is created for each instance of this field, and an alert can be generated for each. For example, if you wanted to generate an alert for each computer, you would use by Computer. If multiple group fields are specified in the alert query, the user can specify which field is used to sort results by using the Aggregate On (metricColumn) parameter.

    Note

    The Aggregate On (metricColumn) option is available for Metric Measurement type log alerts for Application Insights only.

  • Interval: Defines the time interval over which the data is aggregated. For example, if you specified five minutes, a record would be created for each instance of the group field, aggregated at 5-minute intervals over the time period specified for the alert.

    Note

    The bin function must be used in the query to specify the interval. Because bin() can result in unequal time intervals, the alert automatically converts the bin command to a bin_at command with an appropriate time at runtime, to ensure results with a fixed point. The Metric measurement type of log alert is designed to work with queries that have up to three instances of the bin() command.

  • Threshold: The threshold for Metric measurement alert rules is defined by an aggregate value and a number of breaches. If any data point in the log search exceeds this value, it's considered a breach. If the number of breaches for any object in the results exceeds the specified value, an alert is created for that object.

Misconfiguration of the Aggregate On or metricColumn option can cause alert rules to misfire. For more information, see troubleshooting when metric measurement alert rule is incorrect.

Example of Metric Measurement type log alert

Consider a scenario where you want an alert if any computer exceeds a processor utilization of 90% three times over 30 minutes. You would create an alert rule with the following details:

  • Query: Perf | where ObjectName == "Processor" and CounterName == "% Processor Time" | summarize AggregatedValue = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
  • Time period: 30 minutes
  • Alert frequency: five minutes
  • Alert Logic - Condition & Threshold: Greater than 90
  • Group Field (Aggregate-on): Computer
  • Trigger alert based on: Total breaches Greater than 2

The query would create an average value for each computer at 5-minute intervals. This query would be run every 5 minutes for data collected over the previous 30 minutes. Because the Group Field (Aggregate-on) chosen is the column 'Computer', the AggregatedValue is split by the values of 'Computer', and the average processor utilization of each computer is determined for each 5-minute time bin. A sample query result for three computers would look as follows.

TimeGenerated [UTC]     Computer             AggregatedValue
20xx-xx-xxT01:00:00Z    srv01.contoso.com    72
20xx-xx-xxT01:00:00Z    srv02.contoso.com    91
20xx-xx-xxT01:00:00Z    srv03.contoso.com    83
...                     ...                  ...
20xx-xx-xxT01:30:00Z    srv01.contoso.com    88
20xx-xx-xxT01:30:00Z    srv02.contoso.com    84
20xx-xx-xxT01:30:00Z    srv03.contoso.com    92

If the query result were plotted, it would appear as follows.

[Figure: Sample query result]

In this example, we see the average processor utilization of each of the three computers, computed in 5-minute bins. srv01 breaches the threshold of 90 only once, at the 1:25 bin. In comparison, srv02 exceeds the threshold at the 1:10, 1:15, and 1:25 bins, while srv03 exceeds it at 1:10, 1:15, 1:20, and 1:30. Because the alert is configured to trigger when total breaches are more than two, only srv02 and srv03 meet the criteria. Hence separate alerts would be created for srv02 and srv03, since they breached the 90% threshold more than twice across multiple time bins. If the Trigger alert based on: parameter were instead configured for the Continuous breaches option, an alert would be fired only for srv03, since it breached the threshold for three consecutive time bins from 1:10 to 1:20. No alert would be fired for srv02, as it breached the threshold for only two consecutive time bins, from 1:10 to 1:15.
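
The difference between the Total breaches and Continuous breaches options can be sketched in Python. This is a minimal illustration, not the service's implementation. The function names are hypothetical, and the per-bin values (bins 1:05 through 1:30) are invented so that the breaches fall in the bins named in the text.

```python
def total_breaches(values, threshold):
    """Count the data points that exceed the threshold."""
    return sum(1 for value in values if value > threshold)


def longest_consecutive_breaches(values, threshold):
    """Return the length of the longest run of consecutive breaches."""
    longest = current = 0
    for value in values:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def computers_to_alert(series, threshold, breach_limit, consecutive=False):
    """Return the computers whose breach count is greater than breach_limit."""
    count = longest_consecutive_breaches if consecutive else total_breaches
    return sorted(
        name for name, values in series.items()
        if count(values, threshold) > breach_limit
    )


# Hypothetical AggregatedValue per 5-minute bin, from 1:05 to 1:30.
series = {
    "srv01": [72, 80, 85, 88, 93, 88],  # breach at 1:25
    "srv02": [85, 92, 94, 89, 91, 84],  # breaches at 1:10, 1:15, 1:25
    "srv03": [83, 91, 95, 93, 88, 92],  # breaches at 1:10, 1:15, 1:20, 1:30
}

print(computers_to_alert(series, 90, 2))                    # ['srv02', 'srv03']
print(computers_to_alert(series, 90, 2, consecutive=True))  # ['srv03']
```

Only the counting function changes between the two options; the threshold and the breach limit stay the same.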

Log search alert rule - firing and state

Log search alert rules work only on the logic you build into the query. The alert system has no other context about the state of the system, your intent, or the root cause implied by the query. As such, log alerts are referred to as stateless. The conditions are evaluated as "TRUE" or "FALSE" each time they are run. An alert fires each time the evaluation of the alert condition is "TRUE", regardless of whether it fired previously.

Let's see this behavior in action with a practical example. Assume we have a log alert rule called Contoso-Log-Alert, which is configured as shown in the example provided for the Number of Results type log alert. The condition is a custom alert query designed to look for the 500 result code in logs. If one or more 500 result codes are found in logs, the condition of the alert is true.

At each interval below, the Azure alerts system evaluates the condition for Contoso-Log-Alert.

Time       Records returned by log search query    Log condition evaluation    Result
1:05 PM    0 records                               0 is not > 0, so FALSE      Alert does not fire. No actions called.
1:10 PM    2 records                               2 > 0, so TRUE              Alert fires and action groups called. Alert state ACTIVE.
1:15 PM    5 records                               5 > 0, so TRUE              Alert fires and action groups called. Alert state ACTIVE.
1:20 PM    0 records                               0 is not > 0, so FALSE      Alert does not fire. No actions called. Alert state left ACTIVE.

Using the previous case as an example:

At 1:15 PM, Azure alerts can't determine whether the underlying issues seen at 1:10 PM persist, or whether the records are net new failures or repeats of the older failures from 1:10 PM. The query provided by the user may or may not take earlier records into account, and the system doesn't know. The Azure alerts system is built to err on the side of caution, so it fires the alert and associated actions again at 1:15 PM.

At 1:20 PM, when zero records with the 500 result code are seen, Azure alerts can't be certain that the cause of the 500 result codes seen at 1:10 PM and 1:15 PM is now solved. It doesn't know whether the 500 errors will happen again for the same reasons. Hence Contoso-Log-Alert does not change to Resolved in the Azure Alert dashboard, and no notification is sent stating that the alert is resolved.
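
The stateless behavior walked through above can be mimicked with a short Python sketch: each run is judged only on its own record count, with no memory of earlier runs. The function name is hypothetical and the record counts are the ones from the Contoso-Log-Alert walkthrough.

```python
def evaluate_runs(record_counts, threshold=0):
    """Evaluate each run independently: fire whenever count > threshold."""
    results = []
    for time, count in record_counts:
        fired = count > threshold  # no state carried over from earlier runs
        results.append((time, count, fired))
    return results


# Record counts from the Contoso-Log-Alert walkthrough.
runs = [("1:05 PM", 0), ("1:10 PM", 2), ("1:15 PM", 5), ("1:20 PM", 0)]
results = evaluate_runs(runs)

for time, count, fired in results:
    print(time, count, "fires" if fired else "does not fire")
```

The alert fires at both 1:10 PM and 1:15 PM because nothing in the evaluation records that it already fired, and the 1:20 PM run produces no "resolved" signal, only the absence of a new firing.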

Pricing and Billing of Log Alerts

Pricing applicable to log alerts is stated on the Azure Monitor Pricing page. In Azure bills, log alerts are represented as type microsoft.insights/scheduledqueryrules, with:

  • Log alerts on Application Insights shown with the exact alert name along with resource group and alert properties
  • Log alerts on Log Analytics shown with the exact alert name along with resource group and alert properties, when created using the scheduledQueryRules API

The legacy Log Analytics API has alert actions and schedules as part of a Log Analytics Saved Search, not as proper Azure resources. The hidden pseudo alert rules created for billing on microsoft.insights/scheduledqueryrules are shown as <WorkspaceName>|<savedSearchId>|<scheduleId>|<ActionId>, along with resource group and alert properties.

Note

If invalid characters such as <, >, %, &, \, ?, / are present, they are replaced with _ in the hidden pseudo alert rule name and hence also in the Azure bill.
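
The character replacement described in this note can be illustrated with a small Python sketch. It assumes that the characters listed above are the full set of invalid characters, which the text does not guarantee ("such as"); the function name and the sample rule name are hypothetical.

```python
# Assumption: only the characters listed in the note are treated as invalid.
INVALID_CHARS = "<>%&\\?/"


def sanitize_rule_name(name):
    """Replace each invalid character with an underscore."""
    return "".join("_" if ch in INVALID_CHARS else ch for ch in name)


sanitized = sanitize_rule_name("web/prod?alert")
print(sanitized)  # web_prod_alert
```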

To remove the hidden scheduledQueryRules resources created for billing of alert rules that use the legacy Log Analytics API, the user can do any of the following:

Additionally, any modification operation such as PUT will fail on the hidden scheduledQueryRules resources created for billing of alert rules that use the legacy Log Analytics API, because the microsoft.insights/scheduledqueryrules pseudo rules exist only for billing the alert rules created with the legacy Log Analytics API.

Next steps