Troubleshoot log alerts in Azure Monitor

This article shows you how to resolve common issues with log alerts in Azure Monitor. It also provides solutions to common problems with the functionality and configuration of log alerts.

The term log alerts describes rules that fire based on a log query in an Azure Log Analytics workspace or in Azure Application Insights. Learn more about functionality, terminology, and types in Log alerts in Azure Monitor.

Note

This article doesn't consider cases where the Azure portal shows that an alert rule was triggered but the associated action group didn't send a notification. For such cases, see the details in Create and manage action groups in the Azure portal.

Data ingestion time for logs

A log alert periodically runs your query against Log Analytics or Application Insights. Because Azure Monitor processes many terabytes of data from thousands of customers and varied sources across the world, the service is susceptible to varying time delays. For more information, see Data ingestion time in Azure Monitor logs.

To mitigate delays, the system waits and retries the alert query multiple times if it finds that the needed data is not yet ingested. The wait time between retries increases exponentially. The log alert is triggered only after the data is available, so a late alert might be due to slow ingestion of log data.
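As a rough illustration of an exponentially increasing wait, the following sketch computes the delay before each retry. The base delay, the growth factor, and the number of attempts are hypothetical values chosen for the example; the article does not state the intervals that Azure Monitor actually uses.

```python
from typing import List


def retry_wait_times(base_seconds: float, attempts: int, factor: float = 2.0) -> List[float]:
    """Return the wait before each retry, growing by `factor` every time.

    All three parameters are illustrative assumptions, not service settings.
    """
    return [base_seconds * factor ** attempt for attempt in range(attempts)]


waits = retry_wait_times(base_seconds=30, attempts=4)
print(waits)  # [30.0, 60.0, 120.0, 240.0]
```

Each retry waits longer than the one before, which is why an alert on slowly ingested data can fire noticeably later than the scheduled run.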

Incorrect time period configured

As described in the article on terminology for log alerts, the time period stated in the configuration specifies the time range for the query. The query returns only records that were created within this range.

The time period restricts the data fetched for a log query to prevent abuse, and it overrides any time command (like ago) used in a log query. For example, if the time period is set to 60 minutes and the query runs at 1:15 PM, only records created between 12:15 PM and 1:15 PM are used for the log query. If the log query uses a time command like ago(1d), the query still uses only data between 12:15 PM and 1:15 PM because the time period is set to that interval.
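The 60-minute example can be reproduced with a small sketch. It models the data a query sees as the overlap between the configured time period and the range requested by a time command such as ago(1d). The function name and the overlap model are assumptions made for illustration, not the service's implementation.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def effective_window(run_time: datetime, period_minutes: int,
                     ago: Optional[timedelta] = None) -> Tuple[datetime, datetime]:
    """Return the (start, end) range of records the query can actually use."""
    period_start = run_time - timedelta(minutes=period_minutes)
    if ago is None:
        return period_start, run_time
    query_start = run_time - ago
    # Assumption: the configured time period caps how far back the query can reach.
    return max(period_start, query_start), run_time


run = datetime(2019, 3, 22, 13, 15)
start, end = effective_window(run, period_minutes=60, ago=timedelta(days=1))
print(start.time(), end.time())  # 12:15:00 13:15:00
```

With the time period raised to 1,440 minutes, the same call returns the full 24-hour range that ago(1d) asks for.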

Check that the time period in the configuration matches your query. For the example shown earlier, if the log query uses ago(1d) (marked in green), the time period should be set to 24 hours or 1,440 minutes (marked in red). This setting ensures that the query runs as intended.

Screenshot: Time period

Suppress Alerts option is set

As described in step 8 of the article on creating a log alert rule in the Azure portal, log alerts provide a Suppress Alerts option to suppress triggering and notification actions for a configured amount of time. As a result, you might think that an alert didn't fire. In fact, it did fire but was suppressed.

Screenshot: Suppress alerts

Metric measurement alert rule is incorrect

Metric measurement log alerts are a subtype of log alerts that have special capabilities and a restricted alert query syntax. A rule for a metric measurement log alert requires the query output to be a metric time series. That is, the output is a table with distinct, equally sized time periods along with corresponding aggregated values.

You can choose to have additional variables in the table alongside AggregatedValue. These variables can be used to sort the table.

For example, suppose a rule for a metric measurement log alert was configured as:

  • Query of search * | summarize AggregatedValue = count() by $table, bin(timestamp, 1h)
  • Time period of 6 hours
  • Threshold of 50
  • Alert logic of three consecutive breaches
  • Aggregate Upon chosen as $table

Because the command includes summarize ... by and provides two variables (timestamp and $table), the system chooses $table for Aggregate Upon. The system sorts the result table by the $table field, as shown in the following screenshot. Then it looks at the multiple AggregatedValue instances for each table type (like availabilityResults) to see if there were three or more consecutive breaches.

Screenshot: Metric measurement query execution with multiple values

Because Aggregate Upon is defined on $table, the data is sorted on the $table column (indicated in red). Then we group and look for types of the Aggregate Upon field.

For example, for $table, values for availabilityResults are considered as one plot/entity (indicated in orange). Within this plot/entity, the alert service checks for three consecutive breaches (indicated in green). The breaches trigger an alert for the table value availabilityResults.

Similarly, if three consecutive breaches happen for any other value of $table, another alert notification is triggered for that value. The alert service automatically sorts the values in one plot/entity (indicated in orange) by time.
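The grouping and consecutive-breach check described above can be sketched as follows. This is a simplified model, not the alert service's code: the function name, the sample counts, and the choice of "greater than" as the breach condition are assumptions for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Row = Tuple[str, str, int]  # (Aggregate Upon value, time bin, AggregatedValue)


def entities_with_consecutive_breaches(rows: Iterable[Row], threshold: int,
                                       required: int) -> List[str]:
    """Return the entities that have `required` breaches in a row."""
    by_entity = defaultdict(list)
    for entity, time_bin, value in rows:
        by_entity[entity].append((time_bin, value))

    fired = []
    for entity, points in by_entity.items():
        run = 0
        # Values inside one plot/entity are evaluated in time order.
        for _, value in sorted(points):
            run = run + 1 if value > threshold else 0  # assumed "greater than" condition
            if run >= required:
                fired.append(entity)
                break
    return sorted(fired)


# Hypothetical hourly counts for two values of $table over a 6-hour time period.
hours = ["2018-10-17T0%d:00:00Z" % h for h in range(1, 7)]
counts = {
    "availabilityResults": [60, 70, 80, 10, 55, 20],
    "requests": [60, 10, 70, 20, 80, 30],
}
rows = [(table, ts, v) for table, vals in counts.items() for ts, v in zip(hours, vals)]

print(entities_with_consecutive_breaches(rows, threshold=50, required=3))
# ['availabilityResults']
```

In this sample, requests breaches the threshold three times but never consecutively, so only availabilityResults produces an alert.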

Now suppose that the rule for the metric measurement log alert was modified and the query was search * | summarize AggregatedValue = count() by bin(timestamp, 1h). The rest of the configuration remained the same as before, including the alert logic for three consecutive breaches. The Aggregate Upon option in this case is timestamp by default, because the query provides only one value for summarize ... by (that is, timestamp). As in the earlier example, the output at the end of execution would look like the following.

Screenshot: Metric measurement query execution with a single value

Because Aggregate Upon is defined on timestamp, the data is sorted on the timestamp column (indicated in red). Then we group by timestamp. For example, values for 2018-10-17T06:00:00Z are considered as one plot/entity (indicated in orange). Within this plot/entity, the alert service finds no consecutive breaches, because each timestamp value has only one entry. So the alert is never triggered. In such a case, the user must either:

  • Add a dummy variable or an existing variable (like $table) to correctly sort by using the Aggregate Upon field.
  • Reconfigure the alert rule to use alert logic based on total breaches instead.
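To see why grouping on timestamp can never produce consecutive breaches, consider this minimal sketch. The sample values and the "greater than 50" breach condition are assumptions for illustration. Every group holds a single value, so the longest run inside any group is 1, while a count of total breaches over the same rows reaches 3.

```python
from itertools import groupby
from typing import Iterable


def longest_breach_run(values: Iterable[int], threshold: int) -> int:
    """Length of the longest streak of values above the threshold."""
    longest = run = 0
    for value in values:
        run = run + 1 if value > threshold else 0
        longest = max(longest, run)
    return longest


# Hypothetical output of: summarize AggregatedValue = count() by bin(timestamp, 1h)
rows = [
    ("2018-10-17T06:00:00Z", 90),
    ("2018-10-17T07:00:00Z", 95),
    ("2018-10-17T08:00:00Z", 99),
]

# Aggregate Upon = timestamp: each timestamp becomes its own plot/entity.
per_timestamp = {
    ts: [value for _, value in group]
    for ts, group in groupby(sorted(rows), key=lambda row: row[0])
}
runs = {ts: longest_breach_run(vals, 50) for ts, vals in per_timestamp.items()}
print(max(runs.values()))  # 1, so "three consecutive breaches" is never met

# Alert logic based on total breaches looks at all rows together instead.
total_breaches = sum(value > 50 for _, value in rows)
print(total_breaches)  # 3
```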

Log alert fired unnecessarily

A configured log alert rule in Azure Monitor might be triggered unexpectedly when you view it in Azure Alerts. The following sections describe some common reasons.

Alert triggered by partial data

Log Analytics and Application Insights are subject to ingestion and processing delays. When you run a log alert query, you might find that no data is available or that only some data is available. For more information, see Log data ingestion time in Azure Monitor.

Depending on how you configured the alert rule, misfiring might happen if there's no data or only partial data in the logs at the time of alert execution. In such cases, we advise you to change the alert query or configuration.

For example, if you configure the log alert rule to be triggered when the number of results from an analytics query is less than 5, the alert is triggered when there's no data (zero records) or partial results (one record). But after the data ingestion delay, the same query with full data might provide a result of 10 records.
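The effect is easy to reproduce with the numbers from this example. The sketch below applies the "fewer than 5 results" condition to the three states of the same query; the function name is hypothetical.

```python
def fires_when_fewer_than(result_count: int, threshold: int = 5) -> bool:
    """Alert condition from the example: number of results is less than the threshold."""
    return result_count < threshold


for label, count in [("no data yet", 0), ("partial data", 1), ("full data", 10)]:
    print(label, fires_when_fewer_than(count))
# no data yet True
# partial data True
# full data False
```

Only the run against fully ingested data reflects the real state, which is why a condition that fires on low counts is especially sensitive to ingestion delay.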

Alert query output is misunderstood

You provide the logic for log alerts in an analytics query. The analytics query can use various big data and mathematical functions. The alert service runs your query at the specified interval, using data from the specified time period. The alert service makes subtle changes to the query based on the alert type. You can view this change in the Query to be executed section on the Configure signal logic screen:

Screenshot: Query to be executed

The Query to be executed box shows what the log alert service runs. If you want to understand what the alert query output might be before you create the alert, you can run the stated query and the timespan via the Analytics portal or the Analytics API.

Log alert was disabled

The following sections list some reasons why Azure Monitor might disable the log alert rule.

Resource where the alert was created no longer exists

Log alert rules created in Azure Monitor target a specific resource like an Azure Log Analytics workspace, an Azure Application Insights app, or an Azure resource. The log alert service then runs the analytics query provided in the rule against the specified target. But after rule creation, users often delete the target of the log alert rule from Azure, or move it within Azure. Because the target of the alert rule is no longer valid, execution of the rule fails.

In such cases, Azure Monitor disables the log alert so that you're not billed unnecessarily when the rule has been unable to run for a sizable period (like a week). You can find the exact time when Azure Monitor disabled the log alert in Azure Activity Log, where an event is added when Azure Monitor disables the log alert rule.

The following sample event in Azure Activity Log is for an alert rule that has been disabled because of a continual failure.

{
    "caller": "Microsoft.Insights/ScheduledQueryRules",
    "channels": "Operation",
    "claims": {
        "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/spn": "Microsoft.Insights/ScheduledQueryRules"
    },
    "correlationId": "abcdefg-4d12-1234-4256-21233554aff",
    "description": "Alert: test-bad-alerts is disabled by the System due to : Alert has been failing consistently with the same exception for the past week",
    "eventDataId": "f123e07-bf45-1234-4565-123a123455b",
    "eventName": {
        "value": "",
        "localizedValue": ""
    },
    "category": {
        "value": "Administrative",
        "localizedValue": "Administrative"
    },
    "eventTimestamp": "2019-03-22T04:18:22.8569543Z",
    "id": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
    "level": "Informational",
    "operationId": "",
    "operationName": {
        "value": "Microsoft.Insights/ScheduledQueryRules/disable/action",
        "localizedValue": "Microsoft.Insights/ScheduledQueryRules/disable/action"
    },
    "resourceGroupName": "<Resource Group>",
    "resourceProviderName": {
        "value": "MICROSOFT.INSIGHTS",
        "localizedValue": "Microsoft Insights"
    },
    "resourceType": {
        "value": "MICROSOFT.INSIGHTS/scheduledqueryrules",
        "localizedValue": "MICROSOFT.INSIGHTS/scheduledqueryrules"
    },
    "resourceId": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
    "status": {
        "value": "Succeeded",
        "localizedValue": "Succeeded"
    },
    "subStatus": {
        "value": "",
        "localizedValue": ""
    },
    "submissionTimestamp": "2019-03-22T04:18:22.8569543Z",
    "subscriptionId": "<SubscriptionId>",
    "properties": {
        "resourceId": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
        "subscriptionId": "<SubscriptionId>",
        "resourceGroup": "<ResourceGroup>",
        "eventDataId": "12e12345-12dd-1234-8e3e-12345b7a1234",
        "eventTimeStamp": "03/22/2019 04:18:22",
        "issueStartTime": "03/22/2019 04:18:22",
        "operationName": "Microsoft.Insights/ScheduledQueryRules/disable/action",
        "status": "Succeeded",
        "reason": "Alert has been failing consistently with the same exception for the past week"
    },
    "relatedEvents": []
}
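If you export such events as JSON, a short script can pull out when and why a rule was disabled. The following is a minimal sketch that relies only on fields visible in the sample event; the function name is hypothetical, and the embedded event is a trimmed copy of the sample.

```python
import json
from typing import Optional

DISABLE_OPERATION = "Microsoft.Insights/ScheduledQueryRules/disable/action"


def disabled_rule_info(event: dict) -> Optional[dict]:
    """Return rule name, disable time, and reason, or None for unrelated events."""
    if event.get("operationName", {}).get("value") != DISABLE_OPERATION:
        return None
    return {
        "rule": event["resourceId"].rsplit("/", 1)[-1],
        "disabled_at": event["eventTimestamp"],
        "reason": event.get("properties", {}).get("reason", ""),
    }


# Trimmed copy of the sample event, keeping only the fields used here.
sample = json.loads("""
{
    "eventTimestamp": "2019-03-22T04:18:22.8569543Z",
    "operationName": {
        "value": "Microsoft.Insights/ScheduledQueryRules/disable/action"
    },
    "resourceId": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
    "properties": {
        "reason": "Alert has been failing consistently with the same exception for the past week"
    }
}
""")

info = disabled_rule_info(sample)
print(info["rule"], info["disabled_at"])
# TEST-BAD-ALERTS 2019-03-22T04:18:22.8569543Z
```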

Query used in a log alert is not valid

Each log alert rule created in Azure Monitor must specify, as part of its configuration, an analytics query that the alert service runs periodically. The analytics query might have correct syntax at the time of rule creation or update. But sometimes, over a period of time, the query provided in the log alert rule can develop syntax issues and cause the rule execution to fail. Some common reasons why an analytics query provided in a log alert rule can develop errors are:

Azure Advisor warns you about this behavior. A recommendation is added for the specific log alert rule in Azure Advisor, under the High Availability category, with medium impact and the description "Repair your log alert rule to ensure monitoring".

Note

If an alert query in the log alert rule isn't rectified within seven days after Azure Advisor provides the recommendation, Azure Monitor disables the log alert so that you're not billed unnecessarily when the rule has been unable to run for a sizable period (7 days). You can find the exact time when Azure Monitor disabled the log alert rule by looking for an event in Azure Activity Log.

Alert rule quota was reached

The number of log search alert rules per subscription and resource is subject to the quota limits described here.

If you have reached the quota limit, the following steps may help resolve the issue.

  1. Try deleting or disabling log search alert rules that aren't used anymore.

  2. If you need the quota limit to be increased, open a support request and provide the following information:

    • Subscription ID(s) for which the quota limit needs to be increased
    • Reason for quota increase
    • Resource type for the quota increase: Log Analytics, Application Insights, etc.
    • Requested quota limit

To check the current usage of new log alert rules

From the Azure portal

  1. Open the Alerts screen, and click Manage alert rules
  2. Filter to the relevant subscription using the Subscription dropdown control
  3. Make sure NOT to filter to a specific resource group, resource type, or resource
  4. In the Signal type dropdown control, select 'Log Search'
  5. Verify that the Status dropdown control is set to 'Enabled'
  6. The total number of log search alert rules is displayed above the rules list

From the API

Next steps