Troubleshoot log alerts in Azure Monitor

This article shows you how to resolve common issues with log alerts in Azure Monitor. It also provides solutions to common problems with the functionality and configuration of log alerts.

Log alerts let you use a Log Analytics query to evaluate resource logs at a set frequency and fire an alert based on the results. Rules can trigger one or more actions using Action Groups. Learn more about the functionality and terminology of log alerts.

Note

This article doesn't cover cases where the Azure portal shows that an alert rule triggered but the associated action group didn't send a notification. For such cases, see the details on troubleshooting here.

Log alert didn't fire

Data ingestion time for logs

Azure Monitor processes terabytes of customers' logs from across the world, which can cause log ingestion latency.

Logs are semi-structured data and inherently more latent than metrics. If you're experiencing more than a 4-minute delay in fired alerts, consider using metric alerts. You can send data from logs to the metric store using metric alerts for logs.

The system retries the alert evaluation multiple times to mitigate latency. Once the data arrives, the alert fires, so in most cases the alert fire time doesn't equal the log record time.
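To judge whether the delay described above affects you, compare the log record time with the time the alert fired. The following is a minimal sketch, not part of Azure Monitor: the function names and sample timestamps are hypothetical, and only the 4-minute threshold comes from the text.

```python
from datetime import datetime, timedelta

# Threshold taken from the text: delays beyond 4 minutes suggest metric alerts.
DELAY_THRESHOLD = timedelta(minutes=4)


def alert_delay(record_time: datetime, fired_time: datetime) -> timedelta:
    """Return how long after the log record time the alert fired."""
    return fired_time - record_time


def should_consider_metric_alerts(delays) -> bool:
    """Return True if any observed delay exceeds the 4-minute threshold."""
    return any(d > DELAY_THRESHOLD for d in delays)


# Hypothetical sample: record written at 04:00:00, alert fired at 04:06:30.
record = datetime(2019, 3, 22, 4, 0, 0)
fired = datetime(2019, 3, 22, 4, 6, 30)
delay = alert_delay(record, fired)

print(delay)                                   # 0:06:30
print(should_consider_metric_alerts([delay]))  # True
```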

Incorrect query time range configured

Query time range is set in the rule condition definition. This field is called Period for workspaces and Application Insights, and Override query time range for all other resource types. As in Log Analytics, the time range limits query data to the specified period. The time range applies even if the query uses the ago command.

For example, when the time range is 60 minutes, a query scans only 60 minutes of data even if its text contains ago(1d). The time range and the query's time filtering need to match. In this example, changing the Period / Override query time range to one day would work as expected.
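The interaction between the two filters can be modeled as taking the narrower window. This is an illustrative sketch only; `effective_window` is a hypothetical helper, not an Azure Monitor API, and it assumes both filters end at the evaluation time.

```python
from datetime import datetime, timedelta


def effective_window(now, rule_time_range, query_ago):
    """Return (start, end) of the data the query can actually see.

    Both the rule's time range and the query's own ago() filter apply,
    so the narrower of the two determines the start of the window.
    """
    start = now - min(rule_time_range, query_ago)
    return start, now


now = datetime(2019, 3, 22, 12, 0, 0)

# Rule time range of 60 minutes, query text contains ago(1d).
start, end = effective_window(now, timedelta(minutes=60), timedelta(days=1))
print(end - start)  # 1:00:00

# After changing the rule time range to one day, the full day is scanned.
start_fixed, end_fixed = effective_window(now, timedelta(days=1), timedelta(days=1))
print(end_fixed - start_fixed)  # 1 day, 0:00:00
```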


Actions are muted in the alert rule

Log alerts provide an option to mute fired alert actions for a set amount of time. This field is called Suppress alerts in workspaces and Application Insights. In all other resource types, it's called Mute actions.

A common issue is assuming that the alert didn't fire its actions because of a service issue, even though the actions were muted by the rule configuration.
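To see why a muted rule can look like a service failure, consider a simplified model of muting. This sketch assumes the mute period starts when actions run and that later fires inside the period trigger no actions; the function and the sample times are hypothetical and only approximate the service's behavior.

```python
from datetime import datetime, timedelta


def actions_that_run(fire_times, mute_duration):
    """Return the fire times whose actions actually run.

    Simplified model: once actions run, any alert that fires within
    mute_duration afterwards still fires but triggers no actions.
    """
    ran = []
    muted_until = None
    for t in sorted(fire_times):
        if muted_until is None or t >= muted_until:
            ran.append(t)
            muted_until = t + mute_duration
    return ran


base = datetime(2019, 3, 22, 4, 0, 0)
fires = [base + timedelta(minutes=m) for m in (0, 5, 10, 35)]

ran = actions_that_run(fires, timedelta(minutes=30))
print(len(fires), len(ran))  # 4 2
```

In this model, the fires at minutes 5 and 10 are real alerts whose actions were muted, not missed evaluations.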


Metric measurement alert rule with splitting using the legacy Log Analytics API

Metric measurement is a type of log alert that is based on summarized time series results. These rules allow grouping by columns to split alerts. If you're using the legacy Log Analytics API, splitting won't work as expected, because the legacy API doesn't support choosing the grouping.

The current ScheduledQueryRules API lets you set Aggregate On in Metric measurement rules, which works as expected.

Log alert fired unnecessarily

A configured log alert rule in Azure Monitor might be triggered unexpectedly. The following sections describe some common reasons.

Alert triggered by partial data

Azure Monitor processes terabytes of customers' logs from across the world, which can cause log ingestion latency.

Logs are semi-structured data and inherently more latent than metrics. If you're experiencing many misfires in fired alerts, consider using metric alerts. You can send data from logs to the metric store using metric alerts for logs.

Log alerts work best when you try to detect data in the logs. They work less well when you try to detect a lack of data in the logs, for example, when alerting on a missing virtual machine heartbeat.

While there are built-in capabilities to prevent false alerts, they can still occur on very latent data (over ~30 minutes) and on data with latency spikes.
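The following sketch illustrates why "lack of data" rules are sensitive to ingestion latency. It is a simplified model, not the service's evaluation logic: each record carries a hypothetical generation time and ingestion time, and an evaluation only sees records already ingested.

```python
from datetime import datetime, timedelta


def heartbeat_missing(records, eval_time, window):
    """Evaluate a 'no heartbeat in the window' condition at eval_time.

    Each record is (generated_at, ingested_at). Only records that were
    already ingested at eval_time are visible to the evaluation.
    """
    window_start = eval_time - window
    visible = [
        generated
        for generated, ingested in records
        if ingested <= eval_time and window_start <= generated <= eval_time
    ]
    return len(visible) == 0


t0 = datetime(2019, 3, 22, 4, 0, 0)
# A heartbeat generated at 04:02 that was ingested 10 minutes later.
records = [(t0 + timedelta(minutes=2), t0 + timedelta(minutes=12))]

# At 04:05 the record hasn't arrived yet, so the rule sees no data: a false alert.
print(heartbeat_missing(records, t0 + timedelta(minutes=5), timedelta(minutes=5)))    # True
# At 04:15 with a 15-minute window, the late record is visible.
print(heartbeat_missing(records, t0 + timedelta(minutes=15), timedelta(minutes=15)))  # False
```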

Query optimization issues

The alerting service changes your query to optimize for lower load and alert latency. The alert flow was built to transform the results that indicate the issue into an alert. For example, consider a query like:

SecurityEvent
| where EventID == 4624

If the user's intent is to alert when this event type occurs, the alerting logic appends count to the query. The query that runs is:

SecurityEvent
| where EventID == 4624
| count

There's no need to add alerting logic to the query, and doing so may even cause issues. In the above example, if you include count in your query, it always results in the value 1, because the alert service counts the output of count.
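The "count of count" effect can be reproduced with a small stand-in for the count operator. This is only an illustration in Python; the `count` function below mimics the operator's behavior of collapsing its input into a single row and isn't how the service is implemented.

```python
def count(rows):
    """Mimic the query's count operator: one output row holding the input row count."""
    return [{"Count": len(rows)}]


# Five hypothetical matching events.
events = [{"EventID": 4624} for _ in range(5)]

# The service appends count to the user's query.
print(count(events))         # [{'Count': 5}]

# If the user's query already ends in count, the service counts that single row.
print(count(count(events)))  # [{'Count': 1}]
```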

The optimized query is what the log alert service runs. You can run the modified query in the Log Analytics portal or API.

For workspaces and Application Insights, it's called Query to be executed in the condition pane. In all other resource types, select See final alert Query in the condition tab.


Log alert was disabled

The following sections list some reasons why Azure Monitor might disable a log alert rule. They also include an example of the activity log that is sent when a rule is disabled.

Alert scope no longer exists or was moved

When the scope resources of an alert rule are no longer valid, execution of the rule fails. In this case, billing stops as well.

Azure Monitor disables the log alert after a week if it fails continuously.
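The "fails continuously for a week" condition can be pictured as a streak check over evaluation results. The sketch below is a hypothetical model: the function, the daily evaluation cadence, and the exact reset behavior on success are assumptions, and only the one-week limit comes from the text.

```python
from datetime import datetime, timedelta

FAILURE_LIMIT = timedelta(days=7)  # "a week", per the text


def disable_time(evaluations):
    """Return when a rule would be disabled in this model, or None.

    evaluations: chronological list of (timestamp, succeeded).
    The rule is disabled once failures have continued, with no success
    in between, for at least FAILURE_LIMIT.
    """
    first_failure = None
    for timestamp, succeeded in evaluations:
        if succeeded:
            first_failure = None  # a success resets the streak
        else:
            if first_failure is None:
                first_failure = timestamp
            if timestamp - first_failure >= FAILURE_LIMIT:
                return timestamp
    return None


start = datetime(2019, 3, 15)
always_failing = [(start + timedelta(days=d), False) for d in range(8)]
one_success = [(start + timedelta(days=d), d == 3) for d in range(8)]

print(disable_time(always_failing))  # 2019-03-22 00:00:00
print(disable_time(one_success))     # None
```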

Query used in a log alert isn't valid

When a log alert rule is created, the query is validated for correct syntax. But sometimes, the query provided in the log alert rule can start to fail. Some common reasons are:

  • Rules were created via the API and validation was skipped by the user.
  • The query runs on multiple resources and one or more of the resources was deleted or moved.
  • The query fails because:
  • Changes in query language include a revised format for commands and functions. So the query provided earlier is no longer valid.

Azure Advisor warns you about this behavior. It adds a recommendation about the affected log alert rule. The category used is 'High Availability' with medium impact and a description of 'Repair your log alert rule to ensure monitoring'.

Alert rule quota was reached

The number of log search alert rules per subscription and resource is subject to the quota limits described here.

If you've reached the quota limit, the following steps may help resolve the issue.

  1. Try deleting or disabling log search alert rules that aren't used anymore.

  2. Try splitting alerts by dimensions to reduce the rule count. These rules can monitor many resources and detection cases.

  3. If you need the quota limit increased, open a support request and provide the following information:

    • Subscription IDs and Resource IDs for which the quota limit needs to be increased.
    • Reason for quota increase.
    • Resource type for the quota increase: Log Analytics, Application Insights, and so on.
    • Requested quota limit.

To check the current usage of new log alert rules

From the Azure portal

  1. Open the Alerts screen, and select Manage alert rules.
  2. Filter to the relevant subscription using the Subscription dropdown control.
  3. Make sure NOT to filter to a specific resource group, resource type, or resource.
  4. In the Signal type dropdown control, select 'Log Search'.
  5. Verify that the Status dropdown control is set to 'Enabled'.
  6. The total number of log search alert rules is displayed above the rules list.

From API

Activity log example when rule is disabled

If a query fails for seven days continuously, Azure Monitor disables the log alert and stops billing for the rule. You can find the exact time when Azure Monitor disabled the log alert in the Azure Activity Log. See this example:

{
    "caller": "Microsoft.Insights/ScheduledQueryRules",
    "channels": "Operation",
    "claims": {
        "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/spn": "Microsoft.Insights/ScheduledQueryRules"
    },
    "correlationId": "abcdefg-4d12-1234-4256-21233554aff",
    "description": "Alert: test-bad-alerts is disabled by the System due to : Alert has been failing consistently with the same exception for the past week",
    "eventDataId": "f123e07-bf45-1234-4565-123a123455b",
    "eventName": {
        "value": "",
        "localizedValue": ""
    },
    "category": {
        "value": "Administrative",
        "localizedValue": "Administrative"
    },
    "eventTimestamp": "2019-03-22T04:18:22.8569543Z",
    "id": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
    "level": "Informational",
    "operationId": "",
    "operationName": {
        "value": "Microsoft.Insights/ScheduledQueryRules/disable/action",
        "localizedValue": "Microsoft.Insights/ScheduledQueryRules/disable/action"
    },
    "resourceGroupName": "<Resource Group>",
    "resourceProviderName": {
        "value": "MICROSOFT.INSIGHTS",
        "localizedValue": "Microsoft Insights"
    },
    "resourceType": {
        "value": "MICROSOFT.INSIGHTS/scheduledqueryrules",
        "localizedValue": "MICROSOFT.INSIGHTS/scheduledqueryrules"
    },
    "resourceId": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
    "status": {
        "value": "Succeeded",
        "localizedValue": "Succeeded"
    },
    "subStatus": {
        "value": "",
        "localizedValue": ""
    },
    "submissionTimestamp": "2019-03-22T04:18:22.8569543Z",
    "subscriptionId": "<SubscriptionId>",
    "properties": {
        "resourceId": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
        "subscriptionId": "<SubscriptionId>",
        "resourceGroup": "<ResourceGroup>",
        "eventDataId": "12e12345-12dd-1234-8e3e-12345b7a1234",
        "eventTimeStamp": "03/22/2019 04:18:22",
        "issueStartTime": "03/22/2019 04:18:22",
        "operationName": "Microsoft.Insights/ScheduledQueryRules/disable/action",
        "status": "Succeeded",
        "reason": "Alert has been failing consistently with the same exception for the past week"
    },
    "relatedEvents": []
}
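If you export activity log entries like the one above, a short script can pull out which rule was disabled, when, and why. This is a minimal sketch that assumes the entry has the same shape as the sample; the helper name is hypothetical, and the embedded JSON is an abbreviated copy of the sample containing only the fields the script reads.

```python
import json
from datetime import datetime

# Abbreviated copy of the sample activity log entry (only the fields used here).
entry_text = """
{
    "eventTimestamp": "2019-03-22T04:18:22.8569543Z",
    "operationName": {"value": "Microsoft.Insights/ScheduledQueryRules/disable/action"},
    "resourceId": "/SUBSCRIPTIONS/<subscriptionId>/RESOURCEGROUPS/<ResourceGroup>/PROVIDERS/MICROSOFT.INSIGHTS/SCHEDULEDQUERYRULES/TEST-BAD-ALERTS",
    "properties": {"reason": "Alert has been failing consistently with the same exception for the past week"}
}
"""

DISABLE_OPERATION = "microsoft.insights/scheduledqueryrules/disable/action"


def describe_disable_event(entry):
    """Return (rule name, disable time, reason) for a disable event, else None."""
    if entry["operationName"]["value"].lower() != DISABLE_OPERATION:
        return None
    # The rule name is the last segment of the resource ID.
    rule_name = entry["resourceId"].rsplit("/", 1)[-1]
    # Keep whole seconds only: the sample timestamp has 7 fractional digits,
    # which strptime's %f directive (at most 6 digits) would reject.
    disabled_at = datetime.strptime(entry["eventTimestamp"][:19], "%Y-%m-%dT%H:%M:%S")
    return rule_name, disabled_at, entry["properties"]["reason"]


name, when, reason = describe_disable_event(json.loads(entry_text))
print(name)    # TEST-BAD-ALERTS
print(when)    # 2019-03-22 04:18:22
print(reason)
```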

Next steps