Troubleshooting problems in Azure Monitor alerts

This article discusses common problems in Azure Monitor alerting and notifications.

Azure Monitor alerts proactively notify you when important conditions are found in your monitoring data. They allow you to identify and address issues before the users of your system notice them. For more information on alerting, see Overview of alerts in Azure.

If you have a problem with an alert firing or not firing when expected, refer to the articles below. You can see fired alerts in the Azure portal.

If the alert fires as intended according to the Azure portal but the proper notifications do not occur, use the information in the rest of this article to troubleshoot that problem.

Action or notification on my alert did not work as expected

If you can see a fired alert in the Azure portal, but have an issue with some of its actions or notifications, see the following sections.

Did not receive expected email

If you can see a fired alert in the Azure portal, but did not receive the email that you configured for it, follow these steps:

  1. Was the email suppressed by an action rule?

    Click the fired alert in the portal and check the History tab for suppressed action groups:

    [Screenshot: alert action rule suppression history]

  2. Is the type of action "Email Azure Resource Manager Role"?

    This action only looks at Azure Resource Manager role assignments that are at the subscription scope and of type User. Make sure that you have assigned the role at the subscription level, and not at the resource level or resource group level.

  3. Are your email server and mailbox accepting external emails?

    Verify that emails from these three addresses are not blocked:

    • azure-noreply@microsoft.com
    • azureemail-noreply@microsoft.com
    • alerts-noreply@mail.windowsazure.cn

    Internal mailing lists and distribution lists commonly block emails from external email addresses. You must allow mail from the addresses above.
    To test, add a regular work email address (not a mailing list) to the action group and see if alerts arrive at that address.

  4. Was the email processed by inbox rules or a spam filter?

    Verify that there are no inbox rules that delete those emails or move them to a side folder. For example, inbox rules could catch specific senders or specific words in the subject.

    Also, check:

    • the spam settings of your email client (like Outlook, Gmail)
    • the sender limits, spam settings, and quarantine settings of your email server (like Exchange, Office 365, G-suite)
    • the settings of your email security appliance, if any (like Barracuda, Cisco)

  5. Have you accidentally unsubscribed from the action group?

    The alert emails provide a link to unsubscribe from the action group. To check if you have accidentally unsubscribed from this action group, do one of the following:

    1. Open the action group in the portal and check the Status column:

    [Screenshot: action group Status column]

    2. Search your email for the unsubscribe confirmation:

    [Screenshot: unsubscribe confirmation from an alert action group]

    To subscribe again, either use the link in the unsubscribe confirmation email you received, or remove the email address from the action group and then add it back.

  6. Have you been rate limited due to many emails going to a single email address?

    Email is rate limited to no more than 100 emails per hour to each email address. If you pass this threshold, additional email notifications are dropped. Check if you have received a message indicating that your email address has been temporarily rate limited:

    [Screenshot: email rate limit notice]

    If you need to receive a high volume of notifications without rate limiting, consider using a different action, such as a webhook, logic app, Azure function, or automation runbook, none of which are rate limited.
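
To make the effect of the email limit in step 6 concrete, here is a minimal sketch of a per-address limiter. It assumes a sliding one-hour window; this article does not say whether the service uses a sliding or a fixed window, and the class name `HourlyRateLimiter` and the address `ops@example.com` are hypothetical.

```python
from collections import deque


class HourlyRateLimiter:
    """Sliding-window limiter: at most `limit` sends per `window_seconds` per address.

    Illustration only. The window type (sliding) is an assumption.
    """

    def __init__(self, limit=100, window_seconds=3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self._sent = {}  # address -> deque of send timestamps, in seconds

    def allow(self, address, now):
        """Return True if a notification to `address` at time `now` would be sent."""
        timestamps = self._sent.setdefault(address, deque())
        # Discard sends that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # over the threshold: this notification is dropped
        timestamps.append(now)
        return True


# 101 alerts to the same address, one per second: the last one is dropped.
limiter = HourlyRateLimiter()
results = [limiter.allow("ops@example.com", now=i) for i in range(101)]
```

Under these assumptions, a burst of alerts routed to one mailbox loses everything past the 100th email until earlier sends age out, while other addresses are unaffected.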

Did not receive expected SMS, voice call, or push notification

If you can see a fired alert in the portal, but did not receive the SMS, voice call, or push notification that you configured for it, follow these steps:

  1. Was the action suppressed by an action rule?

    Click the fired alert in the portal and check the History tab for suppressed action groups:

    [Screenshot: alert action rule suppression history]

    If that was unintentional, you can modify, disable, or delete the action rule.

  2. SMS / voice: Is your phone number correct?

    Check the SMS action for typos in the country code or phone number.

  3. SMS / voice: Have you been rate limited?

    SMS and voice calls are rate limited to no more than one notification every five minutes per phone number. If you pass this threshold, the notifications are dropped.

    • Voice call: check your call history to see if you had a different call from Azure in the preceding five minutes.
    • SMS: check your SMS history for a message indicating that your phone number has been rate limited.

    If you need to receive a high volume of notifications without rate limiting, consider using a different action, such as a webhook, logic app, Azure function, or automation runbook, none of which are rate limited.

  4. SMS: Have you accidentally unsubscribed from the action group?

    Open your SMS history and check if you have opted out of SMS delivery from this specific action group (using the DISABLE action_group_short_name reply) or from all action groups (using the STOP reply). To subscribe again, either send the relevant SMS command (ENABLE action_group_short_name or START), or remove the SMS action from the action group and then add it back. For more information, see SMS alert behavior in action groups.

  5. Have you accidentally blocked the notifications on your phone?

    Most mobile phones allow you to block calls or SMS from specific phone numbers or short codes, or to block push notifications from specific apps (such as the Azure mobile app). To check if you accidentally blocked the notifications on your phone, search the documentation for your phone's operating system and model, or test with a different phone and phone number.

Expected another type of action to trigger, but it did not

If you can see a fired alert in the portal, but its configured action did not trigger, follow these steps:

  1. Was the action suppressed by an action rule?

    Click the fired alert in the portal and check the History tab for suppressed action groups:

    [Screenshot: alert action rule suppression history]

    If that was unintentional, you can modify, disable, or delete the action rule.

  2. Did a webhook not trigger?

    1. Have the source IP addresses been blocked?

      Add the IP addresses that the webhook is called from to your allow list.

    2. Does your webhook endpoint work correctly?

      Verify that the webhook endpoint you configured is correct and that the endpoint is working. Check your webhook logs, or instrument its code so that you can investigate (for example, log the incoming payload).

    3. Did your webhook become unresponsive or return errors?

      The timeout period for a webhook response is 10 seconds. The webhook call is retried up to two additional times when one of the following HTTP status codes is returned: 408, 429, 503, 504, or when the HTTP endpoint does not respond. The first retry happens after 10 seconds. The second and final retry happens after 100 seconds. If the second retry fails, the endpoint is not called again for 30 minutes for any action group.
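
The retry behavior described above can be walked through with a small simulation. This is a sketch under stated assumptions, not the service's implementation: the function name `simulate_webhook_delivery` is hypothetical, each retry delay is assumed to be measured from the previous failed attempt, and any 2xx response is assumed to count as delivered.

```python
RETRYABLE_STATUS_CODES = {408, 429, 503, 504}
RETRY_DELAYS_SECONDS = (10, 100)  # wait before the first and the second retry
TIMEOUT_SECONDS = 10              # a response slower than this counts as "no response"


def simulate_webhook_delivery(responses):
    """Walk through the retry policy for a sequence of endpoint responses.

    `responses` holds one entry per attempt: an HTTP status code, or None
    when the endpoint did not respond within TIMEOUT_SECONDS.
    Returns (delivered, attempts_made, seconds_waited_between_attempts).
    """
    waited = 0
    max_attempts = 1 + len(RETRY_DELAYS_SECONDS)  # initial call plus two retries
    for attempt, status in enumerate(responses[:max_attempts], start=1):
        retryable = status is None or status in RETRYABLE_STATUS_CODES
        if not retryable:
            # Any other status ends the sequence (assumption: 2xx means delivered).
            return (200 <= status < 300, attempt, waited)
        if attempt < max_attempts:
            waited += RETRY_DELAYS_SECONDS[attempt - 1]
    # All attempts exhausted: per the text, the endpoint is then skipped for 30 minutes.
    return (False, min(len(responses), max_attempts), waited)


recovered = simulate_webhook_delivery([503, None, 200])
exhausted = simulate_webhook_delivery([429, 429, 429])
```

In this model, an endpoint that recovers by the third attempt still receives the alert, roughly 110 seconds after the first call. An endpoint that fails all three attempts misses it, which is one reason to keep webhook handlers fast and to return a retryable status only for transient failures.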

Action or notification happened more than once

If you received a notification for an alert (such as an email or an SMS) more than once, or the alert's action (such as a webhook or Azure function) was triggered multiple times, follow these steps:

  1. Is it really the same alert?

    In some cases, multiple similar alerts are fired at around the same time, so it might only seem that the same alert triggered its actions multiple times. For example, an activity log alert rule that does not filter on the event status field fires both when an event starts and when it finishes (succeeded or failed).

    To check if these actions or notifications came from different alerts, examine the alert details, such as the timestamp and either the alert ID or the correlation ID. Alternatively, check the list of fired alerts in the portal. If they did come from different alerts, you need to adapt the alert rule logic or otherwise configure the alert source.

  2. Does the action repeat in multiple action groups?

    When an alert is fired, each of its action groups is processed independently. So, if an action (such as an email address) appears in multiple triggered action groups, it is called once per action group.

    To check which action groups were triggered, check the alert's History tab. It shows both the action groups defined in the alert rule and the action groups added to the alert by action rules:

    [Screenshot: action repeated in multiple action groups]
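
Both checks above come down to grouping the notifications you received by alert ID. The sketch below assumes you have collected them into records by hand; the field names `alert_id` and `action_group` and the sample values are hypothetical, not part of any Azure payload.

```python
from collections import defaultdict


def group_by_alert(notifications):
    """Map each alert ID to the action groups that delivered a notification for it."""
    groups = defaultdict(list)
    for item in notifications:
        groups[item["alert_id"]].append(item["action_group"])
    return dict(groups)


# Three notifications that looked like duplicates at first glance.
received = [
    {"alert_id": "alert-1", "action_group": "ops-email"},
    {"alert_id": "alert-1", "action_group": "oncall"},
    {"alert_id": "alert-2", "action_group": "ops-email"},
]
summary = group_by_alert(received)
```

Here `alert-1` reached you twice because two action groups were triggered (step 2), while `alert-2` is a separate alert altogether (step 1).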

Action or notification has unexpected content

If you received the alert, but believe some of its fields are missing or incorrect, follow these steps:

  1. Did you pick the correct format for the action?

    Each action type (email, webhook, etc.) has two formats: the default, legacy format and the newer common schema format. When you create an action group, you specify the format you want per action, so different actions in the same action group may use different formats.

    For example, for a webhook action:

    [Screenshot: webhook action schema option]

    Check if the format specified at the action level is what you expect. For example, you may have developed code that responds to alerts (webhook, function, logic app, etc.) and expects one format, but later you or another person specified a different format in the action.

    Also, check the payload format (JSON) for activity log alerts, for log search alerts (both Application Insights and Log Analytics), for metric alerts, for the common alert schema, and for the deprecated classic metric alerts.

  2. Activity log alerts: Is the information available in the activity log?

    Activity log alerts are based on events written to the Azure Activity Log, such as events about creating, updating, or deleting Azure resources, service health and resource health events, or findings from Azure Advisor and Azure Policy. If you received an alert based on the activity log but some fields that you need are missing or incorrect, first check the events in the activity log itself. If the Azure resource did not write the fields you are looking for in its activity log event, those fields are not included in the corresponding alert.
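
When debugging a format mismatch (step 1), it can help to have the receiving code report which format it actually got. The sketch below assumes that common alert schema payloads carry a top-level `schemaId` of `azureMonitorCommonAlertSchema` and place the alert ID under `data.essentials.alertId`; verify both against the common alert schema reference before relying on them. The function name `describe_payload` and the sample bodies are hypothetical.

```python
import json

# Assumed marker of the common alert schema; confirm against the schema reference.
COMMON_SCHEMA_ID = "azureMonitorCommonAlertSchema"


def describe_payload(body):
    """Summarize an incoming alert payload so format mismatches are easy to spot."""
    payload = json.loads(body)
    if payload.get("schemaId") == COMMON_SCHEMA_ID:
        essentials = payload.get("data", {}).get("essentials", {})
        return {"format": "common", "alert_id": essentials.get("alertId")}
    # Anything else is treated as a legacy, alert-type-specific format.
    return {"format": "other", "top_level_keys": sorted(payload)}


common_body = json.dumps({
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {"essentials": {"alertId": "example-alert-id"}},
})
legacy_body = json.dumps({"status": "Activated", "context": {}})

common_info = describe_payload(common_body)
legacy_info = describe_payload(legacy_body)
```

Logging this summary for each request shows at a glance whether the action was switched to a different format than the one your code expects.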

Action rule is not working as expected

If you can see a fired alert in the portal, but a related action rule did not work as expected, follow these steps:

  1. Is the action rule enabled?

    Check the action rule Status column to verify that the related action rule is enabled.

    If it is not enabled, you can enable the action rule by selecting it and clicking Enable.

  2. Is it a service health alert?

    Service health alerts (monitor service = "Service Health") are not affected by action rules.

  3. Did the action rule act on your alert?

    To check if the action rule processed your alert, click the fired alert in the portal and look at the History tab.

    Here is an example of an action rule suppressing all action groups:

    [Screenshot: alert action rule suppression history]

    Here is an example of an action rule adding another action group:

    [Screenshot: action repeated in multiple action groups]

  4. Do the action rule scope and filter match the fired alert?

    If you think the action rule should have fired but didn't, or that it shouldn't have fired but did, carefully compare the action rule's scope and filter conditions with the properties of the fired alert.

How to find the alert ID of a fired alert

When opening a case about a specific fired alert (for example, if you did not receive its email notification), you need to provide the alert ID.

To locate it, follow these steps:

  1. In the Azure portal, navigate to the list of fired alerts and find that specific alert. You can use the filters to help you locate it.

  2. Click the alert to open the alert details.

  3. Scroll down in the alert fields of the first tab (the Summary tab) until you locate the alert ID, and copy it. That field also includes a "Copy to clipboard" helper button.

    [Screenshot: locating the alert ID]

Problem creating, updating, or deleting action rules in the Azure portal

If you received an error while trying to create, update, or delete an action rule, follow these steps:

  1. Did you receive a permission error?

    You should have either the Monitoring Contributor built-in role or the specific permissions related to action rules and alerts.

  2. Did you verify the action rule parameters?

    Check the action rule documentation, or the documentation for the action rule PowerShell command Set-AzActionRule.

Next steps