Smart Detection - Failure Anomalies

Application Insights automatically notifies you in near real time if your web app experiences an abnormal rise in the rate of failed requests. It detects an unusual rise in the rate of HTTP requests or dependency calls that are reported as failed. For requests, failed requests are usually those with response codes of 400 or higher. To help you triage and diagnose the problem, the notification includes an analysis of the characteristics of the failures and related telemetry. There are also links to the Application Insights portal for further diagnosis.

This feature works for any web app, hosted in the cloud or on your own servers, that generates request or dependency telemetry - for example, if you have a worker role that calls TrackRequest() or TrackDependency().

After you set up Application Insights for your project, and provided your app generates a certain minimum amount of telemetry, Smart Detection of failure anomalies takes 24 hours to learn the normal behavior of your app before it is switched on and can send alerts.

Here's a sample alert.

Sample Smart Detection alert showing cluster analysis around the failure

Note

By default, you get a shorter format mail than this example. But you can switch to this detailed format.

Notice that it tells you:

  • The failure rate compared to normal app behavior.
  • How many users are affected - so you know how much to worry.
  • A characteristic pattern associated with the failures. In this example, there's a particular response code, request name (operation), and app version. That immediately tells you where to start looking in your code. Other possibilities could be a specific browser or client operating system.
  • The exception, log traces, and dependency failure (databases or other external components) that appear to be associated with the characterized failures.
  • Links directly to relevant searches on the telemetry in Application Insights.

Failure Anomalies v2

A new version of the Failure Anomalies alert rule is now available. This new version runs on the new Azure alerting platform and introduces a variety of improvements over the existing version.

What's new in this version?

  • Faster detection of issues
  • A richer set of actions - The alert rule is created with an associated Action Group named "Application Insights Smart Detection" that contains email and webhook actions, and can be extended to trigger additional actions when the alert fires.
  • More focused notifications - Email notifications sent from this alert rule are now sent by default to users associated with the subscription's Monitoring Reader and Monitoring Contributor roles. More information on this is available here.
  • Easier configuration via ARM templates - See example here.
  • Common alert schema support - Notifications sent from this alert rule follow the common alert schema.
  • Unified email template - Email notifications from this alert rule have a consistent look and feel with other alert types. With this change, the option to get Failure Anomalies alerts with detailed diagnostics information is no longer available.
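
Because notifications follow the common alert schema, a webhook receiver can read the same fields for this alert as for other alert types. The following Python is a minimal sketch of that idea, not an official client. The field names under `data.essentials` reflect the common alert schema as we understand it and should be checked against the current schema documentation; the function name `summarize_alert` and all sample values are hypothetical.

```python
import json
from typing import Any, Dict


def summarize_alert(body: str) -> Dict[str, Any]:
    """Extract a few fields from a common-alert-schema webhook body.

    Assumption: the payload nests its common fields under data.essentials.
    Missing fields come back as None (or an empty list for targets).
    """
    payload = json.loads(body)
    essentials = payload.get("data", {}).get("essentials", {})
    return {
        "rule": essentials.get("alertRule"),
        "severity": essentials.get("severity"),
        "condition": essentials.get("monitorCondition"),
        "fired": essentials.get("firedDateTime"),
        "targets": essentials.get("alertTargetIDs", []),
    }


# Hypothetical, heavily trimmed payload used only for illustration.
SAMPLE_BODY = json.dumps({
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "Failure Anomalies - my-app",
            "severity": "Sev3",
            "monitorCondition": "Fired",
            "firedDateTime": "2020-01-01T00:00:00Z",
            "alertTargetIDs": ["/subscriptions/example/my-app"],
        },
        "alertContext": {},
    },
})

summary = summarize_alert(SAMPLE_BODY)
print(summary["rule"], summary["condition"])
```

Reading only from the shared essentials block keeps the handler independent of alert-type-specific context, which is the point of a common schema.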

How do I get the new version?

  • Newly created Application Insights resources are now provisioned with the new version of the Failure Anomalies alert rule.

Note

The new version of the Failure Anomalies alert rule remains free. In addition, email and webhook actions triggered by the associated "Application Insights Smart Detection" Action Group are free as well.

Benefits of Smart Detection

Ordinary metric alerts tell you there might be a problem. But Smart Detection starts the diagnostic work for you, performing a lot of the analysis you would otherwise have to do yourself. You get the results neatly packaged, helping you to get quickly to the root of the problem.

How it works

Smart Detection monitors the telemetry received from your app, and in particular the failure rates. This rule counts the number of requests for which the Successful request property is false, and the number of dependency calls for which the Successful call property is false. For requests, by default, Successful request == (resultCode < 400) (unless you have written custom code into a filter).
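
The default rule above can be sketched in a few lines of Python. This is an illustration only, not the service's implementation: the record type `RequestRecord` and its field names are hypothetical, and the optional `success` field stands in for a value set by custom code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RequestRecord:
    """Hypothetical request telemetry item."""
    name: str
    result_code: int
    # None means custom code did not set the flag, so the default rule applies.
    success: Optional[bool] = None


def is_successful(request: RequestRecord) -> bool:
    """Default rule: successful when the result code is below 400."""
    if request.success is not None:
        return request.success
    return request.result_code < 400


def count_failed(requests: Iterable[RequestRecord]) -> int:
    """Count the requests whose success flag evaluates to false."""
    return sum(1 for r in requests if not is_successful(r))


sample = [
    RequestRecord("GET /home", 200),
    RequestRecord("GET /missing", 404),
    RequestRecord("POST /save", 500),
    # Custom code marked this 404 as expected, so it is not counted as a failure.
    RequestRecord("GET /probe", 404, success=True),
]
print(count_failed(sample))
```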

Your app's performance has a typical pattern of behavior. Some requests or dependency calls will be more prone to failure than others, and the overall failure rate may go up as load increases.

As telemetry comes into Application Insights from your web app, Smart Detection compares the current behavior with the patterns seen over the past few days. If an abnormal rise in failure rate is observed by comparison with previous performance, an analysis is triggered.

When an analysis is triggered, the service performs a cluster analysis on the failed requests, to try to identify a pattern of values that characterize the failures. In the example above, the analysis has discovered that most failures are about a specific result code, request name, Server URL host, and role instance. By contrast, the analysis has discovered that the client operating system property is distributed over multiple values, and so it is not listed.
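
The idea of finding values that characterize the failures can be illustrated with a much simpler stand-in than real cluster analysis: for each property, report a value only if it dominates the failed requests. The source does not describe the service's actual algorithm, so the heuristic, the function name `characteristic_pattern`, and the `min_share` threshold below are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def characteristic_pattern(
    failed: List[Dict[str, str]], min_share: float = 0.8
) -> Dict[str, str]:
    """Return property -> value where one value covers at least
    min_share of the failed requests (a dominant-value heuristic)."""
    pattern: Dict[str, str] = {}
    if not failed:
        return pattern
    properties = sorted({key for item in failed for key in item})
    for prop in properties:
        counts = Counter(item[prop] for item in failed if prop in item)
        value, count = counts.most_common(1)[0]
        if count / len(failed) >= min_share:
            pattern[prop] = value
    return pattern


# Hypothetical failed requests: same result code and request name,
# but the client operating system is spread over several values.
failed_requests = [
    {"resultCode": "500", "name": "POST /save", "clientOS": os_name}
    for os_name in ["Windows", "macOS", "Linux", "Windows", "Android",
                    "iOS", "Linux", "macOS", "Windows", "Android"]
]
print(characteristic_pattern(failed_requests))
```

In this sample the client operating system is left out of the result for the same reason it is not listed in the alert: no single value accounts for most of the failures.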

When your service is instrumented with these telemetry calls, the analyzer looks for an exception and a dependency failure that are associated with requests in the cluster it has identified, together with an example of any trace logs associated with those requests.

The resulting analysis is sent to you as an alert, unless you have configured it not to.

As with the alerts you set manually, you can inspect the state of the alert and configure it in the Alerts blade of your Application Insights resource. But unlike other alerts, you don't need to set up or configure Smart Detection. If you want, you can disable it or change its target email addresses.

Alert logic details

The primary factors that are evaluated to determine if an alert should be triggered are:

  • Analysis of the failure percentage of requests/dependencies in a rolling time window of 20 minutes.
  • A comparison of the failure percentage of the last 20 minutes to the rate in the last 40 minutes and the past seven days, looking for significant deviations that exceed X times the standard deviation.
  • Using an adaptive limit for the minimum failure percentage, which varies based on the app's volume of requests/dependencies.
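
The three factors above can be combined into a rough decision function. The sketch below is one possible reading, not the service's logic: the value of X (`x=3.0`), the volume tiers in `adaptive_minimum`, and the use of mean plus X standard deviations as the threshold are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def failure_percentage(failed: int, total: int) -> float:
    """Failure percentage for one 20-minute window."""
    return 100.0 * failed / total if total else 0.0


def adaptive_minimum(total: int) -> float:
    """Hypothetical adaptive limit: low-volume apps need a higher
    failure percentage before an alert is considered."""
    if total < 100:
        return 20.0
    if total < 1000:
        return 10.0
    return 5.0


def should_alert(
    current_failed: int,
    current_total: int,
    recent_percentages: Sequence[float],    # windows from the last 40 minutes
    baseline_percentages: Sequence[float],  # windows from the past seven days
    x: float = 3.0,                         # assumed multiplier for "X times"
) -> bool:
    current = failure_percentage(current_failed, current_total)
    if current < adaptive_minimum(current_total):
        return False
    for history in (recent_percentages, baseline_percentages):
        if len(history) < 2:
            return False  # not enough history to estimate a deviation
        if current <= mean(history) + x * pstdev(history):
            return False
    return True


print(should_alert(225, 1000, [1.0, 1.2], [0.8, 1.0, 1.2, 1.1]))
```

Requiring the current window to stand out against both the short-term and the seven-day history is one way to avoid alerting on a rate that is high but normal for the app.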

Configure alerts

You can disable Smart Detection, change the email recipients, create a webhook, or opt in to more detailed alert messages.

Open the Alerts page. Failure Anomalies is included along with any alerts that you have set manually, and you can see whether it is currently in the alert state.

On the Overview page, click the Alerts tile.

Click the alert to configure it.

Configuration

Notice that you can disable Smart Detection, but you can't delete it (or create another one).

Detailed alerts

If you select "Get more detailed diagnostics" then the email will contain more diagnostic information. Sometimes you'll be able to diagnose the problem just from the data in the email.

There's a slight risk that the more detailed alert could contain sensitive information, because it includes exception and trace messages. However, this would only happen if your code could allow sensitive information into those messages.

Triaging and diagnosing an alert

An alert indicates that an abnormal rise in the failed request rate was detected. It's likely that there is some problem with your app or its environment.

From the percentage of requests and number of users affected, you can decide how urgent the issue is. In the example above, the failure rate of 22.5%, compared with a normal rate of 1%, indicates that something bad is going on. On the other hand, only 11 users were affected. If it were your app, you'd be able to assess how serious that is.

In many cases, you will be able to diagnose the problem quickly from the request name, exception, dependency failure, and trace data provided.

There are some other clues. For example, the dependency failure rate in this example is the same as the exception rate (89.3%). This suggests that the exception arises directly from the dependency failure - giving you a clear idea of where to start looking in your code.

To investigate further, the links in each section take you straight to a search page filtered to the relevant requests, exceptions, dependencies, or traces. Or you can open the Azure portal, navigate to the Application Insights resource for your app, and open the Failures blade.

In this example, clicking the 'View dependency failures details' link opens the Application Insights search blade. It shows the SQL statement that has an example of the root cause: NULLs were provided at mandatory fields and did not pass validation during the save operation.

Diagnostic search

Review recent alerts

Click Smart Detection to get to the most recent alert:

Alert summary

What's the difference ...

Smart Detection of failure anomalies complements other similar but distinct features of Application Insights.

  • Metric Alerts are set by you and can monitor a wide range of metrics such as CPU occupancy, request rates, page load times, and so on. You can use them to warn you, for example, if you need to add more resources. By contrast, Smart Detection of failure anomalies covers a small range of critical metrics (currently only failed request rate), and is designed to notify you in near real time once your web app's failed request rate increases significantly compared to the web app's normal behavior.

    Smart Detection automatically adjusts its threshold in response to prevailing conditions.

    Smart Detection starts the diagnostic work for you.

  • Smart Detection of performance anomalies also uses machine intelligence to discover unusual patterns in your metrics, and no configuration by you is required. But unlike Smart Detection of failure anomalies, the purpose of Smart Detection of performance anomalies is to find segments of your usage manifold that might be badly served - for example, by specific pages on a specific type of browser. The analysis is performed daily, and if any result is found, it's likely to be much less urgent than an alert. By contrast, the analysis for failure anomalies is performed continuously on incoming telemetry, and you will be notified within minutes if server failure rates are greater than expected.

If you receive a Smart Detection alert

Why have I received this alert?

  • We detected an abnormal rise in the failed request rate compared to the normal baseline of the preceding period. After analysis of the failures and associated telemetry, we think that there is a problem that you should look into.

Does the notification mean I definitely have a problem?

  • We try to alert on app disruption or degradation, but only you can fully understand the semantics and the impact on the app or users.

So, you guys look at my data?

  • No. The service is entirely automatic. Only you get the notifications. Your data is private.

Do I have to subscribe to this alert?

  • No. Every application that sends request telemetry has the Smart Detection alert rule.

Can I unsubscribe or get the notifications sent to my colleagues instead?

  • Yes. In Alert rules, click the Smart Detection rule to configure it. You can disable the alert, or change recipients for the alert.

I lost the email. Where can I find the notifications in the portal?

  • In the Activity logs. In Azure, open the Application Insights resource for your app, then select Activity logs.

Some of the alerts are about known issues and I do not want to receive them.

  • We have alert suppression on our backlog.

Next steps

These diagnostic tools help you inspect the telemetry from your app:

Smart detections are completely automatic. But maybe you'd like to set up some more alerts?