Smart Detection - Failure Anomalies

Application Insights automatically alerts you in near real time if your web app experiences an abnormal rise in the rate of failed requests. It detects an unusual rise in the rate of HTTP requests or dependency calls that are reported as failed. For requests, failed requests usually have response codes of 400 or higher. To help you triage and diagnose the problem, the alert details include an analysis of the characteristics of the failures and related application data. There are also links to the Application Insights portal for further diagnosis. The feature needs no setup or configuration, because it uses machine learning algorithms to predict the normal failure rate.

This feature works for any web app, hosted in the cloud or on your own servers, that generates application request or dependency data. For example, it works if you have a worker role that calls TrackRequest() or TrackDependency().

After you set up Application Insights for your project, and provided your app generates a certain minimum amount of data, Smart Detection of failure anomalies takes 24 hours to learn the normal behavior of your app before it is switched on and can send alerts.

Here's a sample alert:

The alert details will tell you:

  • The failure rate compared to normal app behavior.
  • How many users are affected, so you know how much to worry.
  • A characteristic pattern associated with the failures. In this example, there's a particular response code, request name (operation), and application version. That immediately tells you where to start looking in your code. Other possibilities could be a specific browser or client operating system.
  • The exception, log traces, and dependency failure (databases or other external components) that appear to be associated with the characterized failures.
  • Links directly to relevant searches on the data in Application Insights.

Benefits of Smart Detection

Ordinary metric alerts tell you there might be a problem. But Smart Detection starts the diagnostic work for you, performing much of the analysis you would otherwise have to do yourself. You get the results neatly packaged, helping you get quickly to the root of the problem.

How it works

Smart Detection monitors the data received from your app, and in particular the failure rates. This rule counts the number of requests for which the Successful request property is false, and the number of dependency calls for which the Successful call property is false. For requests, by default, Successful request == (resultCode < 400) (unless you have written custom code to filter or generate your own TrackRequest calls).
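The default success rule for requests can be illustrated with a short Python sketch. This is not code from the Application Insights SDK. The function names are hypothetical, and the sketch assumes result codes arrive as numeric strings such as "404".

```python
def is_successful_request(result_code: str) -> bool:
    """Default rule quoted in the text: success means resultCode < 400."""
    return int(result_code) < 400


def count_failures(result_codes: list[str]) -> int:
    """Count the requests that the default rule would mark as failed."""
    return sum(1 for code in result_codes if not is_successful_request(code))


# Hypothetical sample of result codes for five requests.
codes = ["200", "404", "500", "302", "201"]
print(count_failures(codes))  # 2 (the 404 and the 500)
```

If you override the success flag in custom code or in your own TrackRequest calls, the counts that Smart Detection sees follow your flag rather than this rule.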

Your app's performance has a typical pattern of behavior. Some requests or dependency calls are more prone to failure than others, and the overall failure rate may go up as load increases. Smart Detection uses machine learning to find anomalies relative to this typical pattern.

As data comes into Application Insights from your web app, Smart Detection compares the current behavior with the patterns seen over the past few days. If an abnormal rise in failure rate is observed by comparison with previous performance, an analysis is triggered.

When an analysis is triggered, the service performs a cluster analysis on the failed requests, to try to identify a pattern of values that characterize the failures.

In the example above, the analysis has discovered that most failures are about a specific result code, request name, server URL host, and role instance.
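The idea of characterizing failures by the property values they share can be illustrated with a deliberately simplified sketch. It uses plain frequency counting, not cluster analysis, and it is not the service's algorithm. The function name, the property names, and the 75% share threshold are all assumptions made for illustration.

```python
from collections import Counter


def dominant_values(failed_requests: list[dict], min_share: float = 0.75) -> dict:
    """Return, per property, the value shared by at least min_share of the failures.

    Simplified stand-in for cluster analysis: it only counts how often each
    value occurs. min_share is a hypothetical threshold.
    """
    total = len(failed_requests)
    pattern = {}
    if total == 0:
        return pattern
    keys = {key for request in failed_requests for key in request}
    for key in sorted(keys):
        counts = Counter(request.get(key) for request in failed_requests)
        value, count = counts.most_common(1)[0]
        if value is not None and count / total >= min_share:
            pattern[key] = value
    return pattern


# Hypothetical failed requests with three properties each.
failed = [
    {"resultCode": "500", "name": "GET /orders", "roleInstance": "web-1"},
    {"resultCode": "500", "name": "GET /orders", "roleInstance": "web-2"},
    {"resultCode": "500", "name": "GET /orders", "roleInstance": "web-1"},
    {"resultCode": "500", "name": "GET /cart", "roleInstance": "web-3"},
    {"resultCode": "500", "name": "GET /orders", "roleInstance": "web-1"},
]
print(dominant_values(failed))
```

Here the result code and request name are shared by most failures, so they appear in the pattern, while the role instance is spread over several values and is left out.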

When your service is instrumented to send exception, dependency, and trace telemetry, the analyzer looks for an exception and a dependency failure that are associated with requests in the cluster it has identified, together with an example of any trace logs associated with those requests.

The resulting analysis is sent to you as an alert, unless you have configured it not to.

As with the alerts you set manually, you can inspect the state of the fired alert, which can be resolved once the issue is fixed. Configure the alert rules in the Alerts page of your Application Insights resource. But unlike other alerts, you don't need to set up or configure Smart Detection. If you want, you can disable it or change its target email addresses.

Alert logic details

The alerts are triggered by our proprietary machine learning algorithm, so we can't share the exact implementation details. That said, we understand that you sometimes need to know more about how the underlying logic works. The primary factors that are evaluated to determine whether an alert should be triggered are:

  • Analysis of the failure percentage of requests/dependencies in a rolling time window of 20 minutes.
  • A comparison of the failure percentage of the last 20 minutes to the rate in the last 40 minutes and the past seven days, looking for significant deviations that exceed X times the standard deviation.
  • An adaptive limit for the minimum failure percentage, which varies based on the app's volume of requests/dependencies.
  • Logic that can automatically resolve the fired alert monitor condition, if the issue is no longer detected for 8-24 hours.
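The first three factors can be illustrated with a rough Python sketch. The real algorithm is proprietary and not published, so everything here is an assumption: the function names, the multiplier `x`, and the fixed `min_pct` floor, which stands in for the adaptive limit and does not adapt to request volume.

```python
from statistics import mean, stdev


def failure_percentage(failed: int, total: int) -> float:
    """Failure percentage for one time window; 0.0 when the window is empty."""
    return 100.0 * failed / total if total else 0.0


def is_anomalous(current_pct: float, baseline_pcts: list[float],
                 x: float = 3.0, min_pct: float = 5.0) -> bool:
    """Flag the current window when it deviates strongly from the baseline.

    x and min_pct are hypothetical values chosen for illustration only.
    """
    baseline_mean = mean(baseline_pcts)
    baseline_std = stdev(baseline_pcts) if len(baseline_pcts) > 1 else 0.0
    exceeds_deviation = current_pct > baseline_mean + x * baseline_std
    return current_pct >= min_pct and exceeds_deviation


# Hypothetical baseline failure percentages from earlier windows.
baseline = [1.0, 2.0, 1.5, 2.5, 2.0]
current = failure_percentage(failed=200, total=400)  # 50.0
print(is_anomalous(current, baseline))  # True
print(is_anomalous(2.2, baseline))      # False: below the minimum floor
```

The minimum floor keeps small absolute changes from firing an alert even when the baseline is very stable.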

Configure alerts

You can disable the Smart Detection alert rule from the portal or by using Azure Resource Manager (see template example).

This alert rule is created with an associated Action Group named "Application Insights Smart Detection" that contains email and webhook actions, and can be extended to trigger additional actions when the alert fires.

Note

Email notifications sent from this alert rule are now sent by default to users associated with the subscription's Monitoring Reader and Monitoring Contributor roles. Notifications sent from this alert rule follow the common alert schema.

Open the Alerts page. Failure Anomalies alert rules are listed along with any alerts that you have set manually, and you can see whether each one is currently in the alert state.

Click the alert to configure it.

Notice that you can disable or delete a Failure Anomalies alert rule, but you can't create another one on the same Application Insights resource.

Example of Failure Anomalies alert webhook payload

{
    "properties": {
        "essentials": {
            "severity": "Sev3",
            "signalType": "Log",
            "alertState": "New",
            "monitorCondition": "Resolved",
            "monitorService": "Smart Detector",
            "targetResource": "/subscriptions/4f9b81be-fa32-4f96-aeb3-fc5c3f678df9/resourcegroups/test-group/providers/microsoft.insights/components/test-rule",
            "targetResourceName": "test-rule",
            "targetResourceGroup": "test-group",
            "targetResourceType": "microsoft.insights/components",
            "sourceCreatedId": "1a0a5b6436a9b2a13377f5c89a3477855276f8208982e0f167697a2b45fcbb3e",
            "alertRule": "/subscriptions/4f9b81be-fa32-4f96-aeb3-fc5c3f678df9/resourcegroups/test-group/providers/microsoft.alertsmanagement/smartdetectoralertrules/failure anomalies - test-rule",
            "startDateTime": "2019-10-30T17:52:32.5802978Z",
            "lastModifiedDateTime": "2019-10-30T18:25:23.1072443Z",
            "monitorConditionResolvedDateTime": "2019-10-30T18:25:26.4440603Z",
            "lastModifiedUserName": "System",
            "actionStatus": {
                "isSuppressed": false
            },
            "description": "Failure Anomalies notifies you of an unusual rise in the rate of failed HTTP requests or dependency calls."
        },
        "context": {
            "DetectionSummary": "An abnormal rise in failed request rate",
            "FormattedOccurenceTime": "2019-10-30T17:50:00Z",
            "DetectedFailureRate": "50.0% (200/400 requests)",
            "NormalFailureRate": "0.0% (over the last 30 minutes)",
            "FailureRateChart": [["2019-10-30T05:20:00Z",
            0],
            ["2019-10-30T05:40:00Z",
            100],
            ["2019-10-30T06:00:00Z",
            0],
            ["2019-10-30T06:20:00Z",
            0],
            ["2019-10-30T06:40:00Z",
            100],
            ["2019-10-30T07:00:00Z",
            0],
            ["2019-10-30T07:20:00Z",
            0],
            ["2019-10-30T07:40:00Z",
            100],
            ["2019-10-30T08:00:00Z",
            0],
            ["2019-10-30T08:20:00Z",
            0],
            ["2019-10-30T08:40:00Z",
            100],
            ["2019-10-30T17:00:00Z",
            0],
            ["2019-10-30T17:20:00Z",
            0],
            ["2019-10-30T09:00:00Z",
            0],
            ["2019-10-30T09:20:00Z",
            0],
            ["2019-10-30T09:40:00Z",
            100],
            ["2019-10-30T10:00:00Z",
            0],
            ["2019-10-30T10:20:00Z",
            0],
            ["2019-10-30T10:40:00Z",
            100],
            ["2019-10-30T11:00:00Z",
            0],
            ["2019-10-30T11:20:00Z",
            0],
            ["2019-10-30T11:40:00Z",
            100],
            ["2019-10-30T12:00:00Z",
            0],
            ["2019-10-30T12:20:00Z",
            0],
            ["2019-10-30T12:40:00Z",
            100],
            ["2019-10-30T13:00:00Z",
            0],
            ["2019-10-30T13:20:00Z",
            0],
            ["2019-10-30T13:40:00Z",
            100],
            ["2019-10-30T14:00:00Z",
            0],
            ["2019-10-30T14:20:00Z",
            0],
            ["2019-10-30T14:40:00Z",
            100],
            ["2019-10-30T15:00:00Z",
            0],
            ["2019-10-30T15:20:00Z",
            0],
            ["2019-10-30T15:40:00Z",
            100],
            ["2019-10-30T16:00:00Z",
            0],
            ["2019-10-30T16:20:00Z",
            0],
            ["2019-10-30T16:40:00Z",
            100],
            ["2019-10-30T17:30:00Z",
            50]],
            "ArmSystemEventsRequest": "/subscriptions/4f9b81be-fa32-4f96-aeb3-fc5c3f678df9/resourceGroups/test-group/providers/microsoft.insights/components/test-rule/query?query=%0d%0a++++++++++++++++systemEvents%0d%0a++++++++++++++++%7c+where+timestamp+%3e%3d+datetime(%272019-10-30T17%3a20%3a00.0000000Z%27)+%0d%0a++++++++++++++++%7c+where+itemType+%3d%3d+%27systemEvent%27+and+name+%3d%3d+%27ProactiveDetectionInsight%27+%0d%0a++++++++++++++++%7c+where+dimensions.InsightType+in+(%275%27%2c+%277%27)+%0d%0a++++++++++++++++%7c+where+dimensions.InsightDocumentId+%3d%3d+%27718fb0c3-425b-4185-be33-4311dfb4deeb%27+%0d%0a++++++++++++++++%7c+project+dimensions.InsightOneClassTable%2c+%0d%0a++++++++++++++++++++++++++dimensions.InsightExceptionCorrelationTable%2c+%0d%0a++++++++++++++++++++++++++dimensions.InsightDependencyCorrelationTable%2c+%0d%0a++++++++++++++++++++++++++dimensions.InsightRequestCorrelationTable%2c+%0d%0a++++++++++++++++++++++++++dimensions.InsightTraceCorrelationTable%0d%0a++++++++++++&api-version=2018-04-20",
            "LinksTable": [{
                "Link": "<a href=\"https://portal.azure.cn/#blade/AppInsightsExtension/ProactiveDetectionFeedBlade/ComponentId/{\"SubscriptionId\":\"4f9b81be-fa32-4f96-aeb3-fc5c3f678df9\",\"ResourceGroup\":\"test-group\",\"Name\":\"test-rule\"}/SelectedItemGroup/718fb0c3-425b-4185-be33-4311dfb4deeb/SelectedItemTime/2019-10-30T17:50:00Z/InsightType/5\" target=\"_blank\">View full details in Application Insights</a>"
            }],
            "SmartDetectorId": "FailureAnomaliesDetector",
            "SmartDetectorName": "Failure Anomalies",
            "AnalysisTimestamp": "2019-10-30T17:52:32.5802978Z"
        },
        "egressConfig": {
            "displayConfig": [{
                "rootJsonNode": null,
                "sectionName": null,
                "displayControls": [{
                    "property": "DetectionSummary",
                    "displayName": "What was detected?",
                    "type": "Text",
                    "isOptional": false,
                    "isPropertySerialized": false
                },
                {
                    "property": "FormattedOccurenceTime",
                    "displayName": "When did this occur?",
                    "type": "Text",
                    "isOptional": false,
                    "isPropertySerialized": false
                },
                {
                    "property": "DetectedFailureRate",
                    "displayName": "Detected failure rate",
                    "type": "Text",
                    "isOptional": false,
                    "isPropertySerialized": false
                },
                {
                    "property": "NormalFailureRate",
                    "displayName": "Normal failure rate",
                    "type": "Text",
                    "isOptional": false,
                    "isPropertySerialized": false
                },
                {
                    "chartType": "Line",
                    "xAxisType": "Date",
                    "yAxisType": "Percentage",
                    "xAxisName": "",
                    "yAxisName": "",
                    "property": "FailureRateChart",
                    "displayName": "Failure rate over last 12 hours",
                    "type": "Chart",
                    "isOptional": false,
                    "isPropertySerialized": false
                },
                {
                    "defaultLoad": true,
                    "displayConfig": [{
                        "rootJsonNode": null,
                        "sectionName": null,
                        "displayControls": [{
                            "showHeader": false,
                            "columns": [{
                                "property": "Name",
                                "displayName": "Name"
                            },
                            {
                                "property": "Value",
                                "displayName": "Value"
                            }],
                            "property": "tables[0].rows[0][0]",
                            "displayName": "All of the failed requests had these characteristics:",
                            "type": "Table",
                            "isOptional": false,
                            "isPropertySerialized": true
                        }]
                    }],
                    "property": "ArmSystemEventsRequest",
                    "displayName": "",
                    "type": "ARMRequest",
                    "isOptional": false,
                    "isPropertySerialized": false
                },
                {
                    "showHeader": false,
                    "columns": [{
                        "property": "Link",
                        "displayName": "Link"
                    }],
                    "property": "LinksTable",
                    "displayName": "Links",
                    "type": "Table",
                    "isOptional": false,
                    "isPropertySerialized": false
                }]
            }]
        }
    },
    "id": "/subscriptions/4f9b81be-fa32-4f96-aeb3-fc5c3f678df9/resourcegroups/test-group/providers/microsoft.insights/components/test-rule/providers/Microsoft.AlertsManagement/alerts/7daf8739-ca8a-4562-b69a-ff28db4ba0a5",
    "type": "Microsoft.AlertsManagement/alerts",
    "name": "Failure Anomalies - test-rule"
}
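
A webhook receiver might pull a few fields out of this payload as in the following minimal Python sketch. The field names come from the sample above. The function name `summarize_alert` is hypothetical, the payload is trimmed for brevity, and the "50.0% (200/400 requests)" text format is inferred from this one sample, so treat the parsing as an assumption.

```python
import json
import re

# Trimmed copy of the sample payload, keeping only the fields used below.
payload_text = """
{
  "properties": {
    "essentials": {
      "severity": "Sev3",
      "monitorCondition": "Resolved",
      "targetResourceName": "test-rule"
    },
    "context": {
      "DetectionSummary": "An abnormal rise in failed request rate",
      "DetectedFailureRate": "50.0% (200/400 requests)"
    }
  },
  "name": "Failure Anomalies - test-rule"
}
"""


def summarize_alert(payload: dict) -> dict:
    """Extract the main facts from a Failure Anomalies webhook payload."""
    essentials = payload["properties"]["essentials"]
    context = payload["properties"]["context"]
    # Assumed format: "<percent>% (<failed>/<total> requests)".
    match = re.match(r"([\d.]+)%\s*\((\d+)/(\d+)", context["DetectedFailureRate"])
    return {
        "name": payload["name"],
        "severity": essentials["severity"],
        "condition": essentials["monitorCondition"],
        "summary": context["DetectionSummary"],
        "failure_pct": float(match.group(1)) if match else None,
        "failed": int(match.group(2)) if match else None,
        "total": int(match.group(3)) if match else None,
    }


summary = summarize_alert(json.loads(payload_text))
print(summary["failure_pct"], summary["failed"], summary["total"])
```

If the rate text does not match the assumed format, the numeric fields are returned as None rather than raising an error.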

Triage and diagnose an alert

An alert indicates that an abnormal rise in the failed request rate was detected. It's likely that there is some problem with your app or its environment.

To investigate further, click 'View full details in Application Insights'. The links in this page take you straight to a search page filtered to the relevant requests, exceptions, dependencies, or traces.

You can also open the Azure portal, navigate to the Application Insights resource for your app, and open the Failures page.

Clicking 'Diagnose failures' helps you get more details and resolve the issue.

From the percentage of requests and number of users affected, you can decide how urgent the issue is. In the example above, the failure rate of 78.5%, compared with a normal rate of 2.2%, indicates that something bad is going on. On the other hand, only 46 users were affected. If it were your app, you'd be able to assess how serious that is.

In many cases, you will be able to diagnose the problem quickly from the request name, exception, dependency failure, and trace data provided.

In this example, there was an exception from SQL Database due to the request limit being reached.

Review recent alerts

Click Alerts in the Application Insights resource page to get to the most recently fired alerts:

What's the difference ...

Smart Detection of failure anomalies complements other similar but distinct features of Application Insights.

  • Metric Alerts are set by you and can monitor a wide range of metrics such as CPU occupancy, request rates, page load times, and so on. You can use them to warn you, for example, if you need to add more resources. By contrast, Smart Detection of failure anomalies covers a small range of critical metrics (currently only failed request rate), and is designed to notify you in near real time once your web app's failed request rate increases compared to the web app's normal behavior. Unlike metric alerts, Smart Detection automatically sets and updates thresholds in response to changes in the behavior. Smart Detection also starts the diagnostic work for you, saving you time in resolving issues.

  • Smart Detection of performance anomalies also uses machine intelligence to discover unusual patterns in your metrics, and requires no configuration by you. But unlike Smart Detection of failure anomalies, the purpose of Smart Detection of performance anomalies is to find segments of your usage manifold that might be badly served, for example by specific pages on a specific type of browser. The analysis is performed daily, and if any result is found, it's likely to be much less urgent than an alert. By contrast, the analysis for failure anomalies is performed continuously on incoming application data, and you will be notified within minutes if server failure rates are greater than expected.

If you receive a Smart Detection alert

Why have I received this alert?

  • We detected an abnormal rise in the failed request rate compared to the normal baseline of the preceding period. After analysis of the failures and associated application data, we think that there is a problem that you should look into.

Does the notification mean I definitely have a problem?

  • We try to alert on app disruption or degradation, but only you can fully understand the semantics and the impact on the app or users.

So, you are looking at my application data?

  • No. The service is entirely automatic. Only you get the notifications. Your data is private.

Do I have to subscribe to this alert?

  • No. Every application that sends request data has the Smart Detection alert rule.

Can I unsubscribe or get the notifications sent to my colleagues instead?

  • Yes. In Alert rules, click the Smart Detection rule to configure it. You can disable the alert, or change recipients for the alert.

I lost the email. Where can I find the notifications in the portal?

  • In the Activity logs. In Azure, open the Application Insights resource for your app, then select Activity logs.

Some of the alerts are about known issues and I do not want to receive them.

Next steps

These diagnostic tools help you inspect the data from your app:

Smart detections are automatic. But maybe you'd like to set up some more alerts?