Smart Detection - Performance Anomalies

Application Insights automatically analyzes the performance of your web application, and can warn you about potential problems. You might be reading this because you received one of our smart detection notifications.

This feature requires no special setup, other than configuring your app for Application Insights (on ASP.NET, Java, or Node.js, and in web page code). It becomes active once your app generates enough telemetry.

When would I get a smart detection notification?

Application Insights has detected that the performance of your application has degraded in one of these ways:

  • Response time degradation - Your app has started responding to requests more slowly than it used to. The change might have been rapid, for example because there was a regression in your latest deployment. Or it might have been gradual, perhaps caused by a memory leak.
  • Dependency duration degradation - Your app makes calls to a REST API, database, or other dependency. The dependency is responding more slowly than it used to.
  • Slow performance pattern - Your app appears to have a performance issue that is affecting only some requests. For example, pages are loading more slowly on one type of browser than on others, or requests are being served more slowly from one particular server. Currently, our algorithms look at page load times, request response times, and dependency response times.

Smart Detection requires at least 8 days of telemetry at a workable volume in order to establish a baseline of normal performance. So, after your application has been running for that period, any significant issue will result in a notification.

Does my app definitely have a problem?

No, a notification doesn't mean that your app definitely has a problem. It's simply a suggestion about something you might want to look at more closely.

How do I fix it?

The notifications include diagnostic information. Here's an example:

[Image: example of a Server Response Time Degradation detection]

  1. Triage. The notification shows you how many users or how many operations are affected. This can help you assign a priority to the problem.

  2. Scope. Is the problem affecting all traffic, or just some pages? Is it restricted to particular browsers or locations? This information can be obtained from the notification.

  3. Diagnose. Often, the diagnostic information in the notification will suggest the nature of the problem. For example, if response time slows down when the request rate is high, that suggests your server or dependencies are overloaded.

    Otherwise, open the Performance blade in Application Insights.

Configure Email Notifications

Smart Detection notifications are enabled by default and are sent to users who have owner, contributor, or reader access to the Application Insights resource. To change this, either click Configure in the email notification, or open Smart Detection settings in Application Insights.

[Image: Smart Detection settings]

  • You can use the unsubscribe link in the Smart Detection email to stop receiving the email notifications.

Emails about Smart Detection performance anomalies are limited to one email per day per Application Insights resource. The email is sent only if at least one new issue was detected on that day. You won't get repeats of any message.

FAQ

  • So, Azure staff look at my data?

    • No. The service is entirely automatic. Only you get the notifications. Your data is private.
  • Do you analyze all the data collected by Application Insights?

    • Not at present. Currently, we analyze request response time, dependency response time, and page load time. Analysis of additional metrics is on our backlog.
  • What types of application does this work for?

    • These degradations are detected in any application that generates the appropriate telemetry. If you installed Application Insights in your web app, then requests and dependencies are automatically tracked.
  • Can I create my own anomaly detection rules or customize existing rules?

    • Not yet, but you can:
  • How often is the analysis performed?

    • We run the analysis daily on the telemetry from the previous day (full day in the UTC time zone).
  • So does this replace metric alerts?

    • No. We don't commit to detecting every behavior that you might consider abnormal.
  • If I don't do anything in response to a notification, will I get a reminder?

    • No, you get a message about each issue only once. If the issue persists, it will be updated in the Smart Detection feed blade.
  • I lost the email. Where can I find the notifications in the portal?

    • In the Application Insights overview of your app, click the Smart Detection tile. There you'll be able to find all notifications from up to 90 days back.
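
To make the daily analysis window from the FAQ concrete, here is a minimal sketch of how "the previous full day in UTC" can be computed from any timestamp. The function name `previous_utc_day` is our own illustration, not part of Application Insights, and the service's internal scheduling may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def previous_utc_day(now: datetime) -> Tuple[datetime, datetime]:
    """Return the half-open window [start, end) covering the full UTC day before `now`.

    `now` is assumed to be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    now_utc = now.astimezone(timezone.utc)
    today_start = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return today_start - timedelta(days=1), today_start


# Example: 01:30 local time at UTC+2 is still the previous calendar day in UTC.
local_now = datetime(2024, 3, 10, 1, 30, tzinfo=timezone(timedelta(hours=2)))
window_start, window_end = previous_utc_day(local_now)
```

Telemetry recorded late in your local evening can therefore fall into a different analysis day than you might expect, which is worth keeping in mind when you compare a notification against a deployment time.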

How can I improve performance?

Slow and failed responses are among the biggest frustrations for website users, as you know from your own experience. So, it's important to address the issues.

Triage

First, does it matter? If a page is always slow to load, but only 1% of your site's users ever have to look at it, maybe you have more important things to think about. On the other hand, if only 1% of users open it, but it throws exceptions every time, that might be worth investigating.

Use the impact statement (affected users or % of traffic) as a general guide, but be aware that it isn't the whole story. Gather other evidence to confirm.

Consider the parameters of the issue. If it's geography-dependent, set up availability tests that include that region: there might simply be network issues in that area.

Diagnose slow page loads

Where is the problem? Is the server slow to respond, is the page very long, or does the browser have to do a lot of work to display it?

Open the Browsers metric blade. The segmented display of browser page load time shows where the time is going.

  • If Send Request Time is high, either the server is responding slowly, or the request is a POST with a lot of data. Look at the performance metrics to investigate response times.
  • Set up dependency tracking to see whether the slowness is due to external services or your database.
  • If Receiving Response is predominant, your page and its dependent parts (JavaScript, CSS, images, and so on, but not asynchronously loaded data) are long. Set up an availability test, and be sure to set the option to load dependent parts. When you get some results, open the detail of a result and expand it to see the load times of different files.
  • High Client Processing time suggests scripts are running slowly. If the reason isn't obvious, consider adding some timing code and sending the times in trackMetric calls.
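
The last tip, adding timing code and sending the times as metrics, can be sketched as follows. This is an illustration of the pattern only: `timed_metric` and the `track_metric` callback are hypothetical names standing in for whatever trackMetric call your SDK provides, and in a web page you would write the equivalent in JavaScript.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


@contextmanager
def timed_metric(name: str, track_metric: Callable[[str, float], None]) -> Iterator[None]:
    """Time the enclosed block and report its duration in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report even if the block raised, so slow failures are still visible.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        track_metric(name, elapsed_ms)


# Stand-in sink for illustration; a real app would forward to its telemetry SDK.
recorded: List[Tuple[str, float]] = []


def record(name: str, value: float) -> None:
    recorded.append((name, value))


with timed_metric("renderTable", record):
    sum(range(100_000))  # placeholder for the slow work being measured
```

Wrapping one suspect step at a time keeps the extra telemetry small and makes it easier to see which step dominates client processing time.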

Improve slow pages

There's a web full of advice on improving your server responses and page load times, so we won't try to repeat it all here. Here are a few tips that you probably already know about, just to get you thinking:

  • Slow loading because of big files: Load the scripts and other parts asynchronously. Use script bundling. Break the main page into widgets that load their data separately. Don't send plain old HTML for long tables: use a script to request the data as JSON or another compact format, then fill the table in place. There are great frameworks to help with all this. (They also entail big scripts, of course.)
  • Slow server dependencies: Consider the geographical locations of your components. For example, if you're using Azure, make sure the web server and the database are in the same region. Do queries retrieve more information than they need? Would caching or batching help?
  • Capacity issues: Look at the server metrics of response times and request counts. If response times peak disproportionately with peaks in request counts, it's likely that your servers are stretched.
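
For the capacity check in the last tip, one rough way to judge whether response times peak "disproportionately" is to compare how much each metric grows at the busiest interval relative to its typical value. The sketch below is our own heuristic, not something Application Insights computes: the function names, the use of the median as the typical value, and the growth comparison are all assumptions, and it expects non-zero counts and times.

```python
from statistics import median
from typing import List, Tuple


def peak_growth(samples: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Return (request growth, response-time growth) at the busiest interval.

    Each sample is (request_count, avg_response_ms) for one time interval.
    Growth is measured relative to the median across all intervals.
    """
    counts = [count for count, _ in samples]
    times = [ms for _, ms in samples]
    busiest_count, busiest_ms = max(samples, key=lambda s: s[0])
    return busiest_count / median(counts), busiest_ms / median(times)


def looks_stretched(samples: List[Tuple[int, float]]) -> bool:
    """Heuristic: response time grew by a larger factor than request volume did."""
    request_growth, time_growth = peak_growth(samples)
    return time_growth > request_growth


overloaded = [(100, 200.0), (110, 210.0), (105, 205.0), (300, 1500.0)]
healthy = [(100, 200.0), (110, 210.0), (105, 205.0), (300, 260.0)]
```

With these illustrative numbers, the first series is flagged because response time grows far faster than traffic at the peak, while the second is not. Treat the result as a prompt to look at the server metrics, not as proof of a capacity problem.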

Server Response Time Degradation

The response time degradation notification tells you:

  • The response time compared to the normal response time for this operation.
  • How many users are affected.
  • Average response time and 90th percentile response time for this operation on the day of the detection and 7 days before.
  • Count of requests for this operation on the day of the detection and 7 days before.
  • Correlation between degradation in this operation and degradations in related dependencies.
  • Links to help you diagnose the problem.
    • Profiler traces to help you view where operation time is spent (the link is available if Profiler trace examples were collected for this operation during the detection period).
    • Performance reports in Metric Explorer, where you can slice by time range and filters for this operation.
    • Search for this call to view specific call properties.
    • Failure reports - If count > 1, this means that there were failures in this operation that might have contributed to the performance degradation.
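
If you want to reproduce the kind of comparison the notification reports (average and 90th percentile on the detection day versus 7 days before) on your own exported durations, a minimal sketch looks like this. The function names are ours, and the percentile method used by the service is not documented here, so the "inclusive" interpolation below is an assumption.

```python
from statistics import mean, quantiles
from typing import Dict, List


def summarize(durations_ms: List[float]) -> Dict[str, float]:
    """Average and 90th percentile of a list of durations (needs at least 2 values)."""
    # quantiles(..., n=10) returns the 9 decile cut points; the last is the 90th percentile.
    p90 = quantiles(durations_ms, n=10, method="inclusive")[-1]
    return {"avg": mean(durations_ms), "p90": p90}


def degradation_ratio(detection_day: List[float], week_before: List[float]) -> Dict[str, float]:
    """Ratio of detection-day statistics to the same statistics 7 days before."""
    now, then = summarize(detection_day), summarize(week_before)
    return {key: now[key] / then[key] for key in now}


# Illustrative data: every request on the detection day takes twice as long.
week_before = [100.0 * i for i in range(1, 11)]
detection_day = [2.0 * d for d in week_before]
ratios = degradation_ratio(detection_day, week_before)
```

Looking at both statistics helps separate a uniform slowdown (both ratios rise together) from a tail problem (the 90th percentile rises much more than the average).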

Dependency Duration Degradation

Modern applications increasingly adopt a microservices design approach, which in many cases leads to heavy reliance on external services. For example, your application might rely on some data platform. Or, even if you build your own bot service, you will probably rely on a cognitive services provider to enable your bots to interact in more human ways, and on a data store service for the bot to pull its answers from.

Example dependency degradation notification:

[Image: example of a Dependency Duration Degradation detection]

Notice that it tells you:

  • The duration compared to the normal response time for this operation
  • How many users are affected
  • Average duration and 90th percentile duration for this dependency on the day of the detection and 7 days before
  • Number of dependency calls on the day of the detection and 7 days before
  • Links to help you diagnose the problem
    • Performance reports in Metric Explorer for this dependency
    • Search for this dependency's calls to view call properties
    • Failure reports - If count > 1, this means that there were failed dependency calls during the detection period that might have contributed to the duration degradation.
    • Open Analytics with queries that calculate this dependency's duration and count

Smart Detection of slow performing patterns

Application Insights finds performance issues that might affect only some portion of your users, or affect users only in some cases. For example, it can notify you when pages load more slowly on one type of browser than on others, or when requests are served more slowly from a particular server. It can also discover problems associated with combinations of properties, such as slow page loads in one geographical area for clients using a particular operating system.

Anomalies like these are very hard to detect just by inspecting the data, but are more common than you might think. Often they surface only when your customers complain. By that time, it's too late: the affected users are already switching to your competitors!

Currently, our algorithms look at page load times, request response times at the server, and dependency response times.

You don't have to set any thresholds or configure rules. Machine learning and data mining algorithms are used to detect abnormal patterns.

From the email alert, click the link to open the diagnostic report in Azure.

  • When shows the time the issue was detected.

  • What describes:

    • The problem that was detected;
    • The characteristics of the set of events that displayed the problem behavior.
  • The table compares the poorly performing set with the average behavior of all other events.
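
The comparison in that table can be approximated on your own telemetry export to sanity-check a notification. The sketch below is not the detection algorithm, which relies on machine learning and is not described here. It only reproduces the "segment versus everyone else" comparison, and the event shape (`duration_ms`, `browser`) and function name are hypothetical.

```python
from statistics import mean
from typing import Any, Dict, List


def compare_segment(events: List[Dict[str, Any]], prop: str, value: Any) -> Dict[str, float]:
    """Compare average duration of events where `prop == value` against all other events.

    Assumes both groups are non-empty; statistics.mean raises on empty input.
    """
    inside = [e["duration_ms"] for e in events if e.get(prop) == value]
    outside = [e["duration_ms"] for e in events if e.get(prop) != value]
    return {
        "segment_avg": mean(inside),
        "others_avg": mean(outside),
        "segment_share": len(inside) / len(events),
    }


# Illustrative events only; real telemetry would have many more properties.
events = [
    {"browser": "BrowserA", "duration_ms": 900.0},
    {"browser": "BrowserA", "duration_ms": 1100.0},
    {"browser": "BrowserB", "duration_ms": 200.0},
    {"browser": "BrowserC", "duration_ms": 300.0},
]
result = compare_segment(events, "browser", "BrowserA")
```

The share of events in the segment matters as much as the gap in averages: a large slowdown in a tiny segment may rank below a modest one that touches half your traffic.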

Click the links to open Metric Explorer and Search on the relevant reports, filtered on the time and properties of the slow performing set.

Modify the time range and filters to explore the telemetry.

Next steps

These diagnostic tools help you inspect the telemetry from your app:

Smart detections are completely automatic. But maybe you'd like to set up some more alerts?