Azure Monitor Metrics aggregation and display explained

This article explains how metrics are aggregated in the Azure Monitor time-series database that backs Azure Monitor platform metrics. It also applies to standard Application Insights metrics.

This is a complex topic, and you don't need to understand everything in this article to use Azure Monitor metrics effectively.

Overview and terms

When you add a metric to a chart, metrics explorer automatically preselects its default aggregation. The default makes sense in basic scenarios, but you can use different aggregations to gain more insight into the metric. Viewing different aggregations on a chart requires that you understand how metrics explorer handles them.

Let's define a few terms clearly first:

  • Metric value - A single measurement value gathered for a specific resource.
  • Time-series database - A database optimized for the storage and retrieval of data points that each contain a value and a corresponding timestamp.
  • Time period - A generic period of time.
  • Time interval - The period of time between the gathering of two metric values.
  • Time range - The time period displayed on a chart. The typical default is 24 hours. Only specific ranges are available.
  • Time granularity or time grain - The time period used to aggregate values together to allow display on a chart. Only specific granularities are available. The current minimum is 1 minute. The time granularity should be smaller than the selected time range to be useful; otherwise, just one value is shown for the entire chart.
  • Aggregation type - A type of statistic calculated from multiple metric values.
  • Aggregate - The process of taking multiple input values and using them to produce a single output value via the rules defined by the aggregation type. For example, taking an average of multiple values.

Summary of process

Metrics are a series of values, each stored with a timestamp. In Azure, most metrics are stored in the Azure Metrics time-series database. When you plot a chart, the values of the selected metrics are retrieved from the database and then separately aggregated based on the chosen time granularity (also known as time grain). You select the size of the time granularity using the metrics explorer time picker panel. If you don't make an explicit selection, the time granularity is automatically selected based on the currently selected time range. Once selected, the metric values that were captured during each time granularity interval are aggregated and placed onto the chart, one datapoint per interval.

Aggregation types

There are five basic aggregation types available in metrics explorer. Metrics explorer hides the aggregations that are irrelevant and cannot be used for a given metric.

  • Sum - The sum of all values captured over the aggregation interval. Sometimes referred to as the Total aggregation.
  • Count - The number of measurements captured over the aggregation interval. Count doesn't look at the value of the measurement, only the number of records.
  • Average - The average of the metric values captured over the aggregation interval. For most metrics, this value is Sum/Count.
  • Min - The smallest value captured over the aggregation interval.
  • Max - The largest value captured over the aggregation interval.
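
The five aggregation types can be illustrated with a minimal Python sketch. The function name `aggregate` and the sample values are hypothetical and chosen only for illustration; this is not how Azure Monitor implements aggregation internally.

```python
def aggregate(values):
    """Return the five basic aggregations for the values of one interval."""
    if not values:
        return None  # no measurements were captured in this interval
    total = sum(values)
    count = len(values)
    return {
        "Sum": total,
        "Count": count,
        "Average": total / count,  # Sum/Count, as for most metrics
        "Min": min(values),
        "Max": max(values),
    }

# Hypothetical metric values captured during a single aggregation interval.
interval_values = [4, 8, 6, 2]
result = aggregate(interval_values)
print(result)
```

Note that Count depends only on how many values were recorded, not on what the values are.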

For example, suppose a chart shows the Network Out Total metric for a VM using the SUM aggregation over the last 24 hours. The time range and granularity can be changed from the upper right of the chart, as seen in the following screenshot.

Screenshot showing the time range and time granularity picker.

For time granularity = 30 minutes and time range = 24 hours:

  • The chart is drawn from 48 datapoints. That is, 24 hours x 2 datapoints per hour (60 min/30 min), each aggregated from 1-minute datapoints.
  • The line chart connects 48 dots in the chart plot area.
  • Each datapoint represents the sum of all network out bytes sent during the relevant 30-minute time period.

Screenshot showing data on a line chart with the time range set to 24 hours and the time granularity set to 30 minutes.


If you switch the time granularity to 15 minutes, the chart is drawn from 96 aggregated datapoints. That is, 60 min/15 min = 4 datapoints per hour x 24 hours.

Screenshot showing data on a line chart with the time range set to 24 hours and the time granularity set to 15 minutes.

For a time granularity of 5 minutes, you get 24 x (60/5) = 288 points.

Screenshot showing data on a line chart with the time range set to 24 hours and the time granularity set to 5 minutes.

For a time granularity of 1 minute (the smallest possible on the chart), you get 24 x (60/1) = 1440 points.

Screenshot showing data on a line chart with the time range set to 24 hours and the time granularity set to 1 minute.
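
The datapoint counts for a 24-hour time range follow from dividing the time range by the time granularity. A small sketch of that arithmetic follows; the function name `datapoint_count` is a hypothetical helper, not part of any Azure API.

```python
def datapoint_count(time_range_minutes, grain_minutes):
    """Number of aggregated datapoints drawn for a time range and grain."""
    return time_range_minutes // grain_minutes

range_minutes = 24 * 60  # 24-hour time range
counts = {grain: datapoint_count(range_minutes, grain) for grain in (30, 15, 5, 1)}
print(counts)  # {30: 48, 15: 96, 5: 288, 1: 1440}
```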

The charts look different for these summations, as shown in the previous screenshots. Notice how this VM has a lot of output in a small time period relative to the rest of the time window.

The time granularity allows you to adjust the "signal-to-noise" ratio on a chart. Higher aggregations remove noise and smooth out spikes. Notice the variations in the 1-minute chart at the bottom and how they smooth out as you go to higher granularity values.

This smoothing behavior is important when you send this data to other systems, for example, alerts. Typically, you don't want to be alerted by very short spikes in CPU time over 90%. But if the CPU stays at 90% for 5 minutes, that's likely important. If you set up an alert rule on CPU (or any metric), making the time granularity larger can reduce the number of false alerts you receive.
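
To make the smoothing effect concrete, the following sketch compares a 90% threshold applied to 1-minute CPU values and to 5-minute averages. The CPU values, the threshold constant, and the helper `window_averages` are all hypothetical; real alert rules are evaluated by Azure Monitor, not by code like this.

```python
def window_averages(values, size):
    """Average consecutive, non-overlapping windows of `size` values."""
    return [sum(values[i:i + size]) / size for i in range(0, len(values), size)]

# Hypothetical 1-minute average CPU percentages for 10 minutes.
cpu_1min = [40, 42, 95, 41, 39, 92, 93, 94, 91, 95]
THRESHOLD = 90

# At 1-minute granularity, the single short spike in minute 3 also breaches.
breaches_1min = sum(v > THRESHOLD for v in cpu_1min)

# At 5-minute granularity, only the sustained load in minutes 6-10 breaches.
cpu_5min = window_averages(cpu_1min, 5)
breaches_5min = sum(v > THRESHOLD for v in cpu_5min)

print(breaches_1min, cpu_5min, breaches_5min)
```

With these sample values, the short spike disappears at the 5-minute grain while the sustained high CPU still crosses the threshold.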

It's important to establish what's "normal" for your workload to know which time interval is best. This is one of the benefits of dynamic alerts, a different topic that isn't covered here.

How the system collects metrics

Data collection varies by metric.

Measurement collection frequency

There are two types of collection periods.

  • Regular - The metric is gathered at a consistent time interval that does not vary.

  • Activity-based - The metric is gathered based on when a transaction of a certain type occurs. Each transaction has a metric entry and a timestamp. Because the entries are not gathered at regular intervals, the number of records over a given time period varies.

Granularity

The minimum time interval is 1 minute, but the underlying system may capture data faster depending on the metric. For example, CPU percentage is tracked every 15 seconds at a regular interval. Because HTTP failures are tracked as transactions, they can easily occur many more than once a minute. Other metrics, such as SQL Storage, are captured every 20 minutes. This choice is up to the individual resource provider and type. Most try to provide the smallest interval possible.

Dimensions, splitting, and filtering

Metrics are captured for each individual resource. However, the level at which the metrics are collected, stored, and able to be charted may vary. This level is represented by the additional metrics available in metric dimensions. Each individual resource provider gets to define how detailed the data it collects is. Azure Monitor only defines how such detail should be presented and stored.

When you chart a metric in metrics explorer, you have the option to "split" the chart by a dimension. Splitting a chart means that you are looking into the underlying data for more detail and seeing that data charted or filtered in metrics explorer.

For example, Microsoft.ApiManagement/service has Location as a dimension for many metrics.

  • Capacity is one such metric. Having the Location dimension implies that the underlying system stores a metric record for the capacity of each location, rather than just one for the aggregate amount. You can then retrieve or split out that information in a metric chart.

  • Overall Duration of Gateway Requests has 2 dimensions, Location and Hostname, which again let you know the location of a duration and which hostname it came from.

  • One of the more flexible metrics, Requests, has 7 different dimensions.

Check the Azure Monitor metrics supported article for details on each metric and the dimensions available. In addition, the documentation for each resource provider and type may provide additional information on the dimensions and what they measure.

You can use splitting and filtering together to dig into a problem. Below is an example of a graphic showing the Avg Disk Write Bytes for a group of VMs in a resource group. We have a rollup of all the VMs with this metric, but we may want to dig in to see which ones are actually responsible for the peaks around 6 AM. Are they the same machine? How many machines are involved?

Screenshot showing the total disk write bytes for all virtual machines in the Contoso Hotels resource group.


When we apply splitting, we can see the underlying data, but it's a bit of a mess. It turns out there are 20 VMs being aggregated into the chart above. In this case, we've hovered the mouse over the large peak at 6 AM, which tells us that CH-DCVM11 is the cause. But it's hard to see the rest of the data associated with that VM because other VMs clutter the chart.

Screenshot showing the total disk write bytes for all virtual machines in the Contoso Hotels resource group, split by virtual machine name.

Using filtering allows us to clean up the chart to see what's really happening. You can check or uncheck the VMs you want to see. Notice the dotted lines. Those are mentioned in a later section.

Screenshot showing the total disk write bytes for all virtual machines in the Contoso Hotels resource group, split and filtered by virtual machine name.

For more information on how to show split dimension data on a metrics explorer chart, see Advanced features of metrics explorer - filters and splitting.

NULL and zero values

When the system expects metric data from a resource but doesn't receive it, it records a NULL value. NULL is different from a zero value, which becomes important in the calculation of aggregations and charting. NULL values are not counted as valid measurements.

NULLs show up differently on different charts. Scatter plots skip showing a dot on the chart. Bar charts skip showing the bar. On line charts, NULLs can show up as dotted or dashed lines like those shown in the screenshot in the previous section. When calculating averages that include NULLs, there are fewer data points to take the average from. This behavior can sometimes result in an unexpected drop in values on a chart, though usually less so than if the value were converted to a zero and used as a valid datapoint.

With platform metrics, each resource provider decides whether to use zeros or NULLs based on what makes the most sense for a given metric.
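
The difference between NULL and zero in an average can be sketched as follows. Here Python's `None` stands in for NULL, and the sample values and the helper `average` are hypothetical.

```python
def average(values):
    """Average the valid measurements; None (NULL) entries are not counted."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None

# The same three intervals, with the missing measurement recorded two ways.
with_nulls = [50, None, 70]  # the missing value is excluded from the average
with_zeros = [50, 0, 70]     # the missing value is treated as a real 0

avg_null = average(with_nulls)
avg_zero = average(with_zeros)
print(avg_null, avg_zero)  # 60.0 40.0
```

With a NULL, the average is taken over two valid values; with a zero, the missing measurement drags the average down.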

Azure Monitor alerts use the values the resource provider writes to the metric database, so it's important to view the data first to learn how the resource provider handles NULLs.

How aggregation works

The metrics charts in the previous sections show different types of aggregated data. The system pre-aggregates the data so that the requested charts can be displayed more quickly without a lot of repeated computation.

In this example:

  • We are collecting a fictitious transactional metric called HTTP failures.
  • Server is a dimension for the HTTP failures metric.
  • We have 3 servers - Server A, B, and C.

To simplify the explanation, we'll start with the SUM aggregation type only.

Sub-minute to 1-minute aggregation

First, raw metric data is collected and stored in the Azure Monitor metrics database. In this case, each server has transaction records stored with a timestamp because Server is a dimension. Given that the smallest time period you can view as a customer is 1 minute, those timestamped records are first aggregated into 1-minute metric values for each individual server. The aggregation process for Server B is shown in the graphic below. Servers A and C are aggregated in the same way but have different data.

Screenshot showing sub-minute transaction entries aggregated into 1-minute entries.

The resulting 1-minute aggregated values are stored as new entries in the metrics database so they can be gathered for later calculations.

Screenshot showing multiple 1-minute aggregated entries across the Server dimension. Servers A, B, and C are shown individually.
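
The sub-minute to 1-minute step can be sketched as grouping timestamped records by the minute they fall in. This is an illustrative sketch only: the timestamps, the record layout, and the function `to_one_minute_sums` are assumptions, not the actual storage format of the Azure Monitor metrics database.

```python
from collections import defaultdict
from datetime import datetime

def to_one_minute_sums(records):
    """Sum (timestamp, value) records into 1-minute buckets."""
    sums = defaultdict(int)
    for timestamp, value in records:
        # Truncate the timestamp to the start of its minute.
        bucket = timestamp.replace(second=0, microsecond=0)
        sums[bucket] += value
    return dict(sorted(sums.items()))

# Hypothetical sub-minute HTTP failure records for one server.
records = [
    (datetime(2024, 1, 1, 0, 0, 5), 1),   # falls in minute 1
    (datetime(2024, 1, 1, 0, 1, 10), 2),  # falls in minute 2
    (datetime(2024, 1, 1, 0, 1, 40), 3),  # falls in minute 2
]

one_minute = to_one_minute_sums(records)
print(list(one_minute.values()))  # [1, 5]
```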

Dimension aggregation

The 1-minute calculations are then collapsed by dimension and again stored as individual records. In this case, all the data from all the individual servers is aggregated into a 1-minute interval metric and stored in the metrics database for use in later aggregations.

Screenshot showing multiple 1-minute aggregated entries for Servers A, B, and C aggregated into 1-minute "All Servers" entries.

For clarity, the following table shows the method of aggregation.

Period       Server A   Server B   Server C   Sum (A+B+C)
Minute 1     1          1          1          3
Minute 2     0          5          1          6
Minute 3     0          5          1          6
Minute 4     2          3          4          9
Minute 5     1          0          3          4
Minute 6     1          0          4          5
Minute 7     1          2          4          7
Minute 8     0          1          0          1
Minute 9     1          1          4          6
Minute 10    2          1          0          3

Only one dimension is shown above, but this same aggregation and storage process occurs for all dimensions that a metric supports.

  • Collect values into a 1-minute aggregated set by that dimension. Store those values.
  • Collapse the dimension into a 1-minute aggregated SUM. Store those values.
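
Collapsing the Server dimension for the SUM aggregation amounts to adding the per-server 1-minute values for each minute. The sketch below uses the per-server values from the table above; the variable names are illustrative only.

```python
# 1-minute SUM values per server for minutes 1 through 10 (from the table above).
server_a = [1, 0, 0, 2, 1, 1, 1, 0, 1, 2]
server_b = [1, 5, 5, 3, 0, 0, 2, 1, 1, 1]
server_c = [1, 1, 1, 4, 3, 4, 4, 0, 4, 0]

# Collapse the Server dimension: one SUM per minute across all servers.
all_servers = [a + b + c for a, b, c in zip(server_a, server_b, server_c)]
print(all_servers)  # [3, 6, 6, 9, 4, 5, 7, 1, 6, 3]
```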

Let's introduce another dimension of HTTP failures called NetworkAdapter. Let's say we had a varying number of adapters per server.

  • Server A has 1 adapter
  • Server B has 2 adapters
  • Server C has 3 adapters

We'd collect data for the following transactions separately. They would be marked with:

  • A time
  • A value
  • The server the transaction came from
  • The adapter the transaction came from

Each of those sub-minute streams would then be aggregated into 1-minute time-series values and stored in the Azure Monitor metrics database:

  • Server A, Adapter 1
  • Server B, Adapter 1
  • Server B, Adapter 2
  • Server C, Adapter 1
  • Server C, Adapter 2
  • Server C, Adapter 3

In addition, the following collapsed aggregations would also be stored:

  • Server A, Adapter 1 (because there is nothing to collapse, it would be stored again)
  • Server B, Adapter 1+2
  • Server C, Adapter 1+2+3
  • Servers ALL, Adapters ALL

This shows that metrics with large numbers of dimensions have a larger number of aggregations. It's not important to know all the permutations; just understand the reasoning. The system wants to have both the individual data and the aggregated data stored for quick retrieval on any chart. The system picks either the most relevant stored aggregation or the underlying raw data, depending on what you choose to display.
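
As a rough sketch of how the number of stored series grows, the following code enumerates only the series listed above for the Server and NetworkAdapter dimensions. The dictionary layout and the "ALL" label are assumptions made for illustration; the text does not describe the actual storage keys.

```python
# Adapters per server, as given in the example above.
adapters_per_server = {"A": 1, "B": 2, "C": 3}

# One individual 1-minute series per (server, adapter) pair.
individual = [
    (server, adapter)
    for server, count in adapters_per_server.items()
    for adapter in range(1, count + 1)
]

# Collapsed series: adapters collapsed per server, plus everything collapsed.
collapsed = [(server, "ALL") for server in adapters_per_server] + [("ALL", "ALL")]

print(len(individual), len(collapsed))  # 6 4
```

Even in this small example, two dimensions turn one metric into ten stored series.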

Aggregation with no dimensions

Because this metric has a Server dimension, you can get to the underlying data for Servers A, B, and C above via splitting and filtering, as explained earlier in this article. If the metric didn't have Server as a dimension, you as a customer could only access the aggregated 1-minute sums shown in black on the diagram, that is, the values 3, 6, 6, 9, and so on. The system also wouldn't do the underlying work to aggregate split values, because it would never use them in metrics explorer or send them out via the metrics REST API.

Viewing time granularities above 1 minute

If you ask for metrics at a larger granularity, the system uses the 1-minute aggregated sums to calculate the sums for the larger time granularities. Below, dotted lines show the summation method for the 2-minute and 5-minute time granularities. Again, we are showing just the SUM aggregation type for simplicity.

Screenshot showing multiple 1-minute aggregated entries across the Server dimension aggregated into 2-minute and 5-minute time periods.

For the 2-minute time granularity:

Period            Sums
Minute 1 & 2      (3 + 6) = 9
Minute 3 & 4      (6 + 9) = 15
Minute 5 & 6      (4 + 5) = 9
Minute 7 & 8      (7 + 1) = 8
Minute 9 & 10     (6 + 3) = 9

For the 5-minute time granularity:

Period                 Sums
Minute 1 through 5     3 + 6 + 6 + 9 + 4 = 28
Minute 6 through 10    5 + 7 + 1 + 6 + 3 = 22

The system uses whichever stored aggregated data gives the best performance.
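
The roll-up from 1-minute sums to larger time granularities can be sketched as summing consecutive, non-overlapping groups of the stored 1-minute values. The helper `roll_up` is hypothetical; the input values are the 1-minute sums across all servers from the example above.

```python
def roll_up(one_minute_sums, grain_minutes):
    """Sum consecutive groups of 1-minute sums into a larger time grain."""
    return [
        sum(one_minute_sums[i:i + grain_minutes])
        for i in range(0, len(one_minute_sums), grain_minutes)
    ]

# 1-minute sums across all servers for minutes 1 through 10.
one_minute_sums = [3, 6, 6, 9, 4, 5, 7, 1, 6, 3]

two_minute = roll_up(one_minute_sums, 2)
five_minute = roll_up(one_minute_sums, 5)
print(two_minute)   # [9, 15, 9, 8, 9]
print(five_minute)  # [28, 22]
```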

Below is the larger diagram for the above 1-minute aggregation process, with some of the arrows left out to improve readability.

Screenshot consolidating the three screenshots above: multiple 1-minute aggregated entries across the Server dimension, aggregated at 1-minute, 2-minute, and 5-minute intervals. Servers A, B, and C are shown individually.

More complex example

Following is a larger example using values for a fictitious metric called HTTP Response time in milliseconds. Here we introduce additional levels of complexity.

  1. We show aggregation for Sum, Count, Min, and Max and the calculation for Average.
  2. We show NULL values and how they affect calculations.

Consider the following example. The boxes and arrows show examples of how the values are aggregated and calculated.

The same 1-minute pre-aggregation process described in the previous section occurs for Sum, Count, Minimum, and Maximum. However, Average is NOT pre-aggregated. It is recalculated using aggregated data to avoid calculation errors.

Screenshot showing a complex example of aggregating and calculating the sum, count, minimum, maximum, and average from 1 minute to 10 minutes.

Consider minute 6 for the 1-minute aggregation as highlighted above. This minute is the point where Server B went offline and stopped reporting data, perhaps due to a reboot.

From Minute 6 above, the calculated 1-minute aggregation types are:

Aggregation type    Value           Notes
Sum                 53 + 20 = 73
Count               2               Shows the effect of NULLs. The value would have been 3 if the server had been online.
Minimum             20
Maximum             53
Average             73 / 2          Always the Sum divided by the Count. It's never stored and always recalculated for each time granularity using the aggregated numbers for that granularity. Notice the recalculation for the 5-minute and 10-minute time granularities as highlighted above.

The red text color indicates values that might be considered out of the normal range and shows how they propagate (or fail to) as the time granularity goes up. Notice how the Min and Max indicate there are underlying anomalies, while the Average and Sum lose that information as your time granularity goes up.

You can also see that the NULLs give a better calculation of the average than if zeros were used instead.
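
Why Average is recalculated from the aggregated Sum and Count, rather than pre-aggregated, can be seen in a small sketch. The Minute 6 values (Sum 73, Count 2) come from the example above; the Minute 5 values (Sum 60, Count 3) are hypothetical and chosen only to illustrate the effect.

```python
# Per-minute (Sum, Count) pairs. Minute 6 is taken from the example above;
# Minute 5 is a hypothetical minute in which all 3 servers reported.
minute_5 = (60, 3)
minute_6 = (73, 2)

# Recalculate the 2-minute Average from the aggregated Sum and Count.
total_sum = minute_5[0] + minute_6[0]
total_count = minute_5[1] + minute_6[1]
recalculated_average = total_sum / total_count

# Averaging the two 1-minute averages weights both minutes equally,
# even though they contain different numbers of measurements.
average_of_averages = (minute_5[0] / minute_5[1] + minute_6[0] / minute_6[1]) / 2

print(recalculated_average, average_of_averages)  # 26.6 28.25
```

The two results differ because the minutes have different Counts, which is the kind of calculation error that recalculating from Sum and Count avoids.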

Note

Though not the case in this example, Count is equal to Sum in cases where a metric is always captured with the value of 1. This is common when a metric tracks the occurrence of a transactional event, for example, the number of HTTP failures mentioned in a previous example in this article.

Next steps