Metrics Advisor frequently asked questions

What is the cost of my instance?

There is currently no cost to use your instance during the preview.

Why is the demo website read-only?

The demo website is publicly available. The instance is read-only to prevent accidental upload of any data.

Why can't I create the resource? The "Pricing tier" is unavailable and it says "You have already created 1 S0 for this subscription".

Screenshot: message shown when an F0 resource already exists

During public preview, only one instance of Metrics Advisor can be created per region under a subscription.

If you already have an instance in the same region under the same subscription, try creating the new instance in a different region or under a different subscription. You can also delete the existing instance and then create a new one.

If you have already deleted the existing instance but still see the error, wait about 20 minutes after the deletion before you create a new instance.

Basic concepts

What is multi-dimensional time-series data?

See the Multi-dimensional metric definition in the glossary.

How much data is needed for Metrics Advisor to start anomaly detection?

At minimum, one data point can trigger anomaly detection. This doesn't give the best accuracy, however. The service assumes a window of previous data points, using the value you specified as the "fill-gap" rule during data feed creation.

We recommend having some data before the timestamp that you want detection on. The recommended amount of data depends on the granularity of your data, as shown below.

| Granularity | Recommended data amount for detection |
| --- | --- |
| Less than 5 minutes | 4 days of data |
| 5 minutes to 1 day | 28 days of data |
| More than 1 day, to 31 days | 4 years of data |
| Greater than 31 days | 48 years of data |
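The table above can be read as a simple lookup on granularity. The following is a minimal sketch of that lookup, not part of the Metrics Advisor API: the function name `recommended_history` is hypothetical, and the treatment of values that fall exactly on a boundary follows the table's wording as closely as it allows.

```python
from datetime import timedelta


def recommended_history(granularity: timedelta) -> str:
    """Return the recommended amount of data for a given granularity.

    Thresholds are transcribed from the table above. The function name
    is an assumption made for this sketch.
    """
    if granularity < timedelta(minutes=5):
        return "4 days"
    if granularity <= timedelta(days=1):
        return "28 days"
    if granularity <= timedelta(days=31):
        return "4 years"
    return "48 years"


if __name__ == "__main__":
    for g in (timedelta(minutes=1), timedelta(hours=1), timedelta(days=7)):
        print(g, "->", recommended_history(g))
```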

Why doesn't Metrics Advisor detect anomalies from historical data?

Metrics Advisor is designed for detecting anomalies in live streaming data. There's a limit on how far back in the historical data the service will look and run anomaly detection. This means that only data points after a certain earliest timestamp will have anomaly detection results. That earliest timestamp depends on the granularity of your data.

The length of historical data that will have anomaly detection results depends on the granularity of your data, as shown below.

| Granularity | Maximum length of historical data for anomaly detection |
| --- | --- |
| Less than 5 minutes | Onboard time - 13 hours |
| 5 minutes to less than 1 hour | Onboard time - 4 days |
| 1 hour to less than 1 day | Onboard time - 14 days |
| 1 day | Onboard time - 28 days |
| Greater than 1 day, less than 31 days | Onboard time - 2 years |
| Greater than 31 days | Onboard time - 24 years |
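As an illustration of how the earliest timestamp follows from the table, here is a minimal sketch. It is not the service's implementation: the function name `earliest_detectable` is hypothetical, a year is approximated as 365 days, and a granularity of exactly 31 days (which the table doesn't cover) is placed in the last bucket.

```python
from datetime import datetime, timedelta

YEAR = timedelta(days=365)  # assumption: a year is approximated as 365 days


def earliest_detectable(onboard_time: datetime, granularity: timedelta) -> datetime:
    """Return the earliest timestamp that can have detection results."""
    if granularity < timedelta(minutes=5):
        lookback = timedelta(hours=13)
    elif granularity < timedelta(hours=1):
        lookback = timedelta(days=4)
    elif granularity < timedelta(days=1):
        lookback = timedelta(days=14)
    elif granularity == timedelta(days=1):
        lookback = timedelta(days=28)
    elif granularity < timedelta(days=31):
        lookback = 2 * YEAR
    else:
        # Exactly 31 days is not listed in the table; treated as this bucket.
        lookback = 24 * YEAR
    return onboard_time - lookback


if __name__ == "__main__":
    onboard = datetime(2021, 1, 29)
    print(earliest_detectable(onboard, timedelta(days=1)))
```

For a daily metric onboarded on 2021-01-29, the sketch returns 2021-01-01, so data points before that date would not get detection results.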

More concepts and technical terms

Also see the Glossary for more information.

How do I write a valid query for ingesting my data?

For Metrics Advisor to ingest your data, you need to create a query that returns the dimensions of your data at a single timestamp. Metrics Advisor runs this query multiple times to get the data for each timestamp.

The query should return at most one record for each dimension combination at a given timestamp. All records returned must have the same timestamp, and the query must not return duplicate records.

For example, suppose you create the query below for a daily metric:

select timestamp, city, category, revenue from sampledata where Timestamp >= @StartTime and Timestamp < dateadd(DAY, 1, @StartTime)

Be sure to use the correct granularity for your time series. For an hourly metric, you would use:

select timestamp, city, category, revenue from sampledata where Timestamp >= @StartTime and Timestamp < dateadd(hour, 1, @StartTime)

Note that these queries return data for a single timestamp only, and contain all of the dimension combinations to be ingested by Metrics Advisor.

Screenshot: query results with one timestamp
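The rules for a valid query result (a single timestamp, and at most one record per dimension combination) can be checked before onboarding a data feed. The sketch below is a hypothetical client-side check, not something Metrics Advisor provides. The function name `validate_batch` and the row format (a list of dictionaries keyed by column name) are assumptions; the column names come from the sample queries.

```python
from collections import Counter


def validate_batch(rows, dimension_columns, timestamp_column="timestamp"):
    """Return a list of problems found in one query result (empty if valid)."""
    problems = []

    # Rule 1: all records must share the same timestamp.
    timestamps = {row[timestamp_column] for row in rows}
    if len(timestamps) > 1:
        problems.append(f"multiple timestamps: {sorted(timestamps)}")

    # Rule 2: at most one record per dimension combination. This also
    # catches exact duplicate records.
    counts = Counter(tuple(row[c] for c in dimension_columns) for row in rows)
    for combo, n in counts.items():
        if n > 1:
            problems.append(f"{n} records for dimension combination {combo}")

    return problems


if __name__ == "__main__":
    rows = [
        {"timestamp": "2020-06-21", "city": "Beijing", "category": "Shoes", "revenue": 100.0},
        {"timestamp": "2020-06-21", "city": "Beijing", "category": "Shoes", "revenue": 120.0},
    ]
    print(validate_batch(rows, ["city", "category"]))
```

The sample rows are made up for illustration. Here the check reports two records for the same city and category, which the query would need to aggregate or filter before ingestion.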

How do I detect spikes and dips as anomalies?

If you have hard thresholds predefined, you can manually set a "hard threshold" in the anomaly detection configurations. If there are no thresholds, you can use "smart detection", which is powered by AI. For details, see Tune the detecting configuration.

How do I detect deviations from regular (seasonal) patterns as anomalies?

"Smart detection" is able to learn the pattern of your data, including seasonal patterns. It then detects data points that don't conform to the regular patterns as anomalies. For details, see Tune the detecting configuration.

How do I detect flat lines as anomalies?

If your data is normally quite unstable and fluctuates a lot, and you want to be alerted when it becomes too stable or even turns into a flat line, you can configure "Change threshold" to detect data points where the change is too small. For details, see Anomaly detection configurations.

Advanced concepts

How does Metrics Advisor build an incident tree for multi-dimensional metrics?

A metric can be split into multiple time series by dimensions. For example, the metric Response latency is monitored for all services owned by the team. The Service category could be used as a dimension to enrich the metric, so we get Response latency split by Service1, Service2, and so on. Each service could be deployed on different machines in multiple data centers, so the metric could be further split by Machine and Data center.

| Service | Data center | Machine |
| --- | --- | --- |
| S1 | DC1 | M1 |
| S1 | DC1 | M2 |
| S1 | DC2 | M3 |
| S1 | DC2 | M4 |
| S2 | DC1 | M1 |
| S2 | DC1 | M2 |
| S2 | DC2 | M5 |
| S2 | DC2 | M6 |
| ... | ... | ... |

Starting from the total Response latency, we can drill down into the metric by Service, Data center, and Machine. However, the path Service -> Data center -> Machine may make more sense for service owners, while the path Data center -> Machine -> Service may make more sense for infrastructure engineers. It all depends on the individual business requirements of your users.

In Metrics Advisor, users can specify any path they want to drill down or roll up from one node of the hierarchical topology. More precisely, the hierarchical topology is a directed acyclic graph rather than a tree structure. The full hierarchical topology consists of all potential dimension combinations, like this:

Diagram: hierarchical topology made up of interconnected vertices and edges, with dimensions labeled S, DC, and M and numbered from 1 to 6

In theory, if the dimension Service has Ls distinct values, the dimension Data center has Ldc distinct values, and the dimension Machine has Lm distinct values, then there could be (Ls + 1) * (Ldc + 1) * (Lm + 1) dimension combinations in the hierarchical topology.
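To make the formula concrete, the sketch below computes the theoretical number of dimension combinations and compares it with the combinations that actually occur in a set of rows. It is an illustration only: the function names are hypothetical, `None` stands in for a dimension that has been rolled up, and the data is limited to the eight rows listed explicitly in the Service / Data center / Machine table (the elided rows are not guessed).

```python
from itertools import product


def topology_size(*distinct_value_counts):
    """Theoretical node count: the product of (L + 1) over all dimensions."""
    size = 1
    for n in distinct_value_counts:
        size *= n + 1  # the extra 1 is the rolled-up (aggregated) value
    return size


def valid_nodes(rows):
    """Nodes that actually occur: each row with any subset of dimensions rolled up."""
    nodes = set()
    for row in rows:
        for mask in product((True, False), repeat=len(row)):
            nodes.add(tuple(v if keep else None for v, keep in zip(row, mask)))
    return nodes


# Only the eight rows shown explicitly in the table.
ROWS = [
    ("S1", "DC1", "M1"), ("S1", "DC1", "M2"),
    ("S1", "DC2", "M3"), ("S1", "DC2", "M4"),
    ("S2", "DC1", "M1"), ("S2", "DC1", "M2"),
    ("S2", "DC2", "M5"), ("S2", "DC2", "M6"),
]

if __name__ == "__main__":
    print(topology_size(2, 2, 6), len(valid_nodes(ROWS)))
```

For these eight rows (2 services, 2 data centers, 6 machines) the formula gives 3 * 3 * 7 = 63 combinations, while only 37 of them occur in the data.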

Usually, however, not all dimension combinations are valid, which can significantly reduce the complexity. Currently, if users aggregate the metric themselves, we don't limit the number of dimensions. If you need to use the rollup functionality provided by Metrics Advisor, the number of dimensions shouldn't be more than 6. We do, however, limit the number of time series expanded by dimensions for a metric to fewer than 10,000.

The Incident tree tool on the diagnostics page shows only the nodes where an anomaly has been detected, rather than the whole topology. This helps you focus on the current issue. It also may not show all anomalies within the metric, and instead displays the top anomalies based on contribution. In this way, you can quickly find the impact, scope, and spread path of the abnormal data. This significantly reduces the number of anomalies you need to focus on, and helps you understand and locate the key issues.

For example, when an anomaly occurs on Service = S2 | Data Center = DC2 | Machine = M5, the deviation impacts the parent node Service = S2, where an anomaly is also detected, but it doesn't affect the entire data center DC2 or all services on M5. The incident tree would be built as in the screenshot below: the top anomaly is captured on Service = S2, and the root cause can be analyzed along two paths that both lead to Service = S2 | Data Center = DC2 | Machine = M5.

Screenshot: five labeled vertices, with two distinct paths connected by edges to a common node labeled S2
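The two analysis paths in this example correspond to the two orders in which the remaining dimensions can be added when drilling down from Service = S2. The sketch below enumerates those paths. It only illustrates the directed acyclic graph idea and is not how Metrics Advisor builds the incident tree. The function name `drill_down_paths` and the dictionary representation of a node are assumptions.

```python
from itertools import permutations


def drill_down_paths(start, target):
    """List every drill-down path from `start` to `target`.

    Both arguments map dimension names to values, and every item in
    `start` must also appear in `target`. Each step fixes one more dimension.
    """
    missing = [dim for dim in target if dim not in start]
    paths = []
    for order in permutations(missing):
        node = dict(start)
        path = [dict(node)]
        for dim in order:
            node[dim] = target[dim]
            path.append(dict(node))
        paths.append(path)
    return paths


if __name__ == "__main__":
    top = {"Service": "S2"}
    root_cause = {"Service": "S2", "Data center": "DC2", "Machine": "M5"}
    for path in drill_down_paths(top, root_cause):
        print(" -> ".join(" | ".join(f"{k} = {v}" for k, v in n.items()) for n in path))
```

With the values from the example, one path goes through Service = S2 | Data center = DC2 and the other through Service = S2 | Machine = M5, and both end at the same node.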

Next steps