autocluster plugin

T | evaluate autocluster()

autocluster finds common patterns of discrete attributes (dimensions) in the data. It then reduces the results of the original query, whether it's 100 or 100k rows, to a small number of patterns. The plugin was developed to help analyze failures (such as exceptions or crashes) but can potentially work on any filtered data set.

Note

autocluster is largely based on the Seed-Expand algorithm from the following paper: Algorithms for Telemetry Data Mining using Discrete Attributes.

Syntax

T | evaluate autocluster( arguments )

Returns

The autocluster plugin returns a (usually small) set of patterns. The patterns capture portions of the data with shared common values across multiple discrete attributes. Each pattern in the results is represented by a row.

The first column is the segment ID. The next two columns are the count and percentage of rows out of the original query that are captured by the pattern. The remaining columns are from the original query. Each of their values is either a specific value from the column or a wildcard value (null by default), meaning that the column's value varies within the pattern.

The patterns aren't distinct, may overlap, and usually don't cover all the original rows. Some rows may not fall under any pattern.

Tip

Use where and project in the input pipe to reduce the data to just what you're interested in.

When you find an interesting row, you might want to drill into it further by adding its specific values to your where filter.

Arguments

Note

All arguments are optional.

T | evaluate autocluster([ SizeWeight , WeightColumn , NumSeeds , CustomWildcard , CustomWildcard , ...])

All arguments are optional, but they must be ordered as above. To indicate that the default value should be used, pass the tilde string '~' (see the "Example" entries in the table).

Argument: SizeWeight
Type, range, default: 0 < double < 1 [default: 0.5]
Description: Gives you some control over the balance between generic (high coverage) and informative (many shared) values. If you increase the value, it usually reduces the number of patterns, and each pattern tends to cover a larger percentage of the rows. If you decrease the value, it usually produces more specific patterns with more shared values and a smaller percentage coverage. The under-the-hood formula is a weighted geometric mean of the normalized generic score and the informative score, with weights SizeWeight and 1-SizeWeight.
Example: T | evaluate autocluster(0.8)

Argument: WeightColumn
Type, range, default: column_name
Description: Considers each row in the input according to the specified weight (by default each row has a weight of '1'). The argument must be the name of a numeric column (such as int, long, real). A common usage of a weight column is to take into account sampling or bucketing/aggregation of the data that is already embedded into each row.
Example: T | evaluate autocluster('~', sample_Count)

Argument: NumSeeds
Type, range, default: int [default: 25]
Description: The number of seeds determines the number of initial local search points of the algorithm. In some cases, depending on the structure of the data, increasing the number of seeds increases the number (or quality) of the results through the expanded search space, at the cost of a slower query. The value has diminishing returns in both directions: decreasing it below five achieves negligible performance improvements, and increasing it above 50 rarely generates additional patterns.
Example: T | evaluate autocluster('~', '~', 15)

Argument: CustomWildcard
Type, range, default: "any_value_per_type"
Description: Sets the wildcard value for a specific type in the results table. It indicates that the current pattern doesn't have a restriction on this column. The default is null; for strings, the default is an empty string. If the default is a valid value in the data, use a different wildcard value (such as *).
Example: T | evaluate autocluster('~', '~', '~', '*', int(-1), double(-1), long(0), datetime(1900-1-1))
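The WeightColumn description mentions data that has already been bucketed or aggregated. A minimal sketch of that case, assuming the same StormEvents table used in the examples on this page: the rows are first collapsed with summarize, and the resulting count column (named sample_Count here only for illustration) is then passed as the weight so that each aggregated row counts for the number of original rows it represents.

```kusto
// Sketch only: pre-aggregate the rows, then weight each aggregated row by its count.
// sample_Count is an illustrative column name, not a required one.
StormEvents
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| summarize sample_Count = count() by State, EventType, Damage
| evaluate autocluster('~', sample_Count)   // '~' keeps the default SizeWeight
```

The '~' placeholder is needed because WeightColumn is the second positional argument and SizeWeight is left at its default.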

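The SizeWeight description refers to a weighted geometric mean. As a sketch only, writing w for SizeWeight, g for the normalized generic score, and s for the informative score (these symbols are introduced here for illustration and are not taken from the plugin documentation), a weighted geometric mean with weights w and 1 - w has the form:

$$
\text{score} = g^{\,w} \cdot s^{\,1-w}, \qquad 0 < w < 1
$$

With w close to 1 the generic (coverage) score dominates, and with w close to 0 the informative score dominates, which matches the behavior described for increasing or decreasing SizeWeight. How the plugin computes and normalizes the two scores is not specified here.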
Examples

Using autocluster

StormEvents 
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage
| evaluate autocluster(0.6)
| SegmentId | Count | Percent | State | EventType | Damage |
|---|---|---|---|---|---|
| 0 | 2278 | 38.7 | | Hail | NO |
| 1 | 512 | 8.7 | | Thunderstorm Wind | YES |
| 2 | 898 | 15.3 | TEXAS | | |
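Following the tip about drilling into an interesting row, the sketch below takes the pattern with State TEXAS (segment 2 above) and adds that value to the where filter before running autocluster again. It assumes the same StormEvents table and reuses the SizeWeight of 0.6 from the query above; the output is not shown because it depends on the data.

```kusto
// Sketch only: drill into the pattern where State is TEXAS.
StormEvents
| where monthofyear(StartTime) == 5
| where State == "TEXAS"          // specific value taken from the pattern
| extend Damage = iff(DamageCrops + DamageProperty > 0, "YES", "NO")
| project EventType, Damage       // State is now constant, so it is left out
| evaluate autocluster(0.6)
```

Dropping State from the projection keeps the plugin focused on the attributes that can still vary within the selected rows.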

Using custom wildcards

StormEvents 
| where monthofyear(StartTime) == 5
| extend Damage = iff(DamageCrops + DamageProperty > 0 , "YES" , "NO")
| project State , EventType , Damage 
| evaluate autocluster(0.2, '~', '~', '*')
| SegmentId | Count | Percent | State | EventType | Damage |
|---|---|---|---|---|---|
| 0 | 2278 | 38.7 | * | Hail | NO |
| 1 | 512 | 8.7 | * | Thunderstorm Wind | YES |
| 2 | 898 | 15.3 | TEXAS | * | * |

See also