# Using hll() and tdigest()

Suppose you want to calculate the count of distinct users every day over the last seven days. You can run `summarize dcount(user)` once a day with a span filtered to the last seven days. This method is inefficient, because each run of the calculation has a six-day overlap with the previous one. You can also calculate an aggregate for each day, and then combine these aggregates. This method requires you to "remember" the last six results, but it's much more efficient.

Partitioning queries as described is easy for simple aggregates, such as `count()` and `sum()`. It can also be useful for complex aggregates, such as `dcount()` and `percentiles()`. This topic explains how Kusto supports such calculations.
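
For the simple case, here's a minimal sketch, assuming a hypothetical `Events` table with a `Timestamp` column: each day's count is computed once, and the rolling total is just a cheap sum over the stored daily rows.

```kusto
// Hypothetical Events table: compute each day's partial count once.
Events
| summarize DailyCount = count() by Day = startofday(Timestamp)
// Combining the stored daily counts is a cheap sum over seven rows.
| where Day > ago(7d)
| summarize TotalLast7d = sum(DailyCount)
```

This trick doesn't carry over directly to `dcount()` or `percentiles()`, because distinct counts and percentiles of partitions can't simply be summed; that's the gap the hll and tdigest intermediate representations fill.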

The following examples show how to use hll/tdigest, and demonstrate that using these functions is highly performant in some scenarios:

**Note**

In some cases, the dynamic objects generated by the `hll` or `tdigest` aggregate functions may be big and exceed the default `MaxValueSize` property in the encoding policy. If so, the object will be ingested as null. For example, when persisting the output of the `hll` function with accuracy level 4, the size of the `hll` object exceeds the default `MaxValueSize` of 1 MB. To avoid this issue, modify the encoding policy of the column as shown in the following examples.

```kusto
range x from 1 to 1000000 step 1
| summarize hll(x,4)
| project sizeInMb = estimate_data_size(hll_x) / pow(1024,2)
```

|sizeInMb|
|---|
|1.0000524520874|

Ingesting this object into a table before applying this kind of policy will ingest null:

```kusto
.set-or-append MyTable <| range x from 1 to 1000000 step 1
| summarize hll(x,4)
```

```kusto
MyTable
| project isempty(hll_x)
```

|Column1|
|---|
|1|

To avoid ingesting null, use the special encoding policy type `bigobject`, which overrides the `MaxValueSize` to 2 MB, like this:

```kusto
.alter column MyTable.hll_x policy encoding type='bigobject'
```
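
To confirm that the policy took effect before re-ingesting, you can inspect it with the matching `.show` command (a quick sketch; the output shape may vary):

```kusto
.show column MyTable.hll_x policy encoding
```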

Ingesting a value now into the same table as above:

```kusto
.set-or-append MyTable <| range x from 1 to 1000000 step 1
| summarize hll(x,4)
```

ingests the second value successfully:

```kusto
MyTable
| project isempty(hll_x)
```

|Column1|
|---|
|1|
|0|

## Example

There is a table, PageViewsHllTDigest, containing `hll` values of pages viewed in each hour. You want these values binned to 12h. Merge the `hll` values using the `hll_merge()` aggregate function, with the timestamp binned to 12h. Use the `dcount_hll()` function to return the final `dcount` value:

```kusto
PageViewsHllTDigest
| summarize merged_hll = hll_merge(hllPage) by bin(Timestamp, 12h)
| project Timestamp , dcount_hll(merged_hll)
```

|Timestamp|dcount_hll_merged_hll|
|---|---|
|2016-05-01 12:00:00.0000000|20056275|
|2016-05-02 00:00:00.0000000|38797623|
|2016-05-02 12:00:00.0000000|39316056|
|2016-05-03 00:00:00.0000000|13685621|

To bin the timestamp to 1d:

```kusto
PageViewsHllTDigest
| summarize merged_hll = hll_merge(hllPage) by bin(Timestamp, 1d)
| project Timestamp , dcount_hll(merged_hll)
```

|Timestamp|dcount_hll_merged_hll|
|---|---|
|2016-05-01 00:00:00.0000000|20056275|
|2016-05-02 00:00:00.0000000|64135183|
|2016-05-03 00:00:00.0000000|13685621|

The same query may be done over the `tdigest` values, which represent the `BytesDelivered` in each hour:

```kusto
PageViewsHllTDigest
| summarize merged_tdigests = merge_tdigests(tdigestBytesDel) by bin(Timestamp, 12h)
| project Timestamp , percentile_tdigest(merged_tdigests, 95, typeof(long))
```

|Timestamp|percentile_tdigest_merged_tdigests|
|---|---|
|2016-05-01 12:00:00.0000000|170200|
|2016-05-02 00:00:00.0000000|152975|
|2016-05-02 12:00:00.0000000|181315|
|2016-05-03 00:00:00.0000000|146817|

## Example

With datasets that are too large, you reach Kusto limits if you run regular queries that calculate `percentile()` or `dcount()` over the full dataset; instead, you need to run periodic queries over the dataset.

To solve this problem, newly added data may be added to a temp table as `hll` or `tdigest` values, using `hll()` when the required operation is `dcount` and `tdigest()` when the required operation is `percentile`, via `set/append` commands or an update policy. In this case, the intermediate results of `dcount` or `tdigest` are saved into another dataset, which should be smaller than the large target one.

When you need the final results of these values, the queries may use the hll/tdigest mergers `hll_merge()`/`merge_tdigests()`. Then, after getting the merged values, `percentile_tdigest()`/`dcount_hll()` may be invoked on them to get the final result of `dcount` or percentiles.
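
As an illustration, here is a minimal sketch of how such an intermediate table could be populated with a set/append command. It assumes `PageViews` exposes the `Page` and `BytesDelivered` columns used elsewhere in this topic, and the `ago(1d)` filter stands in for whatever "newly added data" means in your pipeline:

```kusto
.set-or-append PageViewsHllTDigest <| PageViews
| where Timestamp > ago(1d)                       // only the newly arrived slice
| summarize hllPage = hll(Page),                  // intermediate state for dcount
            tdigestBytesDel = tdigest(BytesDelivered) // intermediate state for percentiles
  by Timestamp = bin(Timestamp, 1h)
```

An update policy that runs the same `summarize` at ingestion time achieves the same effect continuously.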

Assume there's a table, PageViews, into which data is ingested daily. Every day, you want to calculate the distinct count of pages viewed per minute later than `date = datetime(2016-05-01 18:00:00.0000000)`.

Run the following query:

```kusto
PageViews
| where Timestamp > datetime(2016-05-01 18:00:00.0000000)
| summarize percentile(BytesDelivered, 90), dcount(Page,2) by bin(Timestamp, 1d)
```

|Timestamp|percentile_BytesDelivered_90|dcount_Page|
|---|---|---|
|2016-05-01 00:00:00.0000000|83634|20056275|
|2016-05-02 00:00:00.0000000|82770|64135183|
|2016-05-03 00:00:00.0000000|72920|13685621|

This query will aggregate all the values every time it runs (for example, if you want to run it many times a day).

If you save the `hll` and `tdigest` values (which are the intermediate results of `dcount` and `percentile`) into a temp table, PageViewsHllTDigest, using an update policy or `set/append` commands, you may merge only the values and then use `dcount_hll`/`percentile_tdigest` with the following query:

```kusto
PageViewsHllTDigest
| summarize  percentile_tdigest(merge_tdigests(tdigestBytesDel), 90), dcount_hll(hll_merge(hllPage)) by bin(Timestamp, 1d)
```

|Timestamp|percentile_tdigest_merge_tdigests_tdigestBytesDel|dcount_hll_hll_merge_hllPage|
|---|---|---|
|2016-05-01 00:00:00.0000000|84224|20056275|
|2016-05-02 00:00:00.0000000|83486|64135183|
|2016-05-03 00:00:00.0000000|72247|13685621|

This query should be more performant, as it runs over a smaller table. In this example, the first query runs over ~215M records, while the second one runs over just 32 records.

## Example

The retention query. Assume you have a table that summarizes when each Wikipedia page was viewed (sample size is 10M), and for each pair of dates `date1` and `date2` you want to find the percentage of pages viewed on both `date1` and `date2` relative to the pages viewed on `date1` (`date1 < date2`).

The trivial way uses the join and summarize operators:

```kusto
// Get the total pages viewed each day
let totalPagesPerDay = PageViewsSample
| summarize by Page, Day = startofday(Timestamp)
| summarize count() by Day;
// Join the table to itself to get a grid where
// each row shows, for each page, the two dates
// on which it was viewed.
// Then count the pages between each two dates to
// get how many pages were viewed between date1 and date2.
PageViewsSample
| summarize by Page, Day1 = startofday(Timestamp)
| join kind = inner
(
    PageViewsSample
    | summarize by Page, Day2 = startofday(Timestamp)
)
on Page
| where Day2 > Day1
| summarize count() by Day1, Day2
| join kind = inner
    totalPagesPerDay
on $left.Day1 == $right.Day
| project Day1, Day2, Percentage = count_*100.0/count_1
```

|Day1|Day2|Percentage|
|---|---|---|
|2016-05-01 00:00:00.0000000|2016-05-02 00:00:00.0000000|34.0645725975255|
|2016-05-01 00:00:00.0000000|2016-05-03 00:00:00.0000000|16.618368960101|
|2016-05-02 00:00:00.0000000|2016-05-03 00:00:00.0000000|14.6291376489636|

The above query took ~18 seconds.

When you use the `hll()`, `hll_merge()`, and `dcount_hll()` functions, the equivalent query ends after ~1.3 seconds, showing that the hll functions speed up the query above by a factor of ~14:

```kusto
let Stats=PageViewsSample | summarize pagehll=hll(Page, 2) by day=startofday(Timestamp); // saving the hll values (intermediate results of the dcount values)
let day0=toscalar(Stats | summarize min(day)); // finding the min date over all dates.
let dayn=toscalar(Stats | summarize max(day)); // finding the max date over all dates.
let daycount=tolong((dayn-day0)/1d); // finding the range between max and min
Stats
| project idx=tolong((day-day0)/1d), day, pagehll
| mv-expand pidx=range(0, daycount) to typeof(long)
// Extend the column to get the dcount value from hll'ed values for each date (same as totalPagesPerDay from the above query)
| extend key1=iff(idx < pidx, idx, pidx), key2=iff(idx < pidx, pidx, idx), pages=dcount_hll(pagehll)
// For each two dates, merge the hll'ed values to get the total dcount over the two dates.
// This helps to get the pages viewed in both date1 and date2 (see the description below about the intersection_size)
| summarize (day1, pages1)=arg_min(day, pages), (day2, pages2)=arg_max(day, pages), union_size=dcount_hll(hll_merge(pagehll)) by key1, key2
| where day2 > day1
// To get pages viewed on both date1 and date2, take the merged dcount of date1 and date2 and subtract it from pages on date1 + pages on date2.
| project pages1, day1,day2, intersection_size=(pages1 + pages2 - union_size)
| project day1, day2, Percentage = intersection_size*100.0 / pages1
```

|day1|day2|Percentage|
|---|---|---|
|2016-05-01 00:00:00.0000000|2016-05-02 00:00:00.0000000|33.2298494510578|
|2016-05-01 00:00:00.0000000|2016-05-03 00:00:00.0000000|16.9773830213667|
|2016-05-02 00:00:00.0000000|2016-05-03 00:00:00.0000000|14.5160020350006|
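
The `intersection_size` step is just the inclusion-exclusion principle applied to the hll counts: `|A ∩ B| = |A| + |B| - |A ∪ B|`, so `intersection_size = pages1 + pages2 - union_size`, where `union_size` is the distinct count over both days obtained by merging their hll values with `hll_merge()` and applying `dcount_hll()`.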

**Note**

The results of the queries are not 100% accurate due to the error of the hll functions. For more information about the errors, see `dcount()`.