dcount() (aggregation function)
Returns an estimate of the number of distinct values taken by a scalar expression in the summary group.
Note
The dcount() aggregation function is primarily useful for estimating the cardinality of huge sets. It trades accuracy for performance, and may return a result that varies between executions. The order of inputs may have an effect on its output.
Syntax
... | summarize dcount(Expr [, Accuracy]) ...
Arguments
- Expr: A scalar expression whose distinct values are to be counted.
- Accuracy: An optional int literal that defines the requested estimation accuracy. See the Estimation accuracy section for supported values. If unspecified, the default value 1 is used.
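As a minimal sketch of passing the optional Accuracy argument, the query below requests accuracy level 2 instead of the default 1. The table PageViewLog and its columns country and continent are borrowed from this page's example and are assumed to exist.

```kusto
// Same aggregation as the example on this page, but with Accuracy = 2
// (lower expected error, more memory per group).
PageViewLog
| summarize countries = dcount(country, 2) by continent
```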
Returns
Returns an estimate of the number of distinct values of Expr in the group.
Example
PageViewLog | summarize countries=dcount(country) by continent
Get an exact count of distinct values of V grouped by G.
T | summarize by V, G | summarize count() by G
This calculation requires a great amount of internal memory, since the number of distinct values of V is multiplied by the number of distinct values of G.
It may result in memory errors or long execution times.
dcount() provides a fast and reliable alternative:
T | summarize dcount(V) by G
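To compare the two approaches side by side, the sketch below builds a small synthetic table and joins the exact per-group count with the dcount() estimate. The input data and the names x, Exact, Estimate, exact, and estimate are hypothetical and chosen only for illustration.

```kusto
// Hypothetical input: 100,000 rows, V in 0..999, G = last digit of V.
let T = range x from 1 to 100000 step 1
    | extend V = x % 1000, G = x % 10;
// Exact per-group distinct count, using the two-step pattern described above.
let Exact = T | summarize by V, G | summarize exact = count() by G;
// Estimated per-group distinct count.
let Estimate = T | summarize estimate = dcount(V) by G;
Exact
| join kind=inner (Estimate) on G
| project G, exact, estimate
| order by G asc
```

By construction, each of the ten groups contains 100 distinct values of V, so the exact column should read 100 in every row. The estimate column shows how closely dcount() tracks that figure.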
Estimation accuracy
The dcount() aggregate function uses a variant of the HyperLogLog (HLL) algorithm, which does a stochastic estimation of set cardinality. The algorithm provides a "knob" that can be used to balance accuracy and execution time per memory size:
Accuracy | Error (%) | Entry count |
---|---|---|
0 | 1.6 | 2^12 |
1 | 0.8 | 2^14 |
2 | 0.4 | 2^16 |
3 | 0.28 | 2^17 |
4 | 0.2 | 2^18 |
Note
The "entry count" column is the number of 1-byte counters in the HLL implementation.
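As a consistency check, the error column lines up with the usual rule of thumb for the standard HyperLogLog estimator, where the relative standard error depends only on the number of counters m. This page only says that dcount() uses a variant of HLL, so the formula below is an assumption about the standard algorithm rather than a statement about the implementation.

```latex
\sigma \approx \frac{1.04}{\sqrt{m}}
\qquad\Longrightarrow\qquad
m = 2^{12}:\ \sigma \approx \frac{1.04}{64} \approx 1.6\%,
\qquad
m = 2^{14}:\ \sigma \approx \frac{1.04}{128} \approx 0.8\%
```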
The algorithm includes some provisions for doing a perfect count (zero error), if the set cardinality is small enough:
- When the accuracy level is 1, a perfect count is returned for up to 1000 values
- When the accuracy level is 2, a perfect count is returned for up to 8000 values
The error bound is probabilistic, not a theoretical bound. The Error (%) value is the standard deviation of the error distribution (the sigma), and 99.7% of the estimations will have a relative error of under 3 x sigma.
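As a worked example of the 3 x sigma rule, take the default accuracy level 1, whose sigma is listed as 0.8%. The true cardinality N of one million is a hypothetical figure chosen for illustration, and \hat{N} denotes the value returned by dcount().

```latex
3\sigma = 3 \times 0.8\% = 2.4\%,
\qquad
N = 1{,}000{,}000
\ \Longrightarrow\
\lvert \hat{N} - N \rvert < 0.024 \times 1{,}000{,}000 = 24{,}000
\quad \text{for about } 99.7\% \text{ of estimations}
```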
The following image shows the probability distribution function of the relative estimation error, in percentages, for all supported accuracy settings: