approx_count_distinct
aggregate function
Applies to: Databricks SQL Databricks Runtime
Returns the estimated number of distinct values in expr
within the group.
The implementation uses the dense version of the HyperLogLog++ (HLL++) algorithm, a state of the art cardinality estimation algorithm.
Results are accurate within a default value of 5%, which derives from the value
of the maximum relative standard deviation, although this is configurable with
the relativeSD
parameter as mentioned below.
Syntax
approx_count_distinct(expr[, relativeSD]) [FILTER ( WHERE cond ) ]
This function can also be invoked as a window function using the OVER
clause.
Arguments
expr
: Can be of any type for which equivalence is defined.relativeSD
: Defines the maximum relative standard deviation allowed.cond
: An optional boolean expression filtering the rows used for aggregation.
Returns
A BIGINT.
Examples
> SELECT approx_count_distinct(col1) FROM VALUES (1), (1), (2), (2), (3) tab(col1);
3
> SELECT approx_count_distinct(col1) FILTER(WHERE col2 = 10)
FROM VALUES (1, 10), (1, 10), (2, 10), (2, 10), (3, 10), (1, 12) AS tab(col1, col2);
3