count_min_sketch aggregate function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

Returns a count-min sketch of all values in the group in column with the epsilon, confidence and seed.

In Databricks SQL and Databricks Runtime 13.3 LTS and above this function supports named parameter invocation.

Syntax

count_min_sketch ( [ALL | DISTINCT] column, epsilon, confidence, seed ) [FILTER ( WHERE cond ) ]

This function can also be invoked as a window function using the OVER clause.

Arguments

  • column: An expression that evaluates to an integral numeric, STRING, or BINARY.
  • epsilon: A DOUBLE literal greater than 0 describing the relative error.
  • confidence: A DOUBLE literal greater than 0 and less than 1.
  • seed: An INTEGER literal.
  • cond: An optional boolean expression filtering the rows used for aggregation.

Returns

A BINARY.

Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.

If DISTINCT is specified the function operates only on a unique set of expr values.

Examples

-- Named parameter invocation
> SELECT hex(count_min_sketch(column => col, confidence => 0.5d, epsilon => 0.5d, seed => 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000

> SELECT hex(count_min_sketch(DISTINCT col, 0.5d, 0.5d, 1)) FROM VALUES (1), (2), (1) AS tab(col);
0000000100000000000000020000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000010000000000000000