GROUP BY 子句
适用于: Databricks SQL Databricks Runtime
使用 GROUP BY
子句基于一组指定的分组表达式对行进行分组,并基于一个或多个指定的聚合函数对行组计算聚合。
Databricks SQL 还支持高级聚合,以通过 GROUPING SETS
、CUBE
和 ROLLUP
子句对同一输入记录集执行多次聚合。
分组表达式和高级聚合可以在 GROUP BY
子句中混合使用,并在 GROUPING SETS
子句中嵌套使用。
请参阅混合/嵌套分组分析部分中的详细信息。
将 FILTER
子句附加到聚合函数时,只会将匹配的行传递给该函数。
语法
GROUP BY ALL
GROUP BY group_expression [, ...] [ WITH ROLLUP | WITH CUBE ]
GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } ( grouping_set [, ...] ) } [, ...]
grouping_set
{ expression |
( [ expression [, ...] ] ) }
而聚合函数被定义为
aggregate_name ( [ DISTINCT ] expression [, ...] ) [ FILTER ( WHERE boolean_expression ) ]
参数
ALL
适用于: Databricks SQL Databricks Runtime 12.2 LTS 及更高版本
一种简写表示法,用于将所有不包含聚合函数的
SELECT
列表表达式作为group_expression
添加。 如果不存在此类表达式,则GROUP BY ALL
等同于省略导致全局聚合的GROUP BY
子句。GROUP BY ALL
不保证生成一组可以解析的组表达式。 如果生成的子句格式不正确,Azure Databricks 将引发 UNRESOLVED_ALL_IN_GROUP_BY 或 MISSING_AGGREGATION。group_expression
指定将行分组到一起的条件。 基于分组表达式的结果值执行行的分组。 分组表达式可以是列别名称(如
GROUP BY a
)、列位置(如GROUP BY 0
)或表达式(如GROUP BY a + b
)。 如果group_expression
包含聚合函数,则 Azure Databricks 会引发 GROUP_BY_AGGREGATE 错误。grouping_set
分组集由括号中的零个或多个逗号分隔的表达式指定。 如果分组集只有一个元素,则可以省略括号。 例如,
GROUPING SETS ((a), (b))
和GROUPING SETS (a, b)
相同。GROUPING SETS
对
GROUPING SETS
之后指定的每个分组集的行进行分组。 例如:GROUP BY GROUPING SETS ((warehouse), (product))
在语义上等效于GROUP BY warehouse
和GROUP BY product
的结果的并集。该子句是
UNION ALL
的简写形式,其中UNION ALL
运算符的每个段执行GROUPING SETS
子句中指定的每个分组集的聚合。同样,
GROUP BY GROUPING SETS ((warehouse, product), (product), ())
在语义上等效于GROUP BY warehouse, product
、GROUP BY product
和全局聚合的结果的并集。
注意
对 Hive 兼容 Databricks SQL,允许 GROUP BY ... GROUPING SETS (...)
。 通常会忽略 GROUP BY
表达式,但如果它们包含 GROUPING SETS
表达式以外的其他表达式,则额外的表达式将包含在分组表达式中,并且值始终为 null。 例如,SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b)
,列 c 的输出始终为 null。
ROLLUP
在一个语句中指定多个级别的聚合。 此子句用于基于多个分组集计算聚合。
ROLLUP
是GROUPING SETS
的速记。 例如:GROUP BY warehouse, product WITH ROLLUP
或GROUP BY ROLLUP(warehouse, product)
相当于GROUP BY GROUPING SETS((warehouse, product), (warehouse), ())
。当
GROUP BY ROLLUP(warehouse, product, (warehouse, location))
相当于
GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ())
。ROLLUP
规范的 N 个元素将得到 N+1 个GROUPING SETS
。CUBE
CUBE
子句用于根据GROUP BY
子句中指定的分组列的组合执行聚合。CUBE
是GROUPING SETS
的速记。 例如:GROUP BY warehouse, product WITH CUBE
或GROUP BY CUBE(warehouse, product)
相当于GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ())
。GROUP BY CUBE(warehouse, product, (warehouse, location))
等效于以下查询:GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse), (product), (warehouse, product), ())
CUBE
规范的 N 个元素将得到 2^N 个GROUPING SETS
。aggregate_name
聚合函数名称(MIN、MAX、COUNT、SUM、AVG 等)。
DISTINCT
在将输入行传递给聚合函数之前,将其中的重复项删除。
FILTER
筛选
WHERE
子句中boolean_expression
计算结果为 true 的输入行,并将其传递给聚合函数;将放弃其他行。
混合/嵌套的分组分析
GROUP BY
子句可以包括多个 group_expressions 和多个 CUBE
、ROLLUP
和 GROUPING SETS
。
GROUPING SETS
还可以具有嵌套 CUBE
、ROLLUP
或 GROUPING SETS
的子句。 例如:
GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location)), GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))
CUBE
和 ROLLUP
只是 GROUPING SETS
的语法。
请参阅上述部分,了解如何将 CUBE
和 ROLLUP
转换为 GROUPING SETS
。
group_expression
可以在此上下文中视为单个组的 GROUPING SETS
。
对于 GROUP BY
子句中的多个 GROUPING SETS
,Databricks SQL 通过执行原始的 GROUPING SETS
的叉积生成单个 GROUPING SETS
。
对于嵌套在 GROUPING SETS
子句中的 GROUPING SETS
,Databricks SQL 采用其分组集并将其条带化。 例如,以下查询:
GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ());
GROUP BY warehouse, ROLLUP(product), CUBE(location, size);
等效于以下查询:
GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size), (warehouse))
当 GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))
相当于 GROUP BY GROUPING SETS((warehouse), (warehouse, product))
。
示例
CREATE TEMP VIEW dealer (id, city, car_model, quantity) AS
VALUES (100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);
-- Sum of quantity per dealership. Group by `id`.
> SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
id sum(quantity)
--- -------------
100 32
200 33
300 13
-- Use column position in GROUP by clause.
> SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
id sum(quantity)
--- -------------
100 32
200 33
300 13
-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
> SELECT id, sum(quantity) AS sum, max(quantity) AS max
FROM dealer GROUP BY id ORDER BY id;
id sum max
--- --- ---
100 32 15
200 33 20
300 13 8
-- Count the number of distinct dealers in cities per car_model.
> SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
car_model count
------------ -----
Honda Civic 3
Honda CRV 2
Honda Accord 3
-- Count the number of distinct dealers in cities per car_model, using GROUP BY ALL
> SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY ALL;
car_model count
------------ -----
Honda Civic 3
Honda CRV 2
Honda Accord 3
-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
> SELECT id,
sum(quantity) FILTER (WHERE car_model IN ('Honda Civic', 'Honda CRV')) AS `sum(quantity)`
FROM dealer
GROUP BY id ORDER BY id;
id sum(quantity)
--- -------------
100 17
200 23
300 5
-- Aggregations using multiple sets of grouping columns in a single statement.
-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. Empty grouping set. Returns quantities for all city and car models.
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city;
city car_model sum
--------- ------------ ---
null null 78
null HondaAccord 33
null HondaCRV 10
null HondaCivic 35
Dublin null 33
Dublin HondaAccord 10
Dublin HondaCRV 3
Dublin HondaCivic 20
Fremont null 32
Fremont HondaAccord 15
Fremont HondaCRV 7
Fremont HondaCivic 10
San Jose null 13
San Jose HondaAccord 8
San Jose HondaCivic 5
-- Group by processing with `ROLLUP` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY city, car_model WITH ROLLUP
ORDER BY city, car_model;
city car_model sum
--------- ------------ ---
null null 78
Dublin null 33
Dublin HondaAccord 10
Dublin HondaCRV 3
Dublin HondaCivic 20
Fremont null 32
Fremont HondaAccord 15
Fremont HondaCRV 7
Fremont HondaCivic 10
San Jose null 13
San Jose HondaAccord 8
San Jose HondaCivic 5
-- Group by processing with `CUBE` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
> SELECT city, car_model, sum(quantity) AS sum
FROM dealer
GROUP BY city, car_model WITH CUBE
ORDER BY city, car_model;
city car_model sum
--------- ------------ ---
null null 78
null HondaAccord 33
null HondaCRV 10
null HondaCivic 35
Dublin null 33
Dublin HondaAccord 10
Dublin HondaCRV 3
Dublin HondaCivic 20
Fremont null 32
Fremont HondaAccord 15
Fremont HondaCRV 7
Fremont HondaCivic 10
San Jose null 13
San Jose HondaAccord 8
San Jose HondaCivic 5
--Prepare data for ignore nulls example
> CREATE TEMP VIEW person (id, name, age) AS
VALUES (100, 'Mary', NULL),
(200, 'John', 30),
(300, 'Mike', 80),
(400, 'Dan' , 50);
--Select the first row in column age
> SELECT FIRST(age) FROM person;
first(age, false)
--------------------
NULL
--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
> SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
first(age, true) last(id, false) sum(id)
------------------- ------------------ ----------
30 400 1000