SelectSelect

SELECT [hints, ...] [ALL|DISTINCT] named_expression[, named_expression, ...]
  FROM relation[, relation, ...]
  [lateral_view[, lateral_view, ...]]
  [WHERE boolean_expression]
  [aggregation [HAVING boolean_expression]]
  [ORDER BY sort_expressions]
  [CLUSTER BY expressions]
  [DISTRIBUTE BY expressions]
  [SORT BY sort_expressions]
  [WINDOW named_window[, WINDOW named_window, ...]]
  [LIMIT num_rows]

named_expression:
  : expression [AS alias]

relation:
  | join_relation
  | (table_name|query|relation) [sample] [AS alias]
  : VALUES (expressions)[, (expressions), ...]
        [AS (column_name[, column_name, ...])]

expressions:
  : expression[, expression, ...]

sort_expressions:
  : expression [ASC|DESC][, expression [ASC|DESC], ...]

从一个或多个关系输出数据。Output data from one or more relations.

关系指的是输入数据的任何来源。A relation refers to any source of input data. 它可以是现有表(或视图)的内容、两个现有表的联接结果,或者是子查询(另一 SELECT 语句的结果)。It could be the contents of an existing table (or view), the joined result of two existing tables, or a subquery (the result of another SELECT statement).

ALL

选择关系中的所有匹配行。Select all matching rows from the relation. 默认情况下启用。Enabled by default.

DISTINCT

选择关系中的所有匹配行,然后删除重复的结果。Select all matching rows from the relation then remove duplicate results.

WHERE

按谓词筛选行。Filter rows by predicate.

HAVING

按谓词筛选分组后的结果。Filter grouped result by predicate.

ORDER BY

对一组表达式施加总计排序。Impose total ordering on a set of expressions. 默认排序方向为升序。Default sort direction is ascending. 不能将此项与 SORT BYCLUSTER BYDISTRIBUTE BY 一起使用。You cannot use this with SORT BY, CLUSTER BY, or DISTRIBUTE BY.

DISTRIBUTE BY

基于一组表达式对关系中的行重新分区。Repartition rows in the relation based on a set of expressions. 具有相同表达式值的行将哈希化到同一个工作器。Rows with the same expression values will be hashed to the same worker. 不能将此项与 ORDER BYCLUSTER BY 一起使用。You cannot use this with ORDER BY or CLUSTER BY.

SORT BY

对每个分区中的一组表达式施加排序。Impose ordering on a set of expressions within each partition. 默认排序方向为升序。Default sort direction is ascending. 不能将此项与 ORDER BYCLUSTER BY 一起使用。You cannot use this with ORDER BY or CLUSTER BY.

CLUSTER BY

基于一组表达式对关系中的行重新分区,并根据表达式以升序对行排序。Repartition rows in the relation based on a set of expressions and sort the rows in ascending order based on the expressions. 换句话说,这是 DISTRIBUTE BYSORT BY 的简写,其中所有表达式都按升序排序。In other words, this is a shorthand for DISTRIBUTE BY and SORT BY where all expressions are sorted in ascending order. 不能将此项与 ORDER BYDISTRIBUTE BYSORT BY 一起使用。You cannot use this with ORDER BY, DISTRIBUTE BY, or SORT BY.

WINDOW

将标识符分配给窗口规范。Assign an identifier to a window specification. 请参阅窗口函数See Window functions.

LIMIT

限制返回的行数。Limit the number of rows returned.

VALUES

显式指定值,而不是从关系中读取它们。Explicitly specify values instead of reading them from a relation.

示例Examples

SELECT * FROM boxes
SELECT width, length FROM boxes WHERE height=3
SELECT DISTINCT width, length FROM boxes WHERE height=3 LIMIT 2
SELECT * FROM VALUES (1, 2, 3) AS (width, length, height)
SELECT * FROM VALUES (1, 2, 3), (2, 3, 4) AS (width, length, height)
SELECT * FROM boxes ORDER BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
SELECT * FROM boxes CLUSTER BY length

Delta 表Delta tables

  • 可以将表指定为 delta.<path-to-table><table-name>You can specify a table as delta.<path-to-table> or <table-name>.
  • 还可以使用 TIMESTAMP AS OFVERSION AS OF@ 语法在表标识符后指定“按时间顺序查看”版本。You can specify a time travel version after the table identifier using TIMESTAMP AS OF, VERSION AS OF, or @ syntax. 有关详细信息,请参阅查询表的旧快照(按时间顺序查看)See Query an older snapshot of a table (time travel) for details.

示例Examples

SELECT * FROM delta.`/mnt/delta/events` TIMESTAMP AS OF '2019-10-18T22:15:12.013Z'
SELECT * FROM events VERSION AS OF 5

表示例Table sample

sample:
    | TABLESAMPLE ([integer_expression | decimal_expression] PERCENT)
    : TABLESAMPLE (integer_expression ROWS)

对输入数据采样。Sample the input data. 可表示为百分比(必须介于 0 到 100 之间)或固定数量的输入行。Express in terms of either a percentage (must be between 0 and 100) or a fixed number of input rows.

示例Examples

SELECT * FROM boxes TABLESAMPLE (3 ROWS)
SELECT * FROM boxes TABLESAMPLE (25 PERCENT)

联接Join

join_relation:
    | relation join_type JOIN relation [ON boolean_expression | USING (column_name, column_name) ]
    : relation NATURAL join_type JOIN relation
join_type:
    | INNER
    | [LEFT | RIGHT] SEMI
    | [LEFT | RIGHT | FULL] [OUTER]
    : [LEFT] ANTI
  • INNER JOININNER JOIN

    从两个关系中选择存在匹配项的所有行。Select all rows from both relations where there is match.

  • OUTER JOINOUTER JOIN

    从两个关系中选择所有行,并在没有匹配项的那一侧填充 NULL 值。Select all rows from both relations, filling with null values on the side that does not have a match.

  • SEMI JOINSEMI JOIN

    仅从存在匹配项的 SEMI JOIN 侧选择行。Select only rows from the side of the SEMI JOIN where there is a match. 如果一个行与多个行匹配,则仅返回第一个匹配项。If one row matches multiple rows, only the first match is returned.

  • LEFT ANTI JOINLEFT ANTI JOIN

    从左侧选择在右侧没有匹配行的行。Select only rows from the left side that match no rows on the right side.

示例Examples

SELECT * FROM boxes INNER JOIN rectangles ON boxes.width = rectangles.width
SELECT * FROM boxes FULL OUTER JOIN rectangles USING (width, length)
SELECT * FROM boxes NATURAL JOIN rectangles

Lateral viewLateral view

lateral_view:
    : LATERAL VIEW [OUTER] function_name (expressions)
          table_name [AS (column_name[, column_name, ...])]

使用表生成函数为每个输入行生成零个或多个输出行。Generate zero or more output rows for each input row using a table-generating function. LATERAL VIEW 一起使用的最常见的内置函数是 explodeThe most common built-in function used with LATERAL VIEW is explode.

LATERAL VIEW OUTER

即使函数返回零行,也会生成一个包含 NULL 值的行。Generate a row with null values even when the function returned zero rows.

示例Examples

SELECT * FROM boxes LATERAL VIEW explode(Array(1, 2, 3)) my_view
SELECT name, my_view.grade FROM students LATERAL VIEW OUTER explode(grades) my_view AS grade

Group byGroup by

aggregation:
    : GROUP BY expressions [WITH ROLLUP | WITH CUBE | GROUPING SETS (expressions)]

使用一个或多个聚合函数按一组表达式进行分组。Group by a set of expressions using one or more aggregate functions. 常见的内置聚合函数包括 count、avg、min、max 和 sum。Common built-in aggregate functions include count, avg, min, max, and sum.

ROLLUP

在指定表达式的每个层次结构级别创建一个分组集。Create a grouping set at each hierarchical level of the specified expressions. 例如,GROUP BY a, b, c WITH ROLLUP 等效于 GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ())For instance, For instance, GROUP BY a, b, c WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ()). 分组集的总数将是 N + 1,其中,N 是组表达式的数目。The total number of grouping sets will be N + 1, where N is the number of group expressions.

CUBE

为指定表达式的集的每个可能组合创建一个分组集。Create a grouping set for each possible combination of set of the specified expressions. 例如,GROUP BY a, b, c WITH CUBE 等效于 GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())For instance, GROUP BY a, b, c WITH CUBE is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ()). 分组集的总数将是 2^N,其中,N 是组表达式的数目。The total number of grouping sets will be 2^N, where N is the number of group expressions.

GROUPING SETS

针对分组集中指定的组表达式的每个子集,执行分组依据操作。Perform a group by for each subset of the group expressions specified in the grouping sets. 例如,GROUP BY x, y GROUPING SETS (x, y) 等效于 GROUP BY x 的结果与 GROUP BY y 的结果的联合。For instance, GROUP BY x, y GROUPING SETS (x, y) is equivalent to the result of GROUP BY x unioned with that of GROUP BY y.

示例Examples

SELECT height, COUNT(*) AS num_rows FROM boxes GROUP BY height
SELECT width, AVG(length) AS average_length FROM boxes GROUP BY width
SELECT width, length, height FROM boxes GROUP BY width, length, height WITH ROLLUP
SELECT width, length, avg(height) FROM boxes GROUP BY width, length GROUPING SETS (width, length)

开窗函数Window functions

window_expression:
  : expression OVER window_spec

named_window:
  : window_identifier AS window_spec

window_spec:
  | window_identifier
  : ( [PARTITION | DISTRIBUTE] BY expressions
        [[ORDER | SORT] BY sort_expressions] [window_frame])

window_frame:
  | [RANGE | ROWS] frame_bound
  : [RANGE | ROWS] BETWEEN frame_bound AND frame_bound

frame_bound:
  | CURRENT ROW
  | UNBOUNDED [PRECEDING | FOLLOWING]
  : expression [PRECEDING | FOLLOWING]

基于一系列输入行来计算结果。Compute a result over a range of input rows. 窗口化表达式是使用 OVER 关键字指定的,后面跟有窗口的标识符(使用 WINDOW 关键字进行定义)或窗口的规范。A windowed expression is specified using the OVER keyword, which is followed by either an identifier to the window (defined using the WINDOW keyword) or the specification of a window.

PARTITION BY

指定哪些行将位于同一分区中,其别名为 DISTRIBUTE BYSpecify which rows will be in the same partition, aliased by DISTRIBUTE BY.

ORDER BY

指定如何对窗口分区中的行排序,其别名为 SORT BYSpecify how rows within a window partition are ordered, aliased by SORT BY.

RANGE bound

将窗口的大小表示为表达式的值范围。Express the size of the window in terms of a value range for the expression.

ROWS bound

将窗口的大小表示为当前行之前和/或之后的行数。Express the size of the window in terms of the number of rows before and/or after the current row.

CURRENT ROW

使用当前行作为边界。Use the current row as a bound.

UNBOUNDED

使用负无穷作为下限,或使用无穷大作为上限。Use negative infinity as the lower bound or infinity as the upper bound.

PRECEDING

如果与 RANGE 边界一起使用,则这将定义值范围的下限。If used with a RANGE bound, this defines the lower bound of the value range. 如果与 ROWS 边界一起使用,则这将确定当前行之前要保留在窗口中的行数。If used with a ROWS bound, this determines the number of rows before the current row to keep in the window.

FOLLOWING

如果与 RANGE 边界一起使用,则这将定义值范围的上限。If used with a RANGE bound, this defines the upper bound of the value range. 如果与 ROWS 边界一起使用,则这将确定当前行之后要保留在窗口中的行数。If used with a ROWS bound, this determines the number of rows after the current row to keep in the window.

提示Hints

hints:
  : /*+ hint[, hint, ...] */
  hint:
    : hintName [(expression[, expression, ...])]

可以使用提示来提高查询的性能。You use hints improve the performance of a query. 例如,你可以提示某个表足够小,可以广播,这将加快联接速度。For example, you can hint that a table is small enough to be broadcast, which would speed up joins.

可以将一个或多个提示添加到 /*+ ... */ 注释块内的 SELECT 语句。You add one or more hints to a SELECT statement inside /*+ ... */ comment blocks. 可以在同一个注释块中指定多个提示,在这种情况下,提示以逗号分隔,并且可以有多个这样的注释块。You can specify multiple hints inside the same comment block, in which case the hints are separated by commas, and there can be multiple such comment blocks. 提示有一个名称(例如 BROADCAST),并接受 0 个或 0 个以上的参数。A hint has a name (for example, BROADCAST) and accepts 0 or more parameters.

示例Examples

SELECT /*+ BROADCAST(customers) */ * FROM customers, orders WHERE o_custId = c_custId
SELECT /*+ SKEW('orders') */ * FROM customers, orders WHERE o_custId = c_custId
SELECT /*+ SKEW('orders'), BROADCAST(demographic) */ * FROM orders, customers, demographic WHERE o_custId = c_custId AND c_demoId = d_demoId

Azure Databricks 上的 Delta Lake 有关 SKEW 提示的详细信息,请参阅倾斜联接优化Delta Lake on Azure Databricks See Skew Join optimization for more information about the SKEW hint.