NULL 语义NULL semantics

表包含一组行,每行包含一组列。A table consists of a set of rows and each row contains a set of columns. 列与数据类型相关联,表示实体的特定属性(例如,age 是名为 person 的实体的列)。A column is associated with a data type and represents a specific attribute of an entity (for example, age is a column of an entity called person). 有时,特定于某行的列的值在该行存在时是未知的。Sometimes, the value of a column specific to a row is not known at the time the row comes into existence. SQL 中,此类值表示为 NULLIn SQL, such values are represented as NULL. 本部分详细介绍了 NULL 值在各种运算符、表达式和其他 SQL 构造中的语义。This section details the semantics of NULL values handling in various operators, expressions and other SQL constructs.

以下内容说明了名为 person 的表的架构布局和数据。The following illustrates the schema layout and data of a table named person. 该数据包含 age 列中的 NULL 值,下面各部分中的各种示例中都使用了该表。The data contains NULL values in the age column and this table is used in various examples in the sections below.

|Id |Name|Age|
|---|----|---|
|100|Joe|30|
|200|Marry|NULL|
|300|Mike|18|
|400|Fred|50|
|500|Albert|NULL|
|600|Michelle|30|
|700|Dan|50|

比较运算符Comparison operators

Apache Spark 支持标准比较运算符,例如 >>==<<=Apache Spark supports the standard comparison operators such as >, >=, =, < and <=. 当其中一个操作数或两个操作数未知或为 NULL 时,这些运算符的结果未知或为 NULLThe result of these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL. 为了比较 NULL 值以确定结果是否相等,Spark 提供了对 null 安全的“等于”运算符 (<=>),当其中一个操作数为 NULL 时返回 False,当两个操作数均为 NULL 时返回 TrueIn order to compare the NULL values for equality, Spark provides a null-safe equal operator (<=>), which returns False when one of the operand is NULL and returns True when both the operands are NULL. 下表说明了当其中一个或两个操作数为 NULL 时比较运算符的行为:The following table illustrates the behavior of comparison operators when one or both operands are NULL:

左操作数Left Operand 右操作数Right Operand

=
= < <= <=>
NullNULL 任何值Any value NullNULL NullNULL NullNULL NullNULL NullNULL FalseFalse
任何值Any value NullNULL NullNULL NullNULL NullNULL NullNULL NullNULL FalseFalse
NullNULL NullNULL NullNULL NullNULL NullNULL NullNULL NullNULL TrueTrue

示例Examples

-- Normal comparison operators return `NULL` when one of the operand is `NULL`.
SELECT 5 > null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
SELECT null = null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator return `False` when one of the operand is `NULL`
SELECT 5 <=> null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

-- Null-safe equal operator return `True` when one of the operand is `NULL`
SELECT NULL <=> NULL;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

逻辑运算符Logical operators

Spark 支持标准逻辑运算符,例如 ANDORNOTSpark supports standard logical operators such as AND, OR and NOT. 这些运算符采用 Boolean 表达式作为参数,并返回 Boolean 值。These operators take Boolean expressions as the arguments and return a Boolean value.

下表说明了当其中一个或两个操作数为 NULL 时逻辑运算符的行为。The following tables illustrate the behavior of logical operators when one or both operands are NULL.

左操作数Left Operand 右操作数Right Operand OR ANDAND
正确True NullNULL TrueTrue NullNULL
FalseFalse NullNULL NullNULL FalseFalse
NullNULL TrueTrue TrueTrue NullNULL
NullNULL FalseFalse NullNULL FalseFalse
NullNULL NullNULL NullNULL NullNULL
操作数operand NOTNOT
NullNULL NullNULL

示例Examples

-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
SELECT (true OR null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
SELECT (null OR false) AS expression_output
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator returns `False` when one of the operands is `NULL`
SELECT NOT(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

表达式Expressions

在 Spark 中,比较运算符和逻辑运算符被看作是表达式。The comparison operators and logical operators are treated as expressions in Spark. 除了这两种表达式,Spark 还支持其他形式的表达式,例如函数表达式、强制转换表达式等。Spark 中的表达式大致可分为:Other than these two kinds of expressions, Spark supports other form of expressions such as function expressions, cast expressions, etc. The expressions in Spark can be broadly classified as:

  • 不接受 Null 的表达式Null intolerant expressions
  • 可处理 NULL 值操作数的表达式Expressions that can process NULL value operands
    • 这些表达式的结果取决于表达式本身。The result of these expressions depends on the expression itself.

不接受 Null 的表达式Null intolerant expressions

当表达式的一个或多个参数为 NULL,且大多数表达式都属于此类别时,不接受 Null 的表达式将返回 NULLNull intolerant expressions return NULL when one or more arguments of expression are NULL and most of the expressions fall in this category.

示例Examples

SELECT concat('John', null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT positive(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT to_date(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

可处理 Null 值操作数的表达式Expressions that can process null value operands

此类表达式旨在处理 NULL 值。This class of expressions are designed to handle NULL values. 这些表达式的结果取决于表达式本身。The result of the expressions depends on the expression itself. 例如,函数表达式 isnull 对 Null 输入返回 true,对非 Null 输入返回 false,而函数 coalesce 则返回其操作数列表中的第一个非 NULL 值。As an example, function expression isnull returns a true on null input and false on non null input where as function coalesce returns the first non NULL value in its list of operands. 不过,coalesce 在其所有操作数均为 NULL 时会返回 NULLHowever, coalesce returns NULL when all its operands are NULL. 以下是此类别表达式的部分列表。Below is an incomplete list of expressions of this category.

  • COALESCECOALESCE
  • NULLIFNULLIF
  • IFNULLIFNULL
  • NVLNVL
  • NVL2NVL2
  • ISNANISNAN
  • NANVLNANVL
  • ISNULLISNULL
  • ISNOTNULLISNOTNULL
  • ATLEASTNNONNULLSATLEASTNNONNULLS
  • ININ

示例Examples

SELECT isnull(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Returns the first occurrence of non `NULL` value.
SELECT coalesce(null, null, 3, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|                3|
+-----------------+

-- Returns `NULL` as all its operands are `NULL`.
SELECT coalesce(null, null, null, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT isnan(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

内置聚合表达式Built-in aggregate expressions

聚合函数通过处理一组输入行来计算单个结果。Aggregate functions compute a single result by processing a set of input rows. 以下规则规定聚合函数如何处理 NULL 值。Below are the rules of how NULL values are handled by aggregate functions.

  • 所有聚合函数均不处理 NULL 值。NULL values are ignored from processing by all the aggregate functions.
    • 唯一的例外是 COUNT(*) 函数。Only exception to this rule is COUNT(*) function.
  • 所有输入值均为 NULL 或输入数据集为空时,某些聚合函数将返回 NULLSome aggregate functions return NULL when all input values are NULL or the input data set is empty. 这些函数的列表如下:The list of these functions is:
    • MAX
    • MIN
    • SUM
    • AVG
    • EVERY
    • ANY
    • SOME

示例Examples

-- `count(*)` does not skip `NULL` values.
SELECT count(*) FROM person;
+--------+
|count(1)|
+--------+
|       7|
+--------+

-- `NULL` values in column `age` are skipped from processing.
SELECT count(age) FROM person;
+----------+
|count(age)|
+----------+
|         5|
+----------+

-- `count(*)` on an empty input set returns 0. This is unlike the other
-- aggregate functions, such as `max`, which return `NULL`.
SELECT count(*) FROM person where 1 = 0;
+--------+
|count(1)|
+--------+
|       0|
+--------+

-- `NULL` values are excluded from computation of maximum value.
SELECT max(age) FROM person;
+--------+
|max(age)|
+--------+
|      50|
+--------+

-- `max` returns `NULL` on an empty input set.
SELECT max(age) FROM person where 1 = 0;
+--------+
|max(age)|
+--------+
|    null|
+--------+

WHEREHAVINGJOIN 子句中的条件表达式Condition expressions in WHERE, HAVING, and JOIN clauses

WHEREHAVING 运算符根据用户指定的条件对行进行筛选。WHERE, HAVING operators filter rows based on the user specified condition. JOIN 运算符用于根据联接条件合并两个表中的行。A JOIN operator is used to combine rows from two tables based on a join condition. 对于这三个运算符来说,条件表达式是布尔表达式,可返回 TrueFalseUnknown (NULL)For all the three operators, a condition expression is a boolean expression and can return True, False or Unknown (NULL). 如果条件的结果为 True,则它们“符合标准”。They are “satisfied” if the result of the condition is True.

示例Examples

-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
SELECT * FROM person WHERE age > 0;
+--------+---+
|    name|age|
+--------+---+
|Michelle| 30|
|    Fred| 50|
|    Mike| 18|
|     Dan| 50|
|     Joe| 30|
+--------+---+

-- `IS NULL` expression is used in disjunction to select the persons
-- with unknown (`NULL`) records.
SELECT * FROM person WHERE age > 0 OR age IS NULL;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- Person with unknown(`NULL`) ages are skipped from processing.
SELECT * FROM person GROUP BY age HAVING max(age) > 18;
+---+--------+
|age|count(1)|
+---+--------+
| 50|       2|
| 30|       2|
+---+--------+

-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
SELECT * FROM person p1, person p2
    WHERE p1.age = p2.age
    AND p1.name = p2.name;
+--------+---+--------+---+
|    name|age|    name|age|
+--------+---+--------+---+
|Michelle| 30|Michelle| 30|
|    Fred| 50|    Fred| 50|
|    Mike| 18|    Mike| 18|
|     Dan| 50|     Dan| 50|
|     Joe| 30|     Joe| 30|
+--------+---+--------+---+

-- The age column from both legs of join are compared using null-safe equal which
-- is why the persons with unknown age (`NULL`) are qualified by the join.
SELECT * FROM person p1, person p2
    WHERE p1.age <=> p2.age
    AND p1.name = p2.name;
+--------+----+--------+----+
|    name| age|    name| age|
+--------+----+--------+----+
|  Albert|null|  Albert|null|
|Michelle|  30|Michelle|  30|
|    Fred|  50|    Fred|  50|
|    Mike|  18|    Mike|  18|
|     Dan|  50|     Dan|  50|
|   Marry|null|   Marry|null|
|     Joe|  30|     Joe|  30|
+--------+----+--------+----+

聚合运算符(GROUP BYDISTINCTAggregate operators (GROUP BY, DISTINCT)

比较运算符中所述,两个 NULL 值不相等。As discussed in Comparison operators, two NULL values are not equal. 但为了分组和进行不同的处理,具有 NULL data 的两个或多个值会分组到同一 Bucket。However, for the purpose of grouping and distinct processing, the two or more values with NULL dataare grouped together into the same bucket. 此行为符合 SQL 标准,适用于其他企业数据库管理系统。This behavior is conformant with SQL standard and with other enterprise database management systems.

示例Examples

-- `NULL` values are put in one bucket in `GROUP BY` processing.
SELECT age, count(*) FROM person GROUP BY age;
+----+--------+
| age|count(1)|
+----+--------+
|null|       2|
|  50|       2|
|  30|       2|
|  18|       1|
+----+--------+

-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
SELECT DISTINCT age FROM person;
+----+
| age|
+----+
|null|
|  50|
|  30|
|  18|
+----+

Sort 运算符(ORDER BY 子句)Sort operator (ORDER BY clause)

Spark SQL 在 ORDER BY 子句中支持 Null 排序规范。Spark SQL supports null ordering specification in ORDER BY clause. Spark 根据 Null 排序规范将所有 NULL 值置于开头或末尾,以处理 ORDER BY 子句。Spark processes the ORDER BY clause by placing all the NULL values at first or at last depending on the null ordering specification. 默认情况下,所有 NULL 值都放在开头。By default, all the NULL values are placed at first.

示例Examples

-- `NULL` values are shown at first and other values
-- are sorted in ascending way.
SELECT age, name FROM person ORDER BY age;
+----+--------+
| age|    name|
+----+--------+
|null|   Marry|
|null|  Albert|
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|    Fred|
|  50|     Dan|
+----+--------+

-- Column values other than `NULL` are sorted in ascending
-- way and `NULL` values are shown at the last.
SELECT age, name FROM person ORDER BY age NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|     Dan|
|  50|    Fred|
|null|   Marry|
|null|  Albert|
+----+--------+

-- Columns other than `NULL` values are sorted in descending
-- and `NULL` values are shown at the last.
SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  50|    Fred|
|  50|     Dan|
|  30|Michelle|
|  30|     Joe|
|  18|    Mike|
|null|   Marry|
|null|  Albert|
+----+--------+

Set 运算符(UNIONINTERSECTEXCEPTSet operators (UNION, INTERSECT, EXCEPT)

在 set 操作的上下文中,以 Null-safe 方式比较 NULL 值是否相等。NULL values are compared in a null-safe manner for equality in the context of set operations. 这意味着在比较行时,两个 NULL 值被视为相等,这一点与常规的 EqualTo (=) 运算符不同。That means when comparing rows, two NULL values are considered equal unlike the regular EqualTo(=) operator.

示例Examples

CREATE VIEW unknown_age SELECT * FROM person WHERE age IS NULL;

-- Only common rows between two legs of `INTERSECT` are in the
-- result set. The comparison between columns of the row are done
-- in a null-safe manner.
SELECT name, age FROM person
    INTERSECT
    SELECT name, age from unknown_age;
+------+----+
|  name| age|
+------+----+
|Albert|null|
| Marry|null|
+------+----+

-- `NULL` values from two legs of the `EXCEPT` are not in output.
-- This basically shows that the comparison happens in a null-safe manner.
SELECT age, name FROM person
    EXCEPT
    SELECT age FROM unknown_age;
+---+--------+
|age|    name|
+---+--------+
| 30|     Joe|
| 50|    Fred|
| 30|Michelle|
| 18|    Mike|
| 50|     Dan|
+---+--------+

-- Performs `UNION` operation between two sets of data.
-- The comparison between columns of the row ae done in
-- null-safe manner.
SELECT name, age FROM person
    UNION
    SELECT name, age FROM unknown_age;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|     Joe|  30|
|Michelle|  30|
|   Marry|null|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
+--------+----+

EXISTSNOT EXISTS 子查询EXISTS and NOT EXISTS subqueries

在 Spark 中,在 WHERE 子句中可使用 EXISTSNOT EXISTS 表达式。In Spark, EXISTS and NOT EXISTS expressions are allowed inside a WHERE clause. 这些是返回 TRUEFALSE 的布尔表达式。These are Boolean expressions that return either TRUE or FALSE. 换句话说,EXISTS 是成员身份条件,当它引用的子查询返回一个或多个行时,它将返回 TRUEIn other words, EXISTS is a membership condition and returns TRUE when the subquery it refers to returns one or more rows. 类似地,NOT EXISTS 是非成员身份条件,当子查询没有返回行时,它将返回 TRUESimilarly, NOT EXISTS is a non-membership condition and returns TRUE when no rows or zero rows are returned from the subquery.

子查询的结果中存在 NULL 不会对这两个表达式产生影响。These two expressions are not affected by presence of NULL in the result of the subquery. 这两个表达式通常更快,因为它们可以在无需特别预配 Null 认知的情况下转换为半联接/反半联接。They are normally faster because they can be converted to semijoins / anti-semijoins without special provisions for null awareness.

示例Examples

-- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
SELECT * FROM person WHERE EXISTS (SELECT null);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
-- subquery produces no rows. In this case, it returns 1 row.
SELECT * FROM person WHERE NOT EXISTS (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- `NOT EXISTS` expression returns `TRUE`.
SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

INNOT IN 子查询IN and NOT IN subqueries

在 Spark 中,在查询的 WHERE 子句中可使用 INNOT IN 表达式。In Spark, IN and NOT IN expressions are allowed inside a WHERE clause of a query. EXISTS 表达式不同,IN 表达式可以返回 TRUEFALSEUNKNOWN (NULL) 值。Unlike the EXISTS expression, IN expression can return a TRUE, FALSE or UNKNOWN (NULL) value. 从概念上讲,IN 表达式在语义上等效于用析取运算符 (OR) 分隔的一组相等条件。Conceptually a IN expression is semantically equivalent to a set of equality condition separated by a disjunctive operator (OR). 例如,c1 IN (1, 2, 3) 在语义上等效于 (C1 = 1 OR c1 = 2 OR c1 = 3)For example, c1 IN (1, 2, 3) is semantically equivalent to (C1 = 1 OR c1 = 2 OR c1 = 3).

至于处理 NULL 值,可根据比较运算符 (=) 和逻辑运算符 (OR) 中的 NULL 值处理推导出语义。As far as handling NULL values are concerned, the semantics can be deduced from the NULL value handling in comparison operators(=) and logical operators(OR). 简而言之,以下规则规定如何计算 IN 表达式的结果。To summarize, below are the rules for computing the result of an IN expression.

  • 在列表中发现提到的非 NULL 值时,返回 TRUETRUE is returned when the non-NULL value in question is found in the list
  • 在列表中未发现非 NULL 值且列表不包含 NULL 值时,返回 FALSEFALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
  • 值为 NULL,或者在列表中未发现非 NULL 值且列表至少包含一个 NULL 值时,返回 UNKNOWNUNKNOWN is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

列表包含 NULL 时,无论输入值如何,NOT IN 始终返回 UNKNOWN。NOT IN always returns UNKNOWN when the list contains NULL, regardless of the input value. 发生此情况的原因是:值不在包含 NULL 的列表中时 IN 返回 UNKNOWN,并且 NOT UNKNOWN 还是 UNKNOWNThis is because IN returns UNKNOWN if the value is not in the list containing NULL, and because NOT UNKNOWN is again UNKNOWN.

示例Examples

-- The subquery has only `NULL` value in its result set. Therefore,
-- the result of `IN` predicate is UNKNOWN.
SELECT * FROM person WHERE age IN (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- The subquery has `NULL` value in the result set as well as a valid
-- value `50`. Rows with age = 50 are returned.
SELECT * FROM person
    WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
|Fred| 50|
| Dan| 50|
+----+---+

-- Since subquery has `NULL` value in the result set, the `NOT IN`
-- predicate would return UNKNOWN. Hence, no rows are
-- qualified for this query.
SELECT * FROM person
    WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
+----+---+