Nulls and empty strings in a partitioned column save as nulls

Problem

If you save data containing both empty strings and null values in a column on which the table is partitioned, both values become null after writing and reading the table.

To illustrate this, create a simple DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Column "b" holds three empty strings, one regular value, and one null.
val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

At this point, if you display the contents of df, it appears unchanged: the empty strings and the null value in column b are both still present.

Write df partitioned by column b, read it again, and display it. The empty strings are replaced by null values.
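The write and read steps are not shown in the article, so the following is only a minimal sketch of one way to reproduce them. It assumes the `df` and `spark` values from the earlier snippet, Parquet as the storage format, and a hypothetical scratch path `/tmp/partition_null_demo`; none of these choices come from the original.

```scala
// Hypothetical scratch location (an assumption); replace with a path you can write to.
val path = "/tmp/partition_null_demo"

// Partition the output on column "b", which holds "", "hello", and null.
df.write.mode("overwrite").partitionBy("b").parquet(path)

// Read the data back; values of "b" are recovered from the partition directory names.
val readBack = spark.read.parquet(path)
readBack.orderBy("a").show()
```

Comparing this output with `df.orderBy("a").show()` should make the change in column b visible row by row.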

Cause

This is the expected behavior. It is inherited from Apache Hive.
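One way to see where the distinction is lost is to list the partition directories that a partitioned write produces. The sketch below is an illustration under assumptions, not part of the original article: it reuses `df` and `spark` from the earlier snippet and writes to a hypothetical path `/tmp/partition_null_dirs`. In Hive-style layouts, rows without a usable partition value are typically placed under a single default partition directory (by default named `__HIVE_DEFAULT_PARTITION__`), which would explain why empty strings and nulls cannot be told apart on read.

```scala
import org.apache.hadoop.fs.Path

// Hypothetical scratch location (an assumption); replace with a path you can write to.
val dirPath = new Path("/tmp/partition_null_dirs")

// Write the DataFrame partitioned on column "b".
df.write.mode("overwrite").partitionBy("b").parquet(dirPath.toString)

// List the partition directories created under the output path.
val fs = dirPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(dirPath)
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .foreach(println)
```

If the empty-string rows and the null row land in the same directory, the reader has no way to restore the original empty strings.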

Solution

In general, you shouldn't use both null and empty strings as values in a partitioned column.
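If both values do occur in the source data and the difference matters, one possible approach is to map empty strings to an explicit placeholder before writing. This is a sketch under assumptions rather than a recommendation from the article: the sentinel `"__EMPTY__"` and the name `normalized` are hypothetical, and `df` is the DataFrame from the earlier snippet.

```scala
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical sentinel (an assumption); pick a value that cannot occur in real data.
val emptySentinel = "__EMPTY__"

// Replace empty strings in "b" with the sentinel; nulls and other values pass through.
val normalized = df.withColumn(
  "b",
  when(col("b") === "", lit(emptySentinel)).otherwise(col("b"))
)

normalized.orderBy("a").show()
```

Writing `normalized` instead of `df` keeps the two cases in separate partitions, at the cost of having to translate the sentinel back to an empty string when reading.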