Nulls and empty strings in a partitioned column save as nulls
Problem
If you save data that contains both empty strings and null values in a column on which the table is partitioned, both values become null after you write and then read the table.
To illustrate this, create a simple DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Column "b" mixes empty strings, a regular string, and a null.
val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
At this point, if you display the contents of df, it appears unchanged.
Write df, read it again, and display it. The empty strings are replaced by null values.
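The write-and-read step can be sketched as follows. This is a minimal illustration, not the article's exact code: it assumes the table is partitioned by column b, stored as Parquet, and written in overwrite mode. The helper name roundTrip and the path /tmp/null-partition-demo are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: writes `df` partitioned by `partitionCol`, then reads it back.
// Parquet format and overwrite mode are assumptions made for this sketch.
def roundTrip(df: DataFrame, path: String, partitionCol: String): DataFrame = {
  df.write.mode(SaveMode.Overwrite).partitionBy(partitionCol).parquet(path)
  df.sparkSession.read.parquet(path)
}

// Hypothetical scratch location; substitute a path you can write to.
val reloaded = roundTrip(df, "/tmp/null-partition-demo", "b")

// Rows 1 through 3, which held empty strings in "b", now show null there.
reloaded.orderBy("a").show()
```

Counting the rows where b is null before and after the round trip is a quick way to confirm the change on your own data.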
Cause
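This is the expected behavior. It is inherited from Apache Hive: rows whose partition value is null or an empty string are all written to the same default partition directory (__HIVE_DEFAULT_PARTITION__), so the two cannot be told apart when the table is read back.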
Solution
In general, you shouldn't use both null and empty strings as values in a partitioned column.
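If the distinction between the two values matters, one possible approach is to replace empty strings with an explicit placeholder before writing. The sketch below is only an illustration of that idea, not a documented recommendation: the function name replaceEmpty and the placeholder value __EMPTY__ are hypothetical, and the placeholder must not collide with any real value in the column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper: swaps empty strings in `column` for `placeholder`.
// Null values do not match the comparison, so they fall through and stay null.
def replaceEmpty(df: DataFrame, column: String, placeholder: String): DataFrame =
  df.withColumn(
    column,
    when(col(column) === "", lit(placeholder)).otherwise(col(column))
  )

// "__EMPTY__" is an illustrative placeholder, chosen as an assumption.
val cleaned = replaceEmpty(df, "b", "__EMPTY__")
```

After this step, column b contains either a non-empty string or null, so a partitioned write no longer merges two distinct values. Readers of the table need to know about the placeholder and map it back to an empty string if required.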