Prevent duplicated columns when joining two DataFrames
If you perform a join in Spark and don't specify your join correctly, you'll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don't have duplicated columns.
Join on columns
If you join on columns, you get duplicated columns.
Scala
%scala
val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")
val df = left.join(right, left.col("name") === right.col("name"))
Python
%python
llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)]
left = spark.createDataFrame(llist, ['name','date','duration'])
right = spark.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])
df = left.join(right, left.name == right.name)
Solution
Specify the join column as an array type or string.
Scala
%scala
val df = left.join(right, Seq("name"))
%scala
val df = left.join(right, "name")
Python
%python
df = left.join(right, ["name"])
%python
df = left.join(right, "name")
R
First register the DataFrames as tables.
%python
left.createOrReplaceTempView("left_test_table")
right.createOrReplaceTempView("right_test_table")
%r
library(SparkR)
sparkR.session()
left <- sql("SELECT * FROM left_test_table")
right <- sql("SELECT * FROM right_test_table")
The above code results in duplicate columns. The following code does not.
%r
head(drop(join(left, right, left$name == right$name), left$name))