Prevent duplicated columns when joining two DataFrames

If you perform a join in Spark and don't specify your join correctly, you'll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don't have duplicated columns.

Join on columns

If you join on columns, you get duplicated columns.

Scala

%scala

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10))
val left = llist.toDF("name", "date", "duration")
val right = Seq(("alice", 100), ("bob", 23)).toDF("name", "upload")

val df = left.join(right, left.col("name") === right.col("name"))

Python

%python

llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23', 10)]
left = spark.createDataFrame(llist, ['name', 'date', 'duration'])
right = spark.createDataFrame([('alice', 100), ('bob', 23)], ['name', 'upload'])

df = left.join(right, left.name == right.name)

Solution

Specify the join column as an array type or string.

Scala

%scala

val df = left.join(right, Seq("name"))
%scala

val df = left.join(right, "name")

Python

%python
df = left.join(right, ["name"])
%python
df = left.join(right, "name")

R

First register the DataFrames as tables.

%python

left.createOrReplaceTempView("left_test_table")
right.createOrReplaceTempView("right_test_table")
%r
library(SparkR)
sparkR.session()
left <- sql("SELECT * FROM left_test_table")
right <- sql("SELECT * FROM right_test_table")

The above code results in duplicate columns. The following code does not.

%r
head(drop(join(left, right, left$name == right$name), left$name))

Join DataFrames with duplicated columns notebook

Get notebook