Error when downloading full results after join

If you have duplicate columns after a join, you will get an error when trying to download the full results.

Last published at: May 23rd, 2022

Problem

You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error.

Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/

Reproduce error

Create two tables.

%python

from pyspark.sql.functions import *

df = spark.range(12000)
df = df.withColumn("col2",lit("test"))
df.createOrReplaceTempView("table1")

df1 = spark.range(5)
df1.createOrReplaceTempView("table2")

Perform left outer join on the tables.

%sql

select * from table1 t1 left join table2 t2 on t1.id = t2.id

Click Download preview. A CSV file downloads.
Click Download full results. An error is generated.

Cause

Download preview works because this is a frontend only operation that runs in the browser. No constraints are checked and only 1000 rows are included in the CSV file.
Download full results re-executes the query in Apache Spark and writes the CSV file internally. The error occurs when duplicate columns are found after a join operation.

Solution

Option 1

If you select all the required columns, and avoid duplicate columns after the join operation, you will not get the error and can download the full result.

%sql

select t1.id, t1.col2 from table1 t1 left join table2 t2 on t1.id = t2.id

Option 2

You can use DataFrames to prevent duplicated columns. If there are no duplicated columns after the join operation, you will not get the error and can download the full result.

%python

result_df = df.join(df1, ["id"],"left")
display(result_df)

Databricks Knowledge Base

Contact Us