Broadcast join exceeds threshold, returns out of memory error

Resolve an Apache Spark OutOfMemorySparkException error that occurs when a table joined with BroadcastHashJoin exceeds the limit set by spark.sql.autoBroadcastJoinThreshold.

Written by sandeep.chandran

Last published at: May 23rd, 2022

Problem

You are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table.

Despite the total size exceeding the limit set by spark.sql.autoBroadcastJoinThreshold, BroadcastHashJoin is used and Apache Spark returns an OutOfMemorySparkException error.

org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=1073741824. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1
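
The error message itself suggests a per-query workaround: setting spark.sql.autoBroadcastJoinThreshold to -1. As a minimal sketch, assuming a PySpark notebook where `spark` is the active SparkSession, the setting could be switched off around a single query and restored afterwards. The helper name `broadcasts_disabled` is hypothetical and not part of Spark; only `spark.conf.get` and `spark.conf.set` are used.

```python
from contextlib import contextmanager

THRESHOLD_KEY = "spark.sql.autoBroadcastJoinThreshold"


@contextmanager
def broadcasts_disabled(spark):
    """Temporarily set the auto-broadcast threshold to -1 (hypothetical helper).

    `spark` is assumed to be an active SparkSession; only its
    `conf.get` and `conf.set` methods are called.
    """
    previous = spark.conf.get(THRESHOLD_KEY)
    spark.conf.set(THRESHOLD_KEY, "-1")
    try:
        yield spark
    finally:
        # Restore the earlier value even if the query raised an error.
        spark.conf.set(THRESHOLD_KEY, previous)
```

In a notebook this might be used as `with broadcasts_disabled(spark): ...`, with the join built and run inside the `with` block so that the plan is produced while the setting is in effect.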

Cause

This is due to a limitation in Spark’s size estimator, which can underestimate the size of a DataFrame.

If the estimated size of one of the DataFrames is less than the autoBroadcastJoinThreshold, Spark may use BroadcastHashJoin to perform the join. If the available nodes do not have enough resources to accommodate the broadcast DataFrame, your job fails due to an out of memory error.
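
The decision described above can be illustrated with a simplified model. This is a sketch, not Spark's actual planner code: the function name `may_auto_broadcast` and the sizes are hypothetical, and the default of 10 MB is the documented default for spark.sql.autoBroadcastJoinThreshold. It shows why an underestimate matters: the check only ever sees the estimate, never the real size.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # documented default (10 MB)


def may_auto_broadcast(estimated_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified model of the broadcast check (hypothetical helper).

    A join side is a broadcast candidate when its estimated size is
    non-negative and does not exceed the threshold. A threshold of -1
    therefore disables automatic broadcasts.
    """
    return 0 <= estimated_bytes <= threshold_bytes


# Hypothetical numbers: the estimate is 8 MB, the real data is 3 GB.
estimated = 8 * 1024 * 1024
actual = 3 * 1024 ** 3

print(may_auto_broadcast(estimated))      # True: BroadcastHashJoin may be chosen
print(may_auto_broadcast(actual))         # False: an accurate estimate rules it out
print(may_auto_broadcast(estimated, -1))  # False: automatic broadcasts are disabled
```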

Solution

There are three different ways to mitigate this issue.

- Use ANALYZE TABLE (AWS | Azure) to collect details and compute statistics about the tables before attempting a join.
- Cache the table (AWS | Azure) you are broadcasting.
  1. Run explain on your join command to return the physical plan.
     ```
     %sql
     explain(<join command>)
     ```
  2. Review the physical plan. If the broadcast join returns BuildLeft, cache the left side table. If the broadcast join returns BuildRight, cache the right side table.
- In Databricks Runtime 7.0 and above, set the join type to SortMergeJoin with join hints enabled.
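
The three options can be sketched as a small Python helper that builds the corresponding SQL statements without running them. This is a sketch under several assumptions: the table names `orders` and `customers`, the key `customer_id`, and both function names are hypothetical. The plan text is assumed to be the output of explain on your join. The `MERGE` hint is the Spark 3.0 join hint that requests a sort-merge join.

```python
import re


def broadcast_build_side(plan_text):
    """Return 'left', 'right', or None from the BroadcastHashJoin line of a plan."""
    for line in plan_text.splitlines():
        if "BroadcastHashJoin" in line:
            match = re.search(r"Build(Left|Right)", line)
            if match:
                return match.group(1).lower()
    return None


def mitigation_statements(left_table, right_table, join_key, plan_text):
    """Build SQL statements for the three options; nothing is executed here."""
    # Option 1: compute statistics so the size estimate is better informed.
    statements = [
        f"ANALYZE TABLE {left_table} COMPUTE STATISTICS",
        f"ANALYZE TABLE {right_table} COMPUTE STATISTICS",
    ]
    # Option 2: cache whichever side the plan marks as the build side.
    side = broadcast_build_side(plan_text)
    build_table = {"left": left_table, "right": right_table}.get(side)
    if build_table is not None:
        statements.append(f"CACHE TABLE {build_table}")
    # Option 3: ask for a sort-merge join with a join hint (Spark 3.0 and above).
    statements.append(
        f"SELECT /*+ MERGE({left_table}) */ * FROM {left_table} "
        f"JOIN {right_table} "
        f"ON {left_table}.{join_key} = {right_table}.{join_key}"
    )
    return statements


# Hypothetical plan line, shaped like the BroadcastHashJoin entry of a physical plan.
sample_plan = (
    "*(2) BroadcastHashJoin [customer_id#1L], [customer_id#7L], Inner, BuildRight"
)
for statement in mitigation_statements("orders", "customers", "customer_id", sample_plan):
    print(statement)
```

The options are alternatives, so in practice you would pick one and run only its statements, for example with `spark.sql(statement)` in a notebook.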


Databricks Knowledge Base
