Spark driver overloaded
So you've determined that your driver is overloaded. The most common cause is too much running concurrently on the cluster: too many streams, queries, or Spark jobs (some customers use threads to run many Spark jobs at once).
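For context, here is a minimal sketch of that thread-based pattern, assuming the ambient `spark` SparkSession that Databricks notebooks provide; the table names and thread count are placeholders. Every job submitted this way is planned and coordinated by the same driver, so a high thread count multiplies driver load.

```python
# Hypothetical sketch of running many Spark jobs concurrently from driver threads.
# `spark` is the SparkSession that Databricks notebooks provide automatically.
from concurrent.futures import ThreadPoolExecutor

def refresh_count(table_name: str) -> int:
    # Each count() triggers its own Spark job; all planning happens on the driver.
    return spark.table(table_name).count()

tables = [f"analytics.events_{i}" for i in range(50)]

# 32 threads submitting jobs at once means up to 32 jobs competing for the driver.
with ThreadPoolExecutor(max_workers=32) as pool:
    counts = dict(zip(tables, pool.map(refresh_count, tables)))
```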
It could also be that you're running non-Spark code on your cluster that keeps the driver busy. If you see gaps in your timeline caused by non-Spark code, your workers are sitting idle and likely wasting money during those gaps. This may be intentional and unavoidable, but if you can rewrite that code to use Spark, you will fully utilize the cluster. Start with this tutorial to learn how to work with Spark.
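As an illustration (not from the original article), here is a sketch of what such a rewrite can look like, again assuming the ambient `spark` session; the paths, column name, and table name are placeholders.

```python
# Hypothetical sketch: moving driver-only work onto the cluster.
#
# Driver-only version (workers idle): the driver parses every file by itself.
#
#   import json, pathlib
#   rows = []
#   for path in pathlib.Path("/dbfs/raw/events").glob("*.json"):
#       rows.extend(json.loads(line) for line in path.read_text().splitlines())
#
# Spark version: the same read and aggregation run in parallel on the workers.
events = spark.read.json("dbfs:/raw/events/")
daily = events.groupBy("event_date").count()
daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")
```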
If too many things are running on the cluster simultaneously, you have three options:
- Increase the size of your driver
- Reduce the concurrency (see the sketch after this list)
- Spread the load over multiple clusters
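If you reduce concurrency, a common low-effort change is simply lowering how many jobs you submit at once. A minimal sketch, using the same hypothetical thread-pool pattern as above:

```python
# Hypothetical sketch: cap how many Spark jobs run at the same time.
# `spark` is the SparkSession provided by Databricks notebooks.
from concurrent.futures import ThreadPoolExecutor

tables = [f"analytics.events_{i}" for i in range(50)]

def refresh_count(table_name: str) -> int:
    return spark.table(table_name).count()

# max_workers=4 means at most four jobs compete for the driver at once; the
# batch takes longer end to end, but the driver stays responsive.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = dict(zip(tables, pool.map(refresh_count, tables)))
```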
Azure Databricks recommends you first try doubling the size of the driver and see how that impacts your job.
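You can change the driver size from the cluster configuration UI. The sketch below shows one way this might be done programmatically, assuming the databricks-sdk Python client and its `clusters.edit` call (which mirrors the Clusters REST API); the cluster ID and node type are placeholders.

```python
# Hypothetical sketch: bump only the driver node type via the Databricks SDK.
# Note: clusters.edit replaces the cluster spec, so real code should carry over
# every field of the existing configuration, not just the ones shown here.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads host/token from the environment or a config profile
current = w.clusters.get(cluster_id="0123-456789-abcdefgh")  # placeholder cluster ID

w.clusters.edit(
    cluster_id=current.cluster_id,
    cluster_name=current.cluster_name,
    spark_version=current.spark_version,
    node_type_id=current.node_type_id,
    num_workers=current.num_workers,
    driver_node_type_id="Standard_E16ds_v5",  # placeholder: roughly double the old driver
)
```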