Use scheduler pools for multiple streaming workloads
To enable multiple streaming queries to execute jobs concurrently on a shared cluster, you can configure queries to execute in separate scheduler pools.
How do scheduler pools work?
By default, all queries started in a notebook run in the same fair scheduling pool. Jobs triggered by all of the streaming queries in a notebook run one after another in first-in, first-out (FIFO) order, which can introduce unnecessary latency because the queries do not share cluster resources efficiently.
Scheduler pools allow you to declare which Structured Streaming queries share compute resources.
The following example assigns query1 to a dedicated pool, while query2 and query3 share a scheduler pool.
# Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("delta").start(path1)
# Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("delta").start(path2)
# Run streaming query3 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query3").format("delta").start(path3)
Note
The local property configuration must be in the same notebook cell where you start your streaming query.
See the Apache fair scheduler documentation for more details.
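That documentation also covers pool properties such as schedulingMode, weight, and minShare, which are defined in a fair scheduler allocation file (spark.scheduler.allocation.file) typically set as part of the cluster's Spark configuration rather than from a notebook. As a quick check from Python, you can read the cluster's scheduler mode; the "FIFO" fallback below is only the Apache Spark default returned when the value is not set explicitly:

# Returns "FAIR" when fair scheduling is enabled on the cluster
spark.sparkContext.getConf().get("spark.scheduler.mode", "FIFO")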