Configure Structured Streaming trigger intervals
Apache Spark Structured Streaming processes data incrementally. Controlling the trigger interval for batch processing lets you use Structured Streaming for workloads ranging from near-real-time processing, to refreshing databases every 5 minutes or once per hour, to batch processing all new data for a day or week.
Because Databricks Auto Loader uses Structured Streaming to load data, understanding how triggers work provides you with the greatest flexibility to control costs while ingesting data with the desired frequency.
Specifying time-based trigger intervals
Structured Streaming refers to time-based trigger intervals as "fixed interval micro-batches". Using the processingTime keyword, specify a time duration as a string, such as .trigger(processingTime='10 seconds').
When you specify a trigger interval that is too small (less than tens of seconds), the system may perform unnecessary checks to see if new data has arrived. Configure your processing time to balance latency requirements and the rate at which data arrives in the source.
Configuring incremental batch processing
Important
In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads.
The available now trigger option consumes all available records as an incremental batch, and lets you configure batch size with options such as maxBytesPerTrigger (sizing options vary by data source).
Azure Databricks supports using Trigger.AvailableNow for incremental batch processing from many Structured Streaming sources. The following table includes the minimum supported Databricks Runtime version required for each data source:
| Source | Minimum Databricks Runtime version |
| --- | --- |
| File sources (JSON, Parquet, etc.) | 9.1 LTS |
| Delta Lake | 10.4 LTS |
| Auto Loader | 10.4 LTS |
| Apache Kafka | 10.4 LTS |
| Kinesis | 13.1 |
What is the default trigger interval?
Structured Streaming defaults to fixed interval micro-batches of 500ms. Databricks recommends you always specify a tailored trigger to minimize costs associated with checking if new data has arrived and processing undersized batches.
Changing trigger intervals between runs
You can change the trigger interval between runs while using the same checkpoint.
If a Structured Streaming job stops while a micro-batch is being processed, that micro-batch must complete before the new trigger interval applies. As such, you might observe a micro-batch processing with the previously specified settings after changing the trigger interval.
When moving from a time-based interval to AvailableNow, this might result in processing an additional micro-batch before processing all available records as an incremental batch.
When moving from AvailableNow to a time-based interval, this might result in continuing to process all records that were available when the last AvailableNow job triggered. This is the expected behavior.
Note
If you are trying to recover from query failure associated with an incremental batch, changing the trigger interval does not solve this problem because the batch must still be completed. Databricks recommends scaling up the compute capacity used to process the batch to try to resolve the issue. In rare cases, you might need to restart the stream with a new checkpoint.
What is continuous processing mode?
Apache Spark supports an additional trigger interval known as Continuous Processing. This mode has been classified as experimental since Spark 2.3; consult with your Azure Databricks account team to make sure you understand the trade-offs of this processing model.
Note that this continuous processing mode is entirely unrelated to continuous processing as applied in Delta Live Tables.