Run your first Structured Streaming workload
This article provides code examples and explains the basic concepts you need to run your first Structured Streaming queries on Azure Databricks. You can use Structured Streaming for near real-time and incremental processing workloads.
Structured Streaming is one of several technologies that power streaming tables in Delta Live Tables. Databricks recommends using Delta Live Tables for all new ETL, ingestion, and Structured Streaming workloads. See What is Delta Live Tables?.
Note
While Delta Live Tables provides a slightly modified syntax for declaring streaming tables, the general syntax for configuring streaming reads and transformations applies to all streaming use cases on Azure Databricks. Delta Live Tables also simplifies streaming by managing state information, metadata, and numerous configurations.
Read from a data stream
You can use Structured Streaming to incrementally ingest data from supported data sources. Some of the most common data sources used in Azure Databricks Structured Streaming workloads include the following:
- Data files in cloud object storage
- Message buses and queues
- Delta Lake
Databricks recommends using Auto Loader for streaming ingestion from cloud object storage. Auto Loader supports most file formats supported by Structured Streaming. See What is Auto Loader?.
Each data source provides a number of options to specify how to load batches of data. During reader configuration, the main options you might need to set fall into the following categories:
- Options that specify the data source or format (for example, file type, delimiters, and schema).
- Options that configure access to source systems (for example, port settings and credentials).
- Options that specify where to start in a stream (for example, Kafka offsets or reading all existing files).
- Options that control how much data is processed in each batch (for example, max offsets, files, or bytes per batch).
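As a rough illustration of these categories, the following sketch groups the options for a Kafka reader into a dictionary and applies them in one call. The server address, topic, and batch limit are placeholder values, and build_kafka_reader is a hypothetical helper rather than part of any Databricks API.

```python
# Kafka reader options, grouped by the categories listed above.
# The server address and topic are placeholders, not real endpoints.
kafka_reader_options = {
    # Access to the source system
    "kafka.bootstrap.servers": "<server:port>",
    # What to read from the source
    "subscribe": "<topic>",
    # Where to start in the stream
    "startingOffsets": "earliest",
    # How much data each batch processes
    "maxOffsetsPerTrigger": "10000",
}


def build_kafka_reader(spark, options):
    """Return a configured streaming reader; call .load() on it to get a DataFrame."""
    return spark.readStream.format("kafka").options(**options)
```

In a notebook, build_kafka_reader(spark, kafka_reader_options).load() would return a streaming DataFrame. Keeping the options in a dictionary makes it easy to vary them between environments.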
Use Auto Loader to read streaming data from object storage
The following example demonstrates loading JSON data with Auto Loader, which uses cloudFiles to denote format and options. The schemaLocation option enables schema inference and evolution. Paste the following code in a Databricks notebook cell and run the cell to create a streaming DataFrame named raw_df:
file_path = "/databricks-datasets/structured-streaming/events"
checkpoint_path = "/tmp/ss-tutorial/_checkpoint"
raw_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(file_path)
)
Like other read operations on Azure Databricks, configuring a streaming read does not actually load data. You must trigger an action on the data before the stream begins.
Note
Calling display() on a streaming DataFrame starts a streaming job. For most Structured Streaming use cases, the action that triggers a stream should be writing data to a sink. See Preparing your Structured Streaming code for production.
Perform a streaming transformation
Structured Streaming supports most transformations that are available in Azure Databricks and Spark SQL. You can even load MLflow models as UDFs and make streaming predictions as a transformation.
The following code example completes a simple transformation to enrich the ingested JSON data with additional information using Spark SQL functions:
from pyspark.sql.functions import col, current_timestamp
transformed_df = (raw_df.select(
    "*",
    col("_metadata.file_path").alias("source_file"),
    current_timestamp().alias("processing_time")
    )
)
The resulting transformed_df contains query instructions to load and transform each record as it arrives in the data source.
Note
Structured Streaming treats data sources as unbounded or infinite datasets. As such, some transformations are not supported in Structured Streaming workloads because they would require sorting an infinite number of items.
Most aggregations and many joins require managing state information with watermarks, windows, and output mode. See Apply watermarks to control data processing thresholds.
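To make the role of watermarks and windows concrete, here is a minimal sketch of a stateful aggregation. It assumes the events carry a time column holding epoch seconds and an action column; those column names, the 10-minute watermark, and the 1-hour window are all assumptions chosen for illustration.

```python
WATERMARK_DELAY = "10 minutes"  # how late an event may arrive and still be counted
WINDOW_DURATION = "1 hour"      # width of each tumbling window


def count_actions_per_window(events_df):
    """Count events per action in tumbling event-time windows."""
    # Imported inside the function so the definition loads without PySpark installed.
    from pyspark.sql.functions import col, window

    return (events_df
        # Assumption: time holds epoch seconds, so cast it to a timestamp.
        .withColumn("event_time", col("time").cast("long").cast("timestamp"))
        # Allow state for a window to be dropped once the watermark passes it.
        .withWatermark("event_time", WATERMARK_DELAY)
        .groupBy(window(col("event_time"), WINDOW_DURATION), col("action"))
        .count()
    )
```

With append output mode, a window's count is written only after the watermark passes the end of that window, which is what lets Structured Streaming discard the state for closed windows.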
Write to a data sink
A data sink is the target of a streaming write operation. Common sinks used in Azure Databricks streaming workloads include the following:
- Delta Lake
- Message buses and queues
- Key-value databases
As with data sources, most data sinks provide a number of options to control how data is written to the target system. During writer configuration, the main options you might need to set fall into the following categories:
- Output mode (append by default).
- A checkpoint location (required for each writer).
- Trigger intervals; see Configure Structured Streaming trigger intervals.
- Options that specify the data sink or format (for example, file type, delimiters, and schema).
- Options that configure access to target systems (for example, port settings and credentials).
Perform an incremental batch write to Delta Lake
The following example writes to Delta Lake using a specified file path and checkpoint.
Important
Always make sure you specify a unique checkpoint location for each streaming writer you configure. The checkpoint provides the unique identity for your stream, tracking all records processed and state information associated with your streaming query.
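One way to keep checkpoints unique is to derive each path from a name that identifies the writer. The sketch below is only a naming convention: checkpoint_for and the stream names are hypothetical and not part of Structured Streaming.

```python
def checkpoint_for(base_path, stream_name):
    """Return a checkpoint directory reserved for a single streaming writer."""
    return f"{base_path.rstrip('/')}/_checkpoints/{stream_name}"


# Two writers get two distinct checkpoint locations under the same base path.
bronze_checkpoint = checkpoint_for("/tmp/ss-tutorial", "events_bronze")
silver_checkpoint = checkpoint_for("/tmp/ss-tutorial", "events_silver")
```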
The availableNow setting for the trigger instructs Structured Streaming to process all previously unprocessed records from the source dataset and then shut down, so you can safely execute the following code without worrying about leaving a stream running:
target_path = "/tmp/ss-tutorial/"
checkpoint_path = "/tmp/ss-tutorial/_checkpoint"

# Parentheses let the chained calls span multiple lines.
(transformed_df.writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", checkpoint_path)
    .option("path", target_path)
    .start()
)
In this example, no new records arrive in the data source, so running this code again does not ingest any new records.
Warning
Structured Streaming execution can prevent auto termination from shutting down compute resources. To avoid unexpected costs, be sure to terminate streaming queries.
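If you started streams interactively, a short helper like the following can stop them before you leave the notebook. This is a minimal sketch: stop_all_streams is a hypothetical name, and it relies only on the active query list and the stop method exposed by the Spark session.

```python
def stop_all_streams(spark):
    """Stop every active streaming query on this session and return their IDs."""
    stopped = []
    for query in spark.streams.active:
        query.stop()
        stopped.append(str(query.id))
    return stopped
```

Calling stop_all_streams(spark) in a notebook cell stops every query that is still running on that session.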
Preparing your Structured Streaming code for production
Databricks recommends using Delta Live Tables for most Structured Streaming workloads. The following recommendations provide a starting point for preparing Structured Streaming workloads for production:
- Remove unnecessary code from notebooks that would return results, such as display and count.
- Do not run Structured Streaming workloads on interactive clusters; always schedule streams as jobs.
- To help streaming jobs recover automatically, configure jobs with infinite retries.
- Do not use auto-scaling for workloads with Structured Streaming.
For more recommendations, see Production considerations for Structured Streaming.
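As an illustration of the retry recommendation, the following sketch shows how a job definition might express unlimited retries using Jobs API settings. The job name, task key, and notebook path are placeholders, the retry interval is an arbitrary choice, and compute settings are omitted; treat it as a starting point rather than a complete job specification.

```python
# Sketch of job settings for a scheduled streaming task.
# All angle-bracket values are placeholders to fill in.
streaming_job_settings = {
    "name": "<job-name>",
    "max_concurrent_runs": 1,  # avoid running two copies of the same stream
    "tasks": [
        {
            "task_key": "<task-key>",
            "notebook_task": {"notebook_path": "<notebook-path>"},
            "max_retries": -1,  # -1 requests indefinite retries
            "min_retry_interval_millis": 60000,  # wait one minute between attempts
        }
    ],
}
```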
Read data from Delta Lake, transform, and write to Delta Lake
Delta Lake has extensive support for working with Structured Streaming as both a source and a sink. See Delta table streaming reads and writes.
The following example shows the syntax to incrementally load all new records from a Delta table, join them with a snapshot of another Delta table, and write the results to a Delta table:
(spark.readStream
    .table("<table-name1>")
    .join(spark.read.table("<table-name2>"), on="<id>", how="left")
    .writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", "<checkpoint-path>")
    .toTable("<table-name3>")
)
You must have proper permissions configured to read source tables and write to target tables and the specified checkpoint location. Fill in all parameters denoted with angle brackets (<>) using the relevant values for your data sources and sinks.
Note
Delta Live Tables provides a fully declarative syntax for creating Delta Lake pipelines and manages properties like triggers and checkpoints automatically. See What is Delta Live Tables?.
Read data from Kafka, transform, and write to Kafka
Apache Kafka and other messaging buses provide some of the lowest latency available for large datasets. You can use Azure Databricks to apply transformations to data ingested from Kafka and then write data back to Kafka.
Note
Writing data to cloud object storage adds latency. If you want to store data from a messaging bus in Delta Lake but require the lowest latency possible for streaming workloads, Databricks recommends configuring separate streaming jobs to ingest data to the lakehouse and apply near real-time transformations for downstream messaging bus sinks.
The following code example demonstrates a simple pattern to enrich data from Kafka by joining it with data in a Delta table and then writing back to Kafka:
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<server:port>")
    .option("subscribe", "<topic>")
    .option("startingOffsets", "latest")
    .load()
    .join(spark.read.table("<table-name>"), on="<id>", how="left")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<server:port>")
    .option("topic", "<topic>")
    .option("checkpointLocation", "<checkpoint-path>")
    .start()
)
You must have proper permissions configured for access to your Kafka service. Fill in all parameters denoted with angle brackets (<>) using the relevant values for your data sources and sinks. See Stream processing with Apache Kafka and Azure Databricks.
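The Kafka source delivers each record's key and value as binary columns, and the Kafka sink expects a value column, so a real pipeline usually decodes the payload before the join and re-encodes it before the write. The following sketch assumes JSON payloads with id and action fields; that schema and both helper names are hypothetical.

```python
# Hypothetical payload schema as a DDL string; replace with your topic's fields.
PAYLOAD_SCHEMA = "id STRING, action STRING"


def parse_kafka_value(kafka_df, payload_schema=PAYLOAD_SCHEMA):
    """Decode the binary Kafka value column as JSON and flatten it into columns."""
    return (kafka_df
        .selectExpr(f"from_json(CAST(value AS STRING), '{payload_schema}') AS data")
        .select("data.*")
    )


def to_kafka_record(df, key_column="id"):
    """Pack all columns into a JSON value string, keyed by key_column."""
    return df.selectExpr(
        f"CAST({key_column} AS STRING) AS key",
        "to_json(struct(*)) AS value",
    )
```

In the pattern above, parse_kafka_value would be applied to the loaded stream before the join, and to_kafka_record to the joined result before writeStream.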