Azure Databricks offers various methods for ingesting data into a lakehouse backed by Delta Lake. This article lists the supported ingestion tools and provides guidance on which method to use based on criteria like data source and latency.
You can ingest data into Databricks using the following methods:
- Batch ingestion of a set of data rows for infrequent processing
- Streaming ingestion of individual data rows or sets of data rows as they arrive for real-time processing
Ingested data is loaded into Delta tables that can then be used across your downstream data and AI use cases. Because of Databricks' Lakehouse architecture, you do not need to duplicate your data across use cases, and you can leverage Unity Catalog for centralized access control, auditing, lineage, and data discovery across all of your data.
With batch ingestion, you load data as sets (or batches) of rows into Databricks, often based on a schedule (for example, every day) or triggered manually. This represents the "extract" piece of traditional extract, transform, load (ETL) use cases. You can use batch ingestion to load data from:
- Local files like CSVs
- Cloud object storage, including Amazon S3 and Azure Data Lake Storage
- SaaS applications like Salesforce and databases like SQL Server
Batch ingestion supports a wide range of file source formats, including CSV, TSV, JSON, XML, Avro, ORC, Parquet, and text files.
Databricks supports both traditional batch ingestion and incremental batch ingestion. While traditional batch ingestion processes all records each time it runs, incremental batch ingestion automatically detects new records in the data source and ignores records that have already been ingested. This means less data needs to be processed and, as a result, ingestion jobs run faster and use compute resources more efficiently.
You can upload local data files or download files from a public URL using the add data UI. See Upload files.
This section describes supported incremental batch ingestion tools.
The CREATE STREAMING TABLE SQL command lets you load data incrementally into streaming tables from cloud object storage. See CREATE STREAMING TABLE.
Example: Incremental batch ingestion using streaming tables
CREATE OR REFRESH STREAMING TABLE customers
AS SELECT * FROM STREAM read_files(
"/databricks-datasets/retail-org/customers/",
format => "csv"
)
Auto Loader, the built-in cloud object storage connector, allows you to incrementally and efficiently process new data files as they arrive in Amazon S3 or Azure Data Lake Storage Gen2 (ADLS2). See Auto Loader.
Example: Incremental batch ingestion using Auto Loader
# Incrementally read new CSV files as they arrive in cloud object storage
df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("rescuedDataColumn", "_rescued_data")
    .schema("/databricks-datasets/retail-org/customers/schema")
    .load("/databricks-datasets/retail-org/customers/"))
Lakeflow Connect provides fully managed connectors to ingest data from SaaS applications and databases. Managed connectors are available using the following:
- Databricks UI
- Databricks CLI
- Databricks APIs
- Databricks SDKs
- Databricks Asset Bundles
With streaming ingestion, you continuously load data rows or batches of data rows as they are generated, so you can query the data as it arrives in near real time. You can use streaming ingestion to load streaming data from sources like Apache Kafka, Amazon Kinesis, and Apache Pulsar.
Databricks also supports streaming ingestion using built-in connectors. These connectors allow you to incrementally and efficiently process new data as it arrives from streaming sources. See Configure streaming data sources.
Example: Streaming ingestion from Kafka
# Continuously read records from a Kafka topic
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<server:ip>")
    .option("subscribe", "topic1")
    .option("startingOffsets", "latest")
    .load())
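Kafka delivers each record's key and value as binary columns. As a minimal follow-up sketch (the column selection here is an assumption, not part of the documented example), you might decode them before writing the stream to a Delta table:

# Decode the binary Kafka payload into readable columns
parsed = df.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "timestamp"
)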
Databricks recommends using DLT to build reliable and scalable data processing pipelines. DLT supports both batch and streaming ingestion, and you can ingest data from any data source supported by Auto Loader.
Example: Incremental batch ingestion using DLT
import dlt

@dlt.table
def customers():
    # Incrementally ingest new CSV files with Auto Loader
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/databricks-datasets/retail-org/customers/")
    )
Example: Streaming ingestion from Kafka using DLT
import dlt

@dlt.table
def kafka_raw():
    # Continuously ingest records from a Kafka topic
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<server:ip>")
        .option("subscribe", "topic1")
        .option("startingOffsets", "latest")
        .load()
    )
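Downstream tables in the same pipeline can then consume the ingested stream. The following is a minimal sketch assuming you want the Kafka payload decoded to strings; the kafka_parsed table name and column selection are illustrative, not part of the example above:

@dlt.table
def kafka_parsed():
    # Read the ingested stream and decode the binary Kafka payload
    return (
        dlt.read_stream("kafka_raw")
        .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")
    )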
You can ingest data as a one-time operation, on a recurring schedule, or continuously.
- For near real-time streaming use cases, use continuous mode.
- For batch ingestion use cases, ingest one time or set a recurring schedule.
See Ingestion with Jobs and Triggered vs. continuous pipeline mode.
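As a rough Structured Streaming sketch of these modes (the df DataFrame, target table name, and checkpoint paths below are hypothetical placeholders, and you would choose one of the two configurations for a given target table), a triggered run processes whatever has arrived and then stops, while a continuous run keeps the stream alive:

# Triggered run: ingest everything available now, then stop.
# Suitable for one-time or scheduled ingestion jobs.
(df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders_triggered")
    .trigger(availableNow=True)
    .toTable("orders_bronze"))

# Continuous run: stay running and pick up new data as it arrives,
# for near real-time use cases.
(df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders_continuous")
    .trigger(processingTime="1 minute")
    .toTable("orders_bronze"))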
Many third-party tools support batch or streaming ingestion into Databricks. Databricks validates various third-party integrations, although the steps to configure access to source systems and ingest data vary by tool.
Databricks provides a general compute platform. As a result, you can create your own ingestion connectors using any programming language supported by Databricks, like Python or Java. You can also import and leverage popular open source connector libraries like data load tool, Airbyte, and Debezium.
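For example, here is a minimal sketch of a hand-rolled connector, assuming a hypothetical REST endpoint and target table (neither is part of this article):

import requests

# Hypothetical REST endpoint; replace with your source system's API
records = requests.get("https://api.example.com/v1/orders", timeout=30).json()

# spark is the SparkSession provided in Databricks notebooks and jobs.
# Convert the JSON payload to a DataFrame and append it to a Delta table.
df = spark.createDataFrame(records)
df.write.mode("append").saveAsTable("orders_raw")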
Databricks recommends ingestion for most use cases because it scales to accommodate high data volumes, low-latency querying, and third-party API limits. Ingestion copies data from your source systems to Azure Databricks, which results in duplicate data that might become stale over time. If you don't want to copy data, you can use the following tools:
- Lakehouse Federation allows you to query external data sources without moving your data.
- Delta Sharing allows you to securely share data across platforms, clouds, and regions.
If you choose not to copy data, use the following guidance to decide between these tools:
Choose Delta Sharing for the following scenarios:
- Limiting data duplication
- Querying the freshest possible data
Choose Lakehouse Federation for the following scenarios:
- Ad hoc reporting or proof-of-concept work on your ETL pipelines
| Consideration | Guidance |
| --- | --- |
| Data source | If a Lakeflow Connect native connector exists for the data source, it is the simplest way to ingest the data. For data sources not supported by Lakeflow Connect, extract the data from the source and then use Auto Loader to ingest the data into Databricks. For local files, use the Databricks UI to upload the data. |
| Latency | If you want to analyze data in near real time, use streaming to leverage incremental processing. With streaming, the data is available for querying as soon as each record arrives. Otherwise, use batch ingestion. |
| Data movement | If you are unable to copy data from your source systems to Databricks, use Lakehouse Federation or Delta Sharing. |
To learn how to migrate existing data to Delta Lake, see Migrate data to Delta Lake.
The CREATE STREAMING TABLE SQL command is the recommended alternative to the legacy COPY INTO SQL command for incremental ingestion from cloud object storage. See COPY INTO. For a more scalable and robust file ingestion experience, Databricks recommends that SQL users leverage streaming tables instead of COPY INTO.
What is the difference between streaming and incremental batch ingestion?