Develop Delta Live Tables pipelines with Databricks Asset Bundles
Databricks Asset Bundles, also known simply as bundles, enable you to programmatically validate, deploy, and run Azure Databricks resources such as Delta Live Tables pipelines. You can also use bundles to programmatically manage Azure Databricks jobs and to work with MLOps Stacks. See What are Databricks Asset Bundles?.
This article describes a set of steps that you can complete from your local development machine to use a bundle that programmatically manages a Delta Live Tables pipeline.
Requirements
- Databricks CLI version 0.218.0 or above. To check your installed version of the Databricks CLI, run the command
databricks -v
. To install the Databricks CLI, see Install or update the Databricks CLI. - The remote workspace must have workspace files enabled. See What are workspace files?.
(Optional) Install a Python module to support local pipeline development
Databricks provides a Python module to assist your local development of Delta Live Tables pipeline code by providing syntax checking, autocomplete, and data type checking as you write code in your IDE.
The Python module for local development is available on PyPi. To install the module, see Python stub for Delta Live Tables.
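For example, with the module installed, an IDE can type-check and autocomplete pipeline code such as the following minimal sketch. The table and column names here are illustrative, and the code runs only when it is deployed as part of a Delta Live Tables pipeline, not on your local machine.

```python
# Minimal illustrative sketch: with the local stub installed, the IDE can
# resolve the dlt module, check the decorator and dlt.read() signatures,
# and offer autocomplete while you write this code locally.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Short trips filtered from an upstream raw table.")
def filtered_taxis():
    return dlt.read("taxi_raw").where(col("trip_distance") < 10)
```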
Decision: Create the bundle by using a template or manually
Decide whether you want to create the bundle using a template or manually:
Create the bundle by using a template
In these steps, you create the bundle by using the Azure Databricks default bundle template for Python. These steps guide you to create a bundle that consists of a notebook that defines a Delta Live Tables pipeline, which filters data from the original dataset. You then validate, deploy, and run the deployed pipeline within your Azure Databricks workspace.
Step 1: Set up authentication
For more information about how to set up authentication, see Databricks authentication.
Step 2: Create the bundle
A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.
1. Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template's generated bundle.

2. Use the Databricks CLI to run the `bundle init` command:

   ```
   databricks bundle init
   ```

3. For `Template to use`, leave the default value of `default-python` by pressing `Enter`.

4. For `Unique name for this project`, leave the default value of `my_project`, or type a different value, and then press `Enter`. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.

5. For `Include a stub (sample) notebook`, select `no` and press `Enter`. This instructs the Databricks CLI to not add a sample notebook at this point, as the sample notebook that is associated with this option has no Delta Live Tables code in it.

6. For `Include a stub (sample) DLT pipeline`, leave the default value of `yes` by pressing `Enter`. This instructs the Databricks CLI to add a sample notebook that has Delta Live Tables code in it.

7. For `Include a stub (sample) Python package`, select `no` and press `Enter`. This instructs the Databricks CLI to not add sample Python wheel package files or related build instructions to your bundle.
Step 3: Explore the bundle
To view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. Files of particular interest include the following:
- `databricks.yml`: This file specifies the bundle's programmatic name, includes a reference to the pipeline definition, and specifies settings about the target workspace.
- `resources/<project-name>_job.yml` and `resources/<project-name>_pipeline.yml`: These files specify the job that refreshes the pipeline, and the pipeline's settings.
- `src/dlt_pipeline.ipynb`: This file is a notebook that, when run, executes the pipeline.
For customizing pipelines, the mappings within a pipeline declaration correspond to the create pipeline operation's request payload as defined in POST /api/2.0/pipelines in the REST API reference, expressed in YAML format.
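For example, a customization of the template's generated `resources/<project-name>_pipeline.yml` might look like the following sketch. The values shown (notebook path, cluster size, channel, and so on) are illustrative only; any setting accepted by the create pipeline request payload can be expressed here in YAML.

```yaml
# Hypothetical pipeline customization; field values are illustrative.
resources:
  pipelines:
    my_project_pipeline:
      name: my_project_pipeline
      development: true
      continuous: false
      channel: "CURRENT"
      libraries:
        - notebook:
            path: ../src/dlt_pipeline.ipynb
      clusters:
        - label: "default"
          num_workers: 1
```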
Step 4: Validate the project's bundle configuration file
In this step, you check whether the bundle configuration is valid.
From the root directory, use the Databricks CLI to run the `bundle validate` command, as follows:

```
databricks bundle validate
```
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.
Step 5: Deploy the local project to the remote workspace
In this step, you deploy the local notebook to your remote Azure Databricks workspace and create the Delta Live Tables pipeline within your workspace.
1. Use the Databricks CLI to run the `bundle deploy` command, as follows:

   ```
   databricks bundle deploy -t dev
   ```
2. Check whether the local notebook was deployed: In your Azure Databricks workspace's sidebar, click Workspace.

3. Click into the Users > `<your-username>` > .bundle > `<project-name>` > dev > files > src folder. The notebook should be in this folder.

4. Check whether the pipeline was created: In your Azure Databricks workspace's sidebar, click Delta Live Tables.

5. On the Delta Live Tables tab, click [dev `<your-username>`] `<project-name>`_pipeline.
If you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle configuration is still valid and then redeploy the project.
Step 6: Run the deployed project
In this step, you run the Delta Live Tables pipeline in your workspace.
1. From the root directory, use the Databricks CLI to run the `bundle run` command, as follows, replacing `<project-name>` with the name of your project from Step 2:

   ```
   databricks bundle run -t dev <project-name>_pipeline
   ```

2. Copy the value of `Update URL` that appears in your terminal and paste this value into your web browser to open your Azure Databricks workspace.

3. In your Azure Databricks workspace, after the pipeline completes successfully, click the taxi_raw view and the filtered_taxis materialized view to see the details.
If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
Step 7: Clean up
In this step, you delete the deployed notebook and the pipeline from your workspace.
1. From the root directory, use the Databricks CLI to run the `bundle destroy` command, as follows:

   ```
   databricks bundle destroy -t dev
   ```

2. Confirm the pipeline deletion request: When prompted to permanently destroy resources, type `y` and press `Enter`.

3. Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type `y` and press `Enter`.

4. If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 2.
You have reached the end of the steps for creating a bundle by using a template.
Create the bundle manually
In these steps, you create the bundle from the beginning. These steps guide you to create a bundle that consists of a notebook with embedded Delta Live Tables directives and the definition of a Delta Live Tables pipeline to run this notebook. You then validate, deploy, and run the deployed notebook from the pipeline within your Azure Databricks workspace.
Step 1: Create the bundle
A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.
- Create or identify an empty directory on your development machine.
- Switch to the empty directory in your terminal, or open the empty directory in your IDE.
Tip
Your empty directory could be associated with a cloned repository managed by a Git provider. This enables you to manage your bundle with external version control and to collaborate more easily with other developers and IT professionals on your project. However, to help simplify this demonstration, a cloned repo is not used here.
If you choose to clone a repo for this demo, Databricks recommends that the repo is empty or has only basic files in it, such as `README` and `.gitignore`. Otherwise, any pre-existing files in the repo might be unnecessarily synchronized to your Azure Databricks workspace.
Step 2: Add a notebook to the project
In this step, you add a notebook to your project. This notebook does the following:
- Reads raw JSON clickstream data from Databricks datasets into a raw Delta table in the `pipelines` folder inside of your Azure Databricks workspace's DBFS root folder.
- Reads records from the raw Delta table and uses a Delta Live Tables query and expectations to create a new Delta table with cleaned and prepared data.
- Performs an analysis of the prepared data in the new Delta table with a Delta Live Tables query.
1. From the directory's root, create a file with the name `dlt-wikipedia-python.py`.

2. Add the following code to the `dlt-wikipedia-python.py` file:

   ```python
   # Databricks notebook source
   import dlt
   from pyspark.sql.functions import *

   # COMMAND ----------
   json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"

   # COMMAND ----------
   @dlt.table(
     comment="The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
   )
   def clickstream_raw():
     return (spark.read.format("json").load(json_path))

   # COMMAND ----------
   @dlt.table(
     comment="Wikipedia clickstream data cleaned and prepared for analysis."
   )
   @dlt.expect("valid_current_page_title", "current_page_title IS NOT NULL")
   @dlt.expect_or_fail("valid_count", "click_count > 0")
   def clickstream_prepared():
     return (
       dlt.read("clickstream_raw")
         .withColumn("click_count", expr("CAST(n AS INT)"))
         .withColumnRenamed("curr_title", "current_page_title")
         .withColumnRenamed("prev_title", "previous_page_title")
         .select("current_page_title", "click_count", "previous_page_title")
     )

   # COMMAND ----------
   @dlt.table(
     comment="A table containing the top pages linking to the Apache Spark page."
   )
   def top_spark_referrers():
     return (
       dlt.read("clickstream_prepared")
         .filter(expr("current_page_title == 'Apache_Spark'"))
         .withColumnRenamed("previous_page_title", "referrer")
         .sort(desc("click_count"))
         .select("referrer", "click_count")
         .limit(10)
     )
   ```
Step 3: Add a bundle configuration schema file to the project
If you use an IDE such as Visual Studio Code, PyCharm Professional, or IntelliJ IDEA Ultimate that provides support for YAML files and JSON schema files, you can use your IDE not only to create the bundle configuration schema file but also to check your project's bundle configuration file's syntax and formatting and to provide code completion hints, as follows. Note that while the bundle configuration file that you create later in Step 5 is YAML-based, the bundle configuration schema file in this step is JSON-based.
Visual Studio Code
Add YAML language server support to Visual Studio Code, for example by installing the YAML extension from the Visual Studio Code Marketplace.
Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows:

```
databricks bundle schema > bundle_config_schema.json
```

Note that later in Step 5, you will add the following comment to the beginning of your bundle configuration file, which associates your bundle configuration file with the specified JSON schema file:

```
# yaml-language-server: $schema=bundle_config_schema.json
```
Note
In the preceding comment, if your Databricks Asset Bundle configuration JSON schema file is in a different path, replace `bundle_config_schema.json` with the full path to your schema file.
PyCharm Professional
Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows:

```
databricks bundle schema > bundle_config_schema.json
```
Configure PyCharm to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.
Note that later in Step 5, you will use PyCharm to create or open a bundle configuration file. By convention, this file is named `databricks.yml`.
IntelliJ IDEA Ultimate
Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` within the current directory, as follows:

```
databricks bundle schema > bundle_config_schema.json
```
Configure IntelliJ IDEA to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.
Note that later in Step 5, you will use IntelliJ IDEA to create or open a bundle configuration file. By convention, this file is named `databricks.yml`.
Step 4: Set up authentication
For more information about how to set up authentication, see Databricks authentication.
Step 5: Add a bundle configuration file to the project
In this step, you define how you want to deploy and run this notebook. For this demo, you want to use a Delta Live Tables pipeline to run the notebook. You model this objective within a bundle configuration file in your project.
- From the directory's root, use your favorite text editor or your IDE to create the bundle configuration file. By convention, this file is named `databricks.yml`.
- Add the following code to the `databricks.yml` file, replacing `<workspace-url>` with your per-workspace URL, for example `https://adb-1234567890123456.7.databricks.azure.cn`. This URL must match the one in your `.databrickscfg` file:
Tip
The first line, starting with `# yaml-language-server`, is required only if your IDE supports it. See Step 3 earlier for details.
```yaml
# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: dlt-wikipedia

resources:
  pipelines:
    dlt-wikipedia-pipeline:
      name: dlt-wikipedia-pipeline
      development: true
      continuous: false
      channel: "CURRENT"
      photon: false
      libraries:
        - notebook:
            path: ./dlt-wikipedia-python.py
      edition: "ADVANCED"
      clusters:
        - label: "default"
          num_workers: 1

targets:
  development:
    workspace:
      host: <workspace-url>
```
For customizing pipelines, the mappings within a pipeline declaration correspond to the create pipeline operation's request payload as defined in POST /api/2.0/pipelines in the REST API reference, expressed in YAML format.
Step 6: Validate the project's bundle configuration file
In this step, you check whether the bundle configuration is valid.
Use the Databricks CLI to run the `bundle validate` command, as follows:

```
databricks bundle validate
```
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.
Step 7: Deploy the local project to the remote workspace
In this step, you deploy the local notebook to your remote Azure Databricks workspace and create the Delta Live Tables pipeline in your workspace.
1. Use the Databricks CLI to run the `bundle deploy` command, as follows:

   ```
   databricks bundle deploy -t development
   ```
2. Check whether the local notebook was deployed: In your Azure Databricks workspace's sidebar, click Workspace.

3. Click into the Users > `<your-username>` > .bundle > dlt-wikipedia > development > files folder. The notebook should be in this folder.

4. Check whether the Delta Live Tables pipeline was created: In your Azure Databricks workspace's sidebar, click Workflows.

5. On the Delta Live Tables tab, click dlt-wikipedia-pipeline.
If you make any changes to your bundle after this step, you should repeat steps 6-7 to check whether your bundle configuration is still valid and then redeploy the project.
Step 8: Run the deployed project
In this step, you run the Delta Live Tables pipeline in your workspace.
1. Use the Databricks CLI to run the `bundle run` command, as follows:

   ```
   databricks bundle run -t development dlt-wikipedia-pipeline
   ```

2. Copy the value of `Update URL` that appears in your terminal and paste this value into your web browser to open your Azure Databricks workspace.

3. In your Azure Databricks workspace, after the Delta Live Tables pipeline completes successfully and shows green title bars across the various materialized views, click the clickstream_raw, clickstream_prepared, or top_spark_referrers materialized views to see more details.
Before you start the next step to clean up, note the location of the Delta tables created in DBFS as follows. You will need this information if you want to manually clean up these Delta tables later:
- With the Delta Live Tables pipeline still open, click the Settings button (next to the Permissions and Schedule buttons).
- In the Destination area, note the value of the Storage location field. This is where the Delta tables were created in DBFS.
If you make any changes to your bundle after this step, you should repeat steps 6-8 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
Step 9: Clean up
In this step, you delete the deployed notebook and the Delta Live Tables pipeline from your workspace.
1. Use the Databricks CLI to run the `bundle destroy` command, as follows:

   ```
   databricks bundle destroy
   ```

2. Confirm the Delta Live Tables pipeline deletion request: When prompted to permanently destroy resources, type `y` and press `Enter`.

3. Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type `y` and press `Enter`.
Running the `bundle destroy` command deletes only the deployed Delta Live Tables pipeline and the folder containing the deployed notebook. This command does not delete any side effects, such as the Delta tables that the notebook created in DBFS. If you need to delete these Delta tables, you must do so manually.
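If you do need to remove them, one option is to delete the pipeline's storage location that you noted in Step 8. The following is a minimal sketch to run in a workspace notebook; the path is a placeholder for the Storage location value you recorded, and running it permanently deletes that data.

```python
# Hypothetical manual cleanup: remove the pipeline's DBFS storage location,
# which contains the Delta tables that the pipeline created.
# Replace the placeholder with the Storage location you noted in Step 8.
storage_location = "dbfs:/pipelines/<storage-location-noted-in-step-8>"

# dbutils is available in Databricks notebooks; recurse=True deletes the
# directory and all of its contents.
dbutils.fs.rm(storage_location, recurse=True)
```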
Add an existing pipeline definition to a bundle
You can use an existing Delta Live Tables pipeline definition as a basis to define a new pipeline in a bundle configuration file. To do this, complete the following steps.
Note
The following steps create a new pipeline that has the same settings as the existing pipeline. However, the new pipeline has a different pipeline ID than the existing pipeline. You cannot automatically import an existing pipeline ID into a bundle.
Step 1: Get the existing pipeline definition in JSON format
In this step, you use the Azure Databricks workspace user interface to get the JSON representation of the existing pipeline definition.
- In your Azure Databricks workspace's sidebar, click Workflows.
- On the Delta Live Tables tab, click your pipeline's Name link.
- Between the Permissions and Schedule buttons, click the Settings button.
- Click the JSON button.
- Copy the pipeline definition's JSON.
Step 2: Convert the pipeline definition from JSON to YAML format
The pipeline definition that you copied from the previous step is in JSON format. Bundle configurations are in YAML format. You must convert the pipeline definition from JSON to YAML format. Databricks recommends the following resources for converting JSON to YAML:
- Convert JSON to YAML online.
- For Visual Studio Code, the json2yaml extension.
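If you prefer to convert the JSON locally instead, the following is a minimal sketch that uses Python's built-in json module and PyYAML (install it with pip install pyyaml). The file names are placeholders for wherever you saved the copied JSON.

```python
# Minimal local JSON-to-YAML conversion sketch; file names are placeholders.
import json

import yaml  # from the PyYAML package

# Read the pipeline definition JSON that you copied in Step 1.
with open("pipeline-definition.json") as f:
    definition = json.load(f)

# Write the equivalent YAML, preserving the original key order.
with open("pipeline-definition.yml", "w") as f:
    yaml.safe_dump(definition, f, sort_keys=False)
```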
Step 3: Add the pipeline definition YAML to a bundle configuration file
In your bundle configuration file, add the YAML that you copied from the previous step to one of the following locations, labelled `<pipeline-yaml-can-go-here>`, as follows:
```yaml
resources:
  pipelines:
    <some-unique-programmatic-identifier-for-this-pipeline>:
      <pipeline-yaml-can-go-here>
```

```yaml
targets:
  <some-unique-programmatic-identifier-for-this-target>:
    resources:
      pipelines:
        <some-unique-programmatic-identifier-for-this-pipeline>:
          <pipeline-yaml-can-go-here>
```
Step 4: Add notebooks, Python files, and other artifacts to the bundle
Any Python files and notebooks that are referenced in the existing pipeline should be moved into the bundle's sources.
For better compatibility with bundles, notebooks should use the IPython notebook format (`.ipynb`). If you develop the bundle locally, you can export an existing notebook from an Azure Databricks workspace into the `.ipynb` format by clicking File > Export > IPython Notebook from the Azure Databricks notebook user interface. By convention, you should then put the downloaded notebook into the `src/` directory in your bundle.
After you add your notebooks, Python files, and other artifacts to the bundle, make sure that your pipeline definition references them. For example, for a notebook named `hello.ipynb` in a `src/` directory that sits in the same folder as the bundle configuration file, the pipeline definition might be expressed as follows:
```yaml
resources:
  pipelines:
    hello-pipeline:
      name: hello-pipeline
      libraries:
        - notebook:
            path: ./src/hello.ipynb
```
Step 5: Validate, deploy, and run the new pipeline
1. Validate that the bundle's configuration files are syntactically correct by running the following command:

   ```
   databricks bundle validate
   ```

2. Deploy the bundle by running the following command. In this command, replace `<target-identifier>` with the unique programmatic identifier for the target from the bundle configuration:

   ```
   databricks bundle deploy -t <target-identifier>
   ```

3. Run the pipeline by running the following command. In this command, replace the following:

   - Replace `<target-identifier>` with the unique programmatic identifier for the target from the bundle configuration.
   - Replace `<pipeline-identifier>` with the unique programmatic identifier for the pipeline from the bundle configuration.

   ```
   databricks bundle run -t <target-identifier> <pipeline-identifier>
   ```