Use dbx with Visual Studio Code
Important
This documentation has been retired and might not be updated.
Databricks recommends that you use Databricks Asset Bundles instead of dbx by Databricks Labs. See What are Databricks Asset Bundles? and Migrate from dbx to bundles.
To use Azure Databricks with Visual Studio Code, see the article Databricks extension for Visual Studio Code.
This article describes a Python-based code sample that you can work with in any Python-compatible IDE. Specifically, this article describes how to work with this code sample in Visual Studio Code, which provides the following developer productivity features:
- Code completion
- Linting
- Testing
- Debugging code objects that do not require a real-time connection to remote Azure Databricks resources.
This article uses dbx by Databricks Labs along with Visual Studio Code to submit the code sample to a remote Azure Databricks workspace. dbx instructs Azure Databricks to schedule and orchestrate workflows that run the submitted code on an Azure Databricks jobs cluster in that workspace.
You can use popular third-party Git providers for version control and continuous integration and continuous delivery or continuous deployment (CI/CD) of your code. For version control, these Git providers include the following:
For CI/CD, dbx supports the following CI/CD platforms:
To demonstrate how version control and CI/CD can work, this article describes how to use Visual Studio Code, dbx, and this code sample, along with GitHub and GitHub Actions.
Code sample requirements
To use this code sample, you must have the following:
- An Azure Databricks workspace in your Azure Databricks account. Create a workspace if you do not already have one.
- A GitHub account. Create a GitHub account, if you do not already have one.
Additionally, on your local development machine, you must have the following:
- Python version 3.8 or above.
  You should use a version of Python that matches the one that is installed on your target clusters. To get the version of Python that is installed on an existing cluster, you can use the cluster's web terminal to run the python --version command. See also the "System environment" section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters. In any case, the version of Python must be 3.8 or above.
  To get the version of Python that is currently referenced on your local machine, run python --version from your local terminal. (Depending on how you set up Python on your local machine, you may need to run python3 instead of python throughout this article.) See also Select a Python interpreter.
- pip.
  pip is automatically installed with newer versions of Python. To check whether pip is already installed, run pip --version from your local terminal. (Depending on how you set up Python or pip on your local machine, you may need to run pip3 instead of pip throughout this article.)
- dbx version 0.8.0 or above. You can install the dbx package from the Python Package Index (PyPI) by running pip install dbx.
  Note
  You do not need to install dbx now. You can install it later in the code sample setup section.
- A method to create Python virtual environments to ensure you are using the correct versions of Python and package dependencies in your dbx projects. This article covers pipenv.
- The Databricks CLI version 0.18 or below, set up with authentication.
  Note
  You do not need to install the legacy Databricks CLI (Databricks CLI version 0.17) now. You can install it later in the code sample setup section. If you want to install it later, you must remember to set up authentication at that time instead.
- The Python extension for Visual Studio Code.
- The GitHub Pull Requests and Issues extension for Visual Studio Code.
- Git.
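The local requirements above can be spot-checked from Python itself. The following is a minimal sketch, not part of the code sample: it assumes the tools are exposed on your PATH under the command names pip, pipenv, git, and code, which may differ on your machine (for example, pip3). The helper names are hypothetical.

```python
import shutil
import sys

# Minimum Python version stated in the requirements above.
MIN_PYTHON = (3, 8)

# Assumed command names; adjust them to match your local setup.
REQUIRED_COMMANDS = ["pip", "pipenv", "git", "code"]


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Missing commands:", missing_commands(REQUIRED_COMMANDS))
```

An empty list of missing commands only shows that the tools can be found, not that their versions are compatible.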
About the code sample
The Python code sample for this article, available in the databricks/ide-best-practices repo in GitHub, does the following:
- Gets data from the owid/covid-19-data repo in GitHub.
- Filters the data for a specific ISO country code.
- Creates a pivot table from the data.
- Performs data cleansing on the data.
- Modularizes the code logic into reusable functions.
- Unit tests the functions.
- Provides dbx project configurations and settings to enable the code to write the data to a Delta table in a remote Azure Databricks workspace.
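To make the filter, pivot, and cleansing steps above concrete, here is a small illustration that uses only the Python standard library. It is a sketch under stated assumptions, not the code from the repo: the column names (iso_code, date, indicator, value), the sample rows, the choice to fill gaps with 0, and all function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; column names are assumptions for illustration.
SAMPLE_CSV = """iso_code,date,indicator,value
USA,2022-01-01,Daily ICU occupancy,100
USA,2022-01-01,Daily hospital occupancy,500
USA,2022-01-02,Daily ICU occupancy,110
CAN,2022-01-01,Daily ICU occupancy,20
"""


def filter_by_country(rows, iso_code):
    """Keep only the rows for one ISO country code."""
    return [row for row in rows if row["iso_code"] == iso_code]


def pivot(rows):
    """Pivot to one record per date, with one key per indicator."""
    table = defaultdict(dict)
    for row in rows:
        table[row["date"]][row["indicator"]] = float(row["value"])
    return dict(table)


def clean(table):
    """Replace spaces in indicator names and fill missing values with 0.0."""
    indicators = sorted({name for record in table.values() for name in record})
    return {
        date: {name.replace(" ", "_"): record.get(name, 0.0) for name in indicators}
        for date, record in table.items()
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
result = clean(pivot(filter_by_country(rows, "USA")))
for date in sorted(result):
    print(date, result[date])
```

Keeping each step in its own function mirrors the modularization goal listed above: each function can be unit tested without a cluster.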
Set up the code sample
After you have the requirements in place for this code sample, complete the following steps to begin using the code sample.
Note
These steps do not include setting up this code sample for CI/CD. You do not need to set up CI/CD to run this code sample. If you want to set up CI/CD later, see Run with GitHub Actions.
Step 1: Create a Python virtual environment
From your terminal, create a blank folder to contain a virtual environment for this code sample. These instructions use a parent folder named ide-demo. You can give this folder any name you want. If you use a different name, replace the name throughout this article. After you create the folder, switch to it, and then start Visual Studio Code from that folder. Be sure to include the dot (.) after the code command.
For Linux and macOS:
mkdir ide-demo
cd ide-demo
code .
Tip
If you get the error command not found: code, see Launching from the command line on the Microsoft website.
For Windows:
md ide-demo
cd ide-demo
code .
In Visual Studio Code, on the menu bar, click View > Terminal.
From the root of the ide-demo folder, run the pipenv command with the following option, where <version> is the target version of Python that you already have installed locally (and, ideally, a version that matches your target clusters' version of Python), for example 3.8.14.
pipenv --python <version>
Make a note of the Virtualenv location value in the output of the pipenv command, as you will need it in the next step.
Select the target Python interpreter, and then activate the Python virtual environment:
- On the menu bar, click View > Command Palette, type Python: Select, and then click Python: Select Interpreter.
- Select the Python interpreter within the path to the Python virtual environment that you just created. (This path is listed as the Virtualenv location value in the output of the pipenv command.)
- On the menu bar, click View > Command Palette, type Terminal: Create, and then click Terminal: Create New Terminal.
- Make sure that the command prompt indicates that you are in the pipenv shell. To confirm, you should see something like (<your-username>) before your command prompt. If you do not see it, run the following command:
  pipenv shell
  To exit the pipenv shell, run the command exit, and the parentheses disappear.
For more information, see Using Python environments in VS Code in the Visual Studio Code documentation.
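If the prompt prefix is hard to spot, the interpreter can also report whether it is running inside a virtual environment. This is a minimal sketch that assumes the environment was created by a tool that sets sys.base_prefix the way the standard venv module does; older virtualenv releases signal this differently, so treat the result as a hint. The function name is hypothetical.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the active prefix differs from the base installation."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    # sys.executable should sit under the Virtualenv location noted earlier.
    print("Interpreter:", sys.executable)
    print("In a virtual environment:", in_virtual_environment())
```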
Step 2: Clone the code sample from GitHub
- In Visual Studio Code, open the ide-demo folder (File > Open Folder), if it is not already open.
- Click View > Command Palette, type Git: Clone, and then click Git: Clone.
- For Provide repository URL or pick a repository source, enter https://github.com/databricks/ide-best-practices
- Browse to your ide-demo folder, and click Select Repository Location.
Step 3: Install the code sample's dependencies
Install a version of dbx and Databricks CLI version 0.18 or below that is compatible with your version of Python. To do this, in Visual Studio Code from your terminal, from your ide-demo folder with a pipenv shell activated (pipenv shell), run the following command:
pip install dbx
Confirm that dbx is installed. To do this, run the following command:
dbx --version
If the version number is returned, dbx is installed.
If the version number is below 0.8.0, upgrade dbx by running the following command, and then check the version number again:
pip install dbx --upgrade
dbx --version

# Or ...
python -m pip install dbx --upgrade
dbx --version
When you install dbx, the legacy Databricks CLI (Databricks CLI version 0.17) is also automatically installed. To confirm that the legacy Databricks CLI (Databricks CLI version 0.17) is installed, run the following command:
databricks --version
If Databricks CLI version 0.17 is returned, the legacy Databricks CLI is installed.
If you have not set up the legacy Databricks CLI (Databricks CLI version 0.17) with authentication, you must do it now. To confirm that authentication is set up, run the following basic command to get some summary information about your Azure Databricks workspace. Be sure to include the forward slash (/) after the ls subcommand:
databricks workspace ls /
If a list of root-level folder names for your workspace is returned, authentication is set up.
Install the Python packages that this code sample depends on. To do this, run the following command from the ide-demo/ide-best-practices folder:
pip install -r unit-requirements.txt
Confirm that the code sample's dependent packages are installed. To do this, run the following command:
pip list
If the packages that are listed in the requirements.txt and unit-requirements.txt files are somewhere in this list, the dependent packages are installed.
Note
The packages listed in requirements.txt are pinned to specific versions. For better compatibility, you can cross-reference these versions with the cluster node type that you want your Azure Databricks workspace to use for running deployments on later. See the "System environment" section for your cluster's Databricks Runtime version in Databricks Runtime release notes versions and compatibility.
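Scanning the pip list output by eye is error-prone, so the comparison can be scripted. The sketch below is an illustration only: it handles just bare names and exact name==version pins, skips option lines such as -r, and does not assume anything about what the sample's requirements files actually contain. The function names are hypothetical; importlib.metadata is part of the standard library in Python 3.8 and above.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: pinned_version_or_None} for simple requirement lines."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and options such as "-r other.txt"
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip() or None
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (wanted, found)} for missing or mismatched packages."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found is None or (wanted is not None and found != wanted):
            problems[name] = (wanted, found)
    return problems
```

For example, check_pins(parse_pins(open("requirements.txt").read())) returns an empty dictionary when every pinned package is installed at the pinned version.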
Step 4: Customize the code sample for your Azure Databricks workspace
Customize the repo's dbx project settings. To do this, in the .dbx/project.json file, change the value of the profile object from DEFAULT to the name of the profile that matches the one that you set up for authentication with the legacy Databricks CLI (Databricks CLI version 0.17). If you did not set up any non-default profile, leave DEFAULT as is. For example:
{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Shared/dbx/covid_analysis",
        "artifact_location": "dbfs:/Shared/dbx/projects/covid_analysis"
      }
    }
  },
  "inplace_jinja_support": false
}
Customize the dbx project's deployment settings. To do this, in the conf/deployment.yml file, change the value of the spark_version and node_type_id objects from 10.4.x-scala2.12 and m6gd.large to the Azure Databricks runtime version string and cluster node type that you want your Azure Databricks workspace to use for running deployments on.
For example, to specify Databricks Runtime 10.4 LTS and a Standard_DS3_v2 node type:
environments:
  default:
    workflows:
      - name: "covid_analysis_etl_integ"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          num_workers: 1
          node_type_id: "Standard_DS3_v2"
        spark_python_task:
          python_file: "file://jobs/covid_trends_job.py"
      - name: "covid_analysis_etl_prod"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          num_workers: 1
          node_type_id: "Standard_DS3_v2"
        spark_python_task:
          python_file: "file://jobs/covid_trends_job.py"
          parameters: ["--prod"]
      - name: "covid_analysis_etl_raw"
        new_cluster:
          spark_version: "10.4.x-scala2.12"
          num_workers: 1
          node_type_id: "Standard_DS3_v2"
        spark_python_task:
          python_file: "file://jobs/covid_trends_job_raw.py"
Tip
In this example, each of these three job definitions has the same spark_version and node_type_id value. You can use different values for different job definitions. You can also create shared values and reuse them across job definitions, to reduce typing errors and code maintenance. See the YAML example in the dbx documentation.
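A mismatch between the profile named in .dbx/project.json and the profiles you configured for the legacy Databricks CLI is an easy mistake to make in this step. The following sketch checks for it. It assumes that the legacy CLI keeps its profiles in an INI-style file (commonly ~/.databrickscfg) with a host key per profile; the function names are hypothetical and the file contents are passed in as text so that the sketch does not depend on your machine.

```python
import configparser
import json


def project_profile(project_json_text, environment="default"):
    """Return the profile name configured for a dbx environment."""
    project = json.loads(project_json_text)
    return project["environments"][environment]["profile"]


def profile_is_configured(profile, databrickscfg_text):
    """Return True if the profile has a host entry in the CLI config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(databrickscfg_text)
    if profile == "DEFAULT":
        # configparser treats [DEFAULT] specially; its keys appear in defaults().
        return "host" in parser.defaults()
    return parser.has_section(profile) and parser.has_option(profile, "host")
```

To use it, read both files into strings, call project_profile on the JSON text, and pass the result to profile_is_configured together with the config text.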
Explore the code sample
After you set up the code sample, use the following information to learn about how the various files in the ide-demo/ide-best-practices folder work.
Code modularization
Unmodularized code
The jobs/covid_trends_job_raw.py file is an unmodularized version of the code logic. You can run this file by itself.
Modularized code
The jobs/covid_trends_job.py file is a modularized version of the code logic. This file relies on the shared code in the covid_analysis/transforms.py file. The covid_analysis/__init__.py file treats the covid_analysis folder as a containing package.
Testing
Unit tests
The tests/testdata.csv file contains a small portion of the data in the covid-hospitalizations.csv file for testing purposes. The tests/transforms_test.py file contains the unit tests for the covid_analysis/transforms.py file.
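For readers new to pytest, the shape of such a unit test is sketched below. This is a hypothetical example and does not reproduce the functions or tests in the repo: clean_column_name is an invented transform, and the test names are illustrative. pytest collects functions whose names start with test_ and reports a failure when an assert statement fails.

```python
import re


def clean_column_name(name):
    """Hypothetical transform: make a column name safe for a table schema."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def test_spaces_become_underscores():
    assert clean_column_name("Daily ICU occupancy") == "daily_icu_occupancy"


def test_punctuation_is_collapsed():
    cleaned = clean_column_name(" Weekly new admissions (per million) ")
    assert cleaned == "weekly_new_admissions_per_million"
```

Because the transform is a plain function with no cluster dependency, tests like these can run locally in seconds.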
Unit test runner
The pytest.ini file contains configuration options for running tests with pytest. See pytest.ini and Configuration Options in the pytest documentation.
The .coveragerc file contains configuration options for Python code coverage measurements with coverage.py. See Configuration reference in the coverage.py documentation.
The requirements.txt file, which is a subset of the unit-requirements.txt file that you installed earlier with pip, contains a list of packages that the unit tests also depend on.
Packaging
The setup.py file provides commands to be run at the console (console scripts), such as the pip command, for packaging Python projects with setuptools. See Entry Points in the setuptools documentation.
Other files
There are other files in this code sample that have not been previously described:
- The .github/workflows folder contains three files, databricks_pull_request_tests.yml, onpush.yml, and onrelease.yml, that represent the GitHub Actions, which are covered later in the GitHub Actions section.
- The .gitignore file contains a list of local folders and files that Git ignores for your repo.
Run the code sample
You can use dbx on your local machine to instruct Azure Databricks to run the code sample in your remote workspace on-demand, as described in the next subsection. Or you can use GitHub Actions to have GitHub run the code sample every time you push code changes to your GitHub repo.
Run with dbx
Install the contents of the covid_analysis folder as a package in Python setuptools development mode by running the following command from the root of your dbx project (for example, the ide-demo/ide-best-practices folder). Be sure to include the dot (.) at the end of this command:
pip install -e .
This command creates a covid_analysis.egg-info folder, which contains information about the compiled version of the covid_analysis/__init__.py and covid_analysis/transforms.py files.
Run the tests by running the following command:
pytest tests/
The tests' results are displayed in the terminal. All four tests should show as passing.
Tip
For additional approaches to testing, including testing for R and Scala notebooks, see Unit testing for notebooks.
Optionally, get test coverage metrics for your tests by running the following command:
coverage run -m pytest tests/
Note
If a message displays that coverage cannot be found, run pip install coverage, and try again.
To view test coverage results, run the following command:
coverage report -m
If all four tests pass, send the dbx project's contents to your Azure Databricks workspace by running the following command:
dbx deploy --environment=default
Information about the project and its runs is sent to the location specified in the workspace_directory object in the .dbx/project.json file.
The project's contents are sent to the location specified in the artifact_location object in the .dbx/project.json file.
Run the pre-production version of the code in your workspace by running the following command:
dbx launch covid_analysis_etl_integ
A link to the run's results is displayed in the terminal. It should look something like this:
https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/12345
Follow this link in your web browser to see the run's results in your workspace.
Run the production version of the code in your workspace, by running the following command:
dbx launch covid_analysis_etl_prod
A link to the run's results is displayed in the terminal. It should look something like this:
https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/23456
Follow this link in your web browser to see the run's results in your workspace.
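The test, deploy, and launch commands above depend on each other: deploying makes sense only if the tests pass, and launching only if the deployment succeeded. A small wrapper can enforce that order. This is a sketch, not part of the code sample; the command strings are copied from the steps above, while run_steps and its runner parameter are hypothetical names introduced here so that the logic can be exercised without calling the real tools.

```python
import shlex
import subprocess

# Commands copied from the steps above, in the order the article runs them.
STEPS = [
    "pytest tests/",
    "dbx deploy --environment=default",
    "dbx launch covid_analysis_etl_integ",
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; return the first command that fails, or None."""
    for command in steps:
        completed = runner(shlex.split(command))
        if completed.returncode != 0:
            return command  # stop here so later steps never see a bad state
    return None
```

Calling run_steps(STEPS) from the root of the dbx project, inside the activated pipenv shell, runs the three commands and returns None only if all of them exit with status 0.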
Run with GitHub Actions
In the project's .github/workflows folder, the onpush.yml and onrelease.yml GitHub Actions files do the following:
- On each push to a tag that begins with v, uses dbx to deploy the covid_analysis_etl_prod job.
- On each push that is not to a tag that begins with v:
  - Uses pytest to run the unit tests.
  - Uses dbx to deploy the file specified in the covid_analysis_etl_integ job to the remote workspace.
  - Uses dbx to launch the already-deployed file specified in the covid_analysis_etl_integ job on the remote workspace, tracing this run until it finishes.
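The branching rule in the list above can be summarized in a few lines of Python. This is only a model of the behavior described in this article, not the contents of the workflow files. It assumes the usual Git ref naming (refs/tags/<name> for tags, refs/heads/<name> for branches); the function name and the action labels are hypothetical.

```python
def planned_actions(git_ref):
    """Return (action, target) pairs for a pushed Git ref, per the list above."""
    if git_ref.startswith("refs/tags/v"):
        # A push to a tag that begins with "v" deploys the production job.
        return [("deploy", "covid_analysis_etl_prod")]
    # Any other push runs the tests, then deploys and launches the integ job.
    return [
        ("unit-tests", "pytest"),
        ("deploy", "covid_analysis_etl_integ"),
        ("launch-and-trace", "covid_analysis_etl_integ"),
    ]


if __name__ == "__main__":
    for ref in ("refs/tags/v1.0.0", "refs/heads/my-branch"):
        print(ref, planned_actions(ref))
```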
Note
An additional GitHub Actions file, databricks_pull_request_tests.yml, is provided for you as a template to experiment with, without impacting the onpush.yml and onrelease.yml GitHub Actions files. You can run this code sample without the databricks_pull_request_tests.yml GitHub Actions file. Its usage is not covered in this article.
The following subsections describe how to set up and run the onpush.yml and onrelease.yml GitHub Actions files.
Set up to use GitHub Actions
Set up your Azure Databricks workspace by following the instructions in Service principals for CI/CD. This includes the following actions:
- Create a service principal.
- Create a Microsoft Entra ID token for the service principal.
As a security best practice, Databricks recommends that you use a Microsoft Entra ID token for a service principal, instead of the Databricks personal access token for your workspace user, for enabling GitHub to authenticate with your Azure Databricks workspace.
After you create the service principal and its Microsoft Entra ID token, stop and make a note of the Microsoft Entra ID token value, which you will use in the next section.
Run GitHub Actions
Step 1: Publish your cloned repo
- In Visual Studio Code, in the sidebar, click the GitHub icon. If the icon is not visible, enable the GitHub Pull Requests and Issues extension through the Extensions view (View > Extensions) first.
- If the Sign In button is visible, click it, and follow the on-screen instructions to sign in to your GitHub account.
- On the menu bar, click View > Command Palette, type Publish to GitHub, and then click Publish to GitHub.
- Select an option to publish your cloned repo to your GitHub account.
Step 2: Add encrypted secrets to your repo
In the GitHub website for your published repo, follow the instructions in Creating encrypted secrets for a repository, for the following encrypted secrets:
- Create an encrypted secret named DATABRICKS_HOST, set to the value of your per-workspace URL, for example https://adb-1234567890123456.7.databricks.azure.cn.
- Create an encrypted secret named DATABRICKS_TOKEN, set to the value of the Microsoft Entra ID token for the service principal.
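A workflow step can fail in confusing ways if one of these secrets is missing or empty. The following sketch shows a pre-flight check; it assumes that the workflow exposes the two secrets to the job as environment variables with the same names, which is a common pattern but is not confirmed by this article. The function names are hypothetical.

```python
import os


def missing_settings(environ=os.environ, names=("DATABRICKS_HOST", "DATABRICKS_TOKEN")):
    """Return the names that are unset or blank in the given environment."""
    return [name for name in names if not environ.get(name, "").strip()]


def host_uses_https(host):
    """Return True if the workspace URL starts with the https scheme."""
    return host.startswith("https://")


if __name__ == "__main__":
    # Report only which names are missing; never print the token value itself.
    print("Missing settings:", missing_settings())
```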
Step 3: Create and publish a branch to your repo
- In Visual Studio Code, in Source Control view (View > Source Control), click the … (Views and More Actions) icon.
- Click Branch > Create Branch From.
- Enter a name for the branch, for example my-branch.
- Select the branch to create the branch from, for example main.
- Make a minor change to one of the files in your local repo, and then save the file. For example, make a minor change to a code comment in the tests/transforms_test.py file.
- In Source Control view, click the … (Views and More Actions) icon again.
- Click Changes > Stage All Changes.
- Click the … (Views and More Actions) icon again.
- Click Commit > Commit Staged.
- Enter a message for the commit.
- Click the … (Views and More Actions) icon again.
- Click Branch > Publish Branch.
Step 4: Create a pull request and merge
- Go to the GitHub website for your published repo, https://github.com/<your-GitHub-username>/ide-best-practices.
- On the Pull requests tab, next to my-branch had recent pushes, click Compare & pull request.
- Click Create pull request.
- On the pull request page, wait for the icon next to CI pipeline / ci-pipeline (push) to display a green check mark. (It may take several minutes for the icon to appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or Details are no longer showing, click Show all checks.
- If the green check mark appears, merge the pull request into the main branch by clicking Merge pull request.