Use dbx to sync local files with remote workspaces in real time
Important
This documentation has been retired and might not be updated.
Databricks recommends that, instead of dbx sync, you use the Databricks CLI version 0.205 or above, which includes functionality similar to dbx sync through the databricks sync command.
The Databricks extension for Visual Studio Code also includes functionality similar to dbx sync, integrated into the Visual Studio Code IDE. Note that dbx sync can synchronize file changes from a local development machine to DBFS, workspace locations, and Databricks Git folders in your Azure Databricks workspaces. The Databricks extension for Visual Studio Code supports synchronizing file changes only to workspace user (/Users) files and Databricks Git folders (/Repos).
Note
This article covers dbx by Databricks Labs, which is provided as-is and is not supported by Databricks through customer technical support channels. Questions and feature requests can be communicated through the Issues page of the databrickslabs/dbx repo on GitHub.
You can perform real-time synchronization of changes to files on your local development machine with their corresponding files in your Azure Databricks workspaces by using dbx by Databricks Labs. These workspace files can be in DBFS or in Databricks Git folders.
Real-time file synchronization with dbx (also known as dbx sync) is useful in rapid code development scenarios. For example, you can use a local integrated development environment (IDE) for productivity features such as syntax highlighting, smart code completion, code linting, and testing and debugging. You can then go immediately to your workspace and run your updated code.
You can use dbx sync by itself, with automated jobs, or with an IDE.
dbx sync development workflows
There are two development workflows for dbx sync, one with DBFS and another with Databricks Git folders.
The typical development workflow with dbx sync and DBFS is:
- Identify a local directory that contains the files you want to synchronize to DBFS.
- Identify the path in DBFS that you want your local directory to synchronize with (or let dbx sync create a default DBFS path for you).
- Run dbx sync dbfs to synchronize your local directory to the DBFS path. dbx sync begins watching your local directory for any file changes.
- Make changes to files in your local directory as needed. dbx sync applies those changes to the corresponding files in the DBFS path in real time.
The typical development workflow with dbx sync and Databricks Git folders is:
- Create a repository with a Git provider that Databricks Git folders supports, if you do not have a repository available already.
- Clone your repo into your Azure Databricks workspace.
- Clone your repo into your local development machine.
- Run dbx sync repo to associate your local cloned repo with your workspace cloned repo. dbx sync begins watching your local directory for any file changes.
- Make changes to files in your local cloned repo as needed. dbx sync applies those changes to the corresponding files in Databricks Git folders in real time.
- Periodically push updated files from the cloned repo in your workspace to your Git provider, so that the repo stays up to date with your Git provider.
Important
dbx sync only performs one-way, real-time synchronization of file changes from your local development machine to your remote workspace. Therefore, Databricks does not recommend that you initiate changes in your Azure Databricks workspace to files that are monitored by dbx sync. If you must make such workspace-initiated file changes, then you must also do the following:
- For file changes in DBFS, make the corresponding changes to the local files manually.
- For file changes in Databricks Git folders, push the file changes from your workspace to your Git provider. Then, on your local development machine, pull those file changes from your Git provider.
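To make the one-way behavior concrete, the following is a minimal Python sketch, not the dbx implementation. It assumes a second local directory stands in for the remote workspace, and the helper names snapshot and push_changes are hypothetical. It pushes only files whose local contents changed since the last pass, never reads anything back from the remote side, and does not handle deletions.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict, List


def snapshot(local_dir: Path) -> Dict[str, str]:
    """Map each local file's relative path to a hash of its contents."""
    return {
        p.relative_to(local_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in local_dir.rglob("*")
        if p.is_file()
    }


def push_changes(local_dir: Path, remote_dir: Path, seen: Dict[str, str]) -> List[str]:
    """Copy files whose local contents changed since `seen`; update `seen` in place.

    `remote_dir` is a local stand-in for the workspace (an assumption of this
    sketch). Nothing is ever copied from `remote_dir` back to `local_dir`.
    """
    current = snapshot(local_dir)
    changed = sorted(rel for rel, digest in current.items() if seen.get(rel) != digest)
    for rel in changed:
        dst = remote_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_dir / rel, dst)
    seen.clear()
    seen.update(current)
    return changed
```

In this sketch, an edit made directly in the stand-in remote directory is never pulled back to the local directory, and it is overwritten the next time the local copy of that file changes. This is why workspace-initiated changes need the manual steps listed above.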
Requirements
If you want to use dbx sync with Databricks Git folders, the following is suggested for your Azure Databricks workspace, although it is not required:
- A clone of your repository with your Git provider.
On your local development machine, you must have the following installed:

- Python version 3.8 or above. To check whether Python is installed, and to check your installed Python version, run python --version in your terminal or PowerShell.

  python --version

  Note
  Some installations of python may require you to use python3 instead of python. If so, substitute python with python3 throughout this article.

- pip. To check whether pip is installed, and to check your installed pip version, run pip --version or python -m pip --version.

  pip --version
  # Or...
  python -m pip --version

  Note
  Some installations of pip may require you to use pip3 instead of pip. If so, substitute pip with pip3 throughout this article.

- dbx version 0.8.0 or above. To check whether dbx is installed, and to check your installed dbx version, run dbx --version. To install dbx from the Python Package Index (PyPI), run pip install dbx or python -m pip install dbx. (dbx includes dbx sync.)

  # Check whether dbx is installed, and check its version.
  dbx --version
  # Install dbx.
  pip install dbx
  # Or...
  python -m pip install dbx

  Note
  For more information about dbx, see dbx by Databricks Labs and the dbx documentation.

- The Databricks CLI version 0.18 or below, set up with authentication. The legacy Databricks CLI (Databricks CLI version 0.17) is automatically installed when you install dbx. This authentication can be set up on your local development machine in one or both of the following locations:
  - Within the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (starting with legacy Databricks CLI version 0.8.0).
  - In an Azure Databricks configuration profile within your .databrickscfg file.

  dbx looks for authentication credentials in these two locations, respectively. dbx uses only the first set of matching credentials that it finds.

  Note
  If you use a .databrickscfg file, dbx sync looks in this file for a configuration profile named DEFAULT by default. To specify a different profile, use the --profile option when you run the dbx sync command, later in this article. dbx does not support the use of a .netrc file for authentication.

- If you want to use dbx sync with Databricks Git folders, a local clone of your repository with your Git provider, while not required, is suggested. To perform a local clone, consult your Git provider's documentation.
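The credential lookup described in the requirements above (environment variables first, then a configuration profile) can be sketched in Python as follows. This is an illustration only, not the code dbx runs. The function name resolve_credentials and its return shape are hypothetical, and the sketch assumes the profile stores its values under the host and token keys.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    env: Mapping[str, str],
    cfg_path: Path,
    profile: str = "DEFAULT",
) -> Optional[Tuple[str, str, str]]:
    """Return (host, token, source) from the first location that has both values."""
    # 1. Environment variables are checked first.
    host = env.get("DATABRICKS_HOST")
    token = env.get("DATABRICKS_TOKEN")
    if host and token:
        return host, token, "environment"

    # 2. Otherwise fall back to the named profile in the configuration file.
    if not cfg_path.is_file():
        return None
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    if profile == "DEFAULT":
        # configparser does not list DEFAULT as a regular section.
        section = parser.defaults()
    elif parser.has_section(profile):
        section = parser[profile]
    else:
        return None
    host, token = section.get("host"), section.get("token")
    if host and token:
        return host, token, f"profile {profile}"
    return None


if __name__ == "__main__":
    found = resolve_credentials(os.environ, Path.home() / ".databrickscfg")
    # Report only where the credentials came from, never the token itself.
    print(found[2] if found else "no credentials found")
```

Running a check like this before you start dbx sync can show which location would supply the credentials, which helps when an environment variable unexpectedly takes precedence over a profile.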
Use DBFS with dbx sync
- From the terminal or PowerShell on your local development machine, change to the directory that contains the files you want to synchronize to DBFS in your Azure Databricks workspace.

- Run the dbx sync command to synchronize your local directory to DBFS in your workspace, as follows. (Do not forget the dot (.) at the end, which represents your current directory.)

  dbx sync dbfs --source .

  Tip
  To specify a different source directory, replace the dot (.) with a different path.

  Note
  If the error Error: No such command 'sync' appears, your installation of dbx is likely out of date. To fix this, run pip install --upgrade dbx==<version> or python -m pip install --upgrade dbx==<version>, where <version> is the latest version of dbx. This version number can be found on the PyPI webpage for dbx.

  pip install --upgrade dbx==<version>
  # Or...
  python -m pip install --upgrade dbx==<version>

- dbx sync begins synchronizing files in your current local directory with files in a DBFS path in your workspace. dbx sync confirms this by printing Target base path followed by the DBFS path, for example:

  /tmp/users/<your-Databricks-username>/<local-directory-name>

  Tip
  To specify a different username or DBFS path, specify the --user and --dest options, respectively, when you run dbx sync.

- Make changes to your local files, as needed.

  Important
  You must keep your terminal or PowerShell open for dbx sync to continue synchronizing. If you close your terminal or PowerShell, dbx sync stops watching for file changes and stops synchronizing. To resume file change synchronization, repeat this procedure from the beginning.

- As needed, verify your file changes in the preceding path in DBFS in your workspace.
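If you want to know where to look in DBFS before you run the command, the example path shown in this procedure can be derived from your username and the name of the source directory. The following Python sketch assumes the default target always follows the /tmp/users/<your-Databricks-username>/<local-directory-name> pattern from the example. The helper name default_dbfs_target is hypothetical and is not part of dbx.

```python
from pathlib import Path, PurePosixPath


def default_dbfs_target(username: str, source_dir: Path) -> str:
    """Build the example default DBFS path for a given user and source directory."""
    # Resolve first so that "." becomes the real directory name.
    local_name = source_dir.resolve().name
    return str(PurePosixPath("/tmp/users") / username / local_name)


if __name__ == "__main__":
    # Placeholder username for illustration only.
    print(default_dbfs_target("someone@example.com", Path(".")))
```

Treat the Target base path that dbx sync prints as authoritative. The sketch only mirrors the documented example.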
Use Databricks Git folders with dbx sync
- From the terminal or PowerShell on your local development machine, change to the root directory that contains the clone of the repository with your Git provider.

- In your Azure Databricks workspace, identify the name of the Databricks Git folder that you want to synchronize your local cloned repo to. You can find this repo name by clicking Git folders in your workspace's sidebar.

- On your local development machine, run the dbx sync command to synchronize your local cloned repository to the Databricks Git folders in your workspace as follows, replacing <your-repo-name> with the name of your repo in Databricks Git folders. (Do not forget the dot (.) at the end, which represents your current directory.)

  dbx sync repo -d <your-repo-name> --source .

  Tip
  To specify a different source directory, replace the dot (.) with a different path.

  Note
  If the error Error: No such command 'sync' appears, your installation of dbx is likely out of date. To fix this, run pip install --upgrade dbx==<version> or python -m pip install --upgrade dbx==<version>, where <version> is the latest version of dbx. This version number can be found on the PyPI webpage for dbx.

  pip install --upgrade dbx==<version>
  # Or...
  python -m pip install --upgrade dbx==<version>

- dbx sync begins synchronizing files in your local cloned repository with files in Databricks Git folders in your workspace. dbx sync confirms this by printing Target base path followed by the Databricks Git folders path, for example:

  /Repos/<your-Databricks-username>/<your-repo-name>

  Tip
  To specify a different username or repo name, specify the --user and --dest-repo options, respectively, when you run dbx sync.

- Make changes to your local files, as needed.

  Important
  You must keep your terminal or PowerShell open for dbx sync to continue synchronizing. If you close your terminal or PowerShell, dbx sync stops watching for file changes and stops synchronizing. To resume file change synchronization, repeat this procedure from the beginning.

- As needed, verify your file changes in Databricks Git folders in your workspace.
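If you start dbx sync from a script, for example as part of an IDE task, it can help to assemble the command from its parts. The following Python sketch builds the argument list for the command shown in this procedure, using only options mentioned in this article (-d, --source, --user, and --profile). The helper name build_sync_repo_command is hypothetical, and placing --user and --profile after the repo subcommand is an assumption of this sketch.

```python
import shlex
from typing import List, Optional


def build_sync_repo_command(
    repo_name: str,
    source: str = ".",
    user: Optional[str] = None,
    profile: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for `dbx sync repo` (hypothetical helper)."""
    command = ["dbx", "sync", "repo", "-d", repo_name, "--source", source]
    if user:
        command += ["--user", user]
    if profile:
        # Assumption: --profile is accepted after the subcommand.
        command += ["--profile", profile]
    return command


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(build_sync_repo_command("my-repo", profile="DEV")))
```

To run the command, you could pass the returned list to subprocess.run. Because dbx sync keeps watching for changes until it is stopped, that call does not return while synchronization is active.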
Additional resources
- dbx documentation
- dbx sync documentation
- databrickslabs/dbx repository on GitHub
- dbx limitations