Use version-controlled source code in an Azure Databricks job

10/07/2024

You can run jobs using notebooks or Python code located in a remote Git repository or a Databricks Git folder. This feature simplifies the creation and management of production jobs and automates continuous deployment:

You don't need to create a separate production repo in Azure Databricks, manage its permissions, and keep it updated.
You can prevent unintentional changes to a production job, such as local edits in the production repo or changes from switching a branch.
The job definition process has a single source of truth in the remote repository, and each job run is linked to a commit hash.

To use source code in a remote Git repository, you must first set up Databricks Git folders (Repos).

Important

Notebooks created by Azure Databricks jobs that run from remote Git repositories are ephemeral and cannot be relied upon to track MLflow runs, experiments, or models. When creating a notebook from a job, use a workspace MLflow experiment (instead of a notebook MLflow experiment) and call mlflow.set_experiment("/path/to/experiment") in the workspace notebook before running any MLflow tracking code. For more details, see Prevent data loss in MLflow experiments.
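The following is a minimal sketch of that pattern, assuming the mlflow package is installed on the job's compute. The function name, the run name, and the logged parameter are hypothetical placeholders and do not come from this article.

```python
def track_in_workspace_experiment(experiment_path: str, run_name: str) -> None:
    """Log a run to a workspace MLflow experiment instead of a notebook experiment."""
    # Imported inside the function so mlflow is only needed when this is called.
    import mlflow

    # Select the workspace experiment before any other MLflow tracking call.
    mlflow.set_experiment(experiment_path)

    # Everything logged inside this block is recorded in the workspace experiment.
    with mlflow.start_run(run_name=run_name):
        mlflow.log_param("source", "remote-git-job")  # placeholder parameter
```

A notebook task could call track_in_workspace_experiment("/path/to/experiment", "nightly") before any other tracking code runs.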

Note

If your job runs using a service principal as the identity, you can configure the service principal on the Git folder containing the job's source code. See Use a service principal with Databricks Git folders.

Use a notebook from a remote Git repository

To create a task with a notebook located in a remote Git repository:

Click Workflows in the sidebar, and then click Create Job or go to an existing job and add a new task.
If this is a new job, replace Add a name for your job… with your job name.
Enter a name for the task in the Task name field.
In the Type drop-down menu, select Notebook.
In the Source drop-down menu, select Git provider and click Edit or Add a git reference. The Git Information dialog appears.
In the Git Information dialog, enter details for the repository, including the repository URL, the Git provider, and the Git reference. The Git reference can be a branch, a tag, or a commit.

For Path, enter a relative path to the notebook location, such as etl/notebooks/.

When you enter the relative path, don't begin it with / or ./, and don't include the notebook file extension, such as .py. For example, if the absolute path for the notebook you want to access is /notebooks/covid_eda_raw.py, enter notebooks/covid_eda_raw in the Path field.
Click Create.
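The path rules above can be sketched as a small helper that turns a path inside the repository into the form the Path field expects. The function name and the set of notebook extensions are assumptions for illustration, so adjust the set to the notebook languages you use.

```python
from pathlib import PurePosixPath

# Assumed set of notebook source extensions; not an official list.
NOTEBOOK_EXTENSIONS = {".py", ".scala", ".sql", ".r"}


def to_notebook_task_path(repo_path: str) -> str:
    """Convert a repository path into the relative, extension-free form."""
    path = repo_path.strip()
    # Drop any leading "/" or "./" so the path is relative to the repository root.
    while path.startswith(("/", "./")):
        path = path[2:] if path.startswith("./") else path[1:]
    if not path:
        raise ValueError("the path must name a notebook inside the repository")
    pure = PurePosixPath(path)
    # Remove the file extension only when it is a known notebook extension.
    if pure.suffix.lower() in NOTEBOOK_EXTENSIONS:
        pure = pure.with_suffix("")
    return str(pure)


print(to_notebook_task_path("/notebooks/covid_eda_raw.py"))  # notebooks/covid_eda_raw
```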

Important

If you work with a Python notebook directly from a source Git repository, the first line of the notebook source file must be # Databricks notebook source. For a Scala notebook, the first line of the source file must be // Databricks notebook source.
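If you generate or hand-edit notebook source files, a helper like the following sketch can add the required first line when it is missing. The function and dictionary names are hypothetical. The two marker strings are the ones given above.

```python
# First-line markers for notebook source files, keyed by notebook language.
NOTEBOOK_MARKERS = {
    "python": "# Databricks notebook source",
    "scala": "// Databricks notebook source",
}


def ensure_notebook_marker(source: str, language: str) -> str:
    """Return the source text with the notebook marker as its first line."""
    marker = NOTEBOOK_MARKERS[language]
    first_line = source.split("\n", 1)[0].rstrip("\r")
    if first_line == marker:
        return source  # the marker is already in place
    return marker + "\n" + source
```

Calling the helper twice on the same text leaves it unchanged after the first call, so it is safe to run as part of a pre-commit step.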

Use Python code from a remote Git repository

To create a task with Python code located in a remote Git repository:

Click Workflows in the sidebar, and then click Create Job or go to an existing job and add a new task.
If this is a new job, replace Add a name for your job… with your job name.
Enter a name for the task in the Task name field.
In the Type drop-down menu, select Python script.
In the Source drop-down menu, select Git provider and click Edit or Add a git reference. The Git Information dialog appears.
In the Git Information dialog, enter details for the repository, including the repository URL, the Git provider, and the Git reference. The Git reference can be a branch, a tag, or a commit.

For Path, enter a relative path to the source location, such as etl/python/python_etl.py.

When you enter the relative path, don't begin it with / or ./. For example, if the absolute path for the Python code you want to access is /python/covid_eda_raw.py, enter python/covid_eda_raw.py in the Path field.
Click Create.
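The same task can also be described as job settings in code. The sketch below builds a settings dictionary using field names as they appear in the Jobs API (git_source, spark_python_task, source). Treat those names as assumptions to verify against the API reference. The job name, task key, repository URL, and branch are hypothetical, and compute settings are omitted.

```python
import json

# Hypothetical values: replace the URL, branch, and names with your own.
job_settings = {
    "name": "python-etl-from-git",
    "git_source": {
        "git_url": "https://github.com/example-org/example-repo",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_python_etl",
            "spark_python_task": {
                # Relative to the repository root: no leading "/" or "./".
                "python_file": "etl/python/python_etl.py",
                "source": "GIT",
            },
            # Compute settings are omitted from this sketch.
        }
    ],
}

print(json.dumps(job_settings, indent=2))
```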

When you view the run history of a task that runs Python code stored in a remote Git repository, the Task run details panel includes Git details, including the commit SHA associated with the run.
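If you retrieve run details programmatically, the commit can be read from the run record. The sketch below assumes the record is a dictionary with the commit under git_source, then git_snapshot, then used_commit. That shape is an assumption to confirm against the Jobs API reference, and the function name is hypothetical.

```python
from typing import Any, Mapping, Optional


def commit_sha_for_run(run: Mapping[str, Any]) -> Optional[str]:
    """Return the commit SHA recorded for a run, or None if it is absent."""
    # Assumed shape: run["git_source"]["git_snapshot"]["used_commit"].
    git_source = run.get("git_source") or {}
    snapshot = git_source.get("git_snapshot") or {}
    return snapshot.get("used_commit")
```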

Use SQL queries from a remote Git repository

To run queries stored in .sql files located in a remote Git repository:

Click Workflows in the sidebar, and then click Create Job or go to an existing job and add a new task.
If this is a new job, replace Add a name for your job… with your job name.
Enter a name for the task in the Task name field.
In the Type drop-down menu, select SQL.
In the SQL task drop-down menu, select File.
In the Source drop-down menu, select Git provider and click Edit or Add a git reference. The Git Information dialog appears.
In the Git Information dialog, enter details for the repository, including the repository URL, the Git provider, and the Git reference. The Git reference can be a branch, a tag, or a commit.

For Path, enter a relative path to the source location, such as queries/sql/myquery.sql.

When you enter the relative path, don't begin it with / or ./. For example, if the absolute path to the source file is /sql/myquery.sql, enter sql/myquery.sql in the Path field.
Select a SQL warehouse. You must select a serverless SQL warehouse or a pro SQL warehouse.
Click Create.
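As a rough sketch, the SQL file task from these steps could be expressed as a task fragment like the one below. The sql_task, file, and warehouse_id field names are assumed from the Jobs API and should be checked against its reference. The function name and the checks it performs are illustrative only.

```python
def sql_file_task(task_key: str, path: str, warehouse_id: str) -> dict:
    """Build a task fragment that runs a .sql file from the job's Git source."""
    # The path is relative to the repository root.
    if path.startswith(("/", "./")):
        raise ValueError("path must not begin with / or ./")
    # The task needs a SQL warehouse (serverless or pro) to run on.
    if not warehouse_id:
        raise ValueError("a SQL warehouse ID is required")
    return {
        "task_key": task_key,
        "sql_task": {
            "file": {"path": path, "source": "GIT"},
            "warehouse_id": warehouse_id,
        },
    }
```

For example, sql_file_task("run_query", "queries/sql/myquery.sql", "<warehouse-id>") returns a fragment that can be placed in a job's task list, where <warehouse-id> stands for the ID of your own warehouse.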

Adding additional tasks from a remote Git repository

Additional tasks in a multitask job can reference the same commit in the remote repository in one of the following ways:

The SHA of $branch/head when git_branch is set
The SHA of $tag when git_tag is set
The value of git_commit

You can mix notebook and Python tasks in an Azure Databricks job, but they must use the same Git reference.
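One way to picture the shared reference: the Git reference is set once for the job, and every task that uses the Git source resolves against it. The sketch below assumes that exactly one of git_branch, git_tag, or git_commit is set. The helper name, repository URL, tag, and task keys are hypothetical, and compute settings are omitted.

```python
from typing import Optional


def build_git_source(url: str, provider: str, *, branch: Optional[str] = None,
                     tag: Optional[str] = None, commit: Optional[str] = None) -> dict:
    """Build a job-level Git source with exactly one Git reference."""
    refs = {"git_branch": branch, "git_tag": tag, "git_commit": commit}
    chosen = {key: value for key, value in refs.items() if value}
    if len(chosen) != 1:
        raise ValueError("set exactly one of branch, tag, or commit")
    return {"git_url": url, "git_provider": provider, **chosen}


# A notebook task and a Python script task that share one Git reference.
mixed_job = {
    "name": "mixed-tasks-from-git",
    "git_source": build_git_source(
        "https://github.com/example-org/example-repo", "gitHub", tag="v1.0.0"
    ),
    "tasks": [
        {"task_key": "explore",
         "notebook_task": {"notebook_path": "notebooks/covid_eda_raw", "source": "GIT"}},
        {"task_key": "etl",
         "spark_python_task": {"python_file": "python/covid_eda_raw.py", "source": "GIT"}},
    ],
}
```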

Use a Databricks Git folder

If you prefer to use the Azure Databricks UI to version control your source code, clone your repository into a Databricks Git folder. For more information, see Option 2: Set up a production Git folder and Git automation.

To add a notebook or Python code from a Git folder in a job task, in the Source drop-down menu, select Workspace and enter the path to the notebook or Python code in Path.

Access notebooks from an IDE

If you need to access notebooks from an integrated development environment, make sure the comment # Databricks notebook source is the first line of the notebook source code file. Azure Databricks adds this line when it exports a Python-language notebook in source-code format, to distinguish the notebook from a regular Python file. When you import a file that begins with this line, Azure Databricks recognizes it and imports it as a notebook, not as a Python module.

Troubleshooting

Note

Git-based jobs do not support write access to workspace files. To write data to a temporary storage location, use driver storage. To write persistent data from a Git job, use a Unity Catalog volume or DBFS.
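As an illustration of that split, the sketch below writes to a temporary directory on the driver's local disk unless a persistent directory is supplied. The function name is hypothetical, and the persistent directory is assumed to be a path you control, such as a Unity Catalog volume path of the form /Volumes/<catalog>/<schema>/<volume>/<directory>.

```python
import tempfile
from pathlib import Path
from typing import Optional


def write_job_output(name: str, data: str, persistent_dir: Optional[str] = None) -> Path:
    """Write output to persistent storage if given, otherwise to temporary storage."""
    if persistent_dir:
        # Assumed to be a persistent location, for example a volume path.
        target_dir = Path(persistent_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
    else:
        # Temporary directory on local disk; its contents do not outlive the compute.
        target_dir = Path(tempfile.mkdtemp(prefix="git_job_"))
    target = target_dir / name
    target.write_text(data, encoding="utf-8")
    return target
```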

Error message:

Run result unavailable: job failed with error message Notebook not found: path-to-your-notebook

Possible causes:

The first line of the notebook source code file is not the comment # Databricks notebook source. Either the comment is missing, or the word notebook is capitalized. It must begin with a lowercase n.
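To tell these two causes apart in a local clone, a check along the following lines may help. It is a sketch for Python notebooks only (Scala notebooks use // instead of #), and the function name and return values are hypothetical.

```python
EXPECTED_MARKER = "# Databricks notebook source"


def diagnose_first_line(first_line: str) -> str:
    """Classify the first line of a Python notebook source file."""
    line = first_line.rstrip("\r\n")
    if line == EXPECTED_MARKER:
        return "ok"
    # Same text with different capitalization, such as "Notebook".
    if line.lower() == EXPECTED_MARKER.lower():
        return "wrong-case"
    return "missing"
```

For example, diagnose_first_line("# Databricks Notebook source") reports the capitalization problem, while a file that starts with an import statement is reported as missing the marker.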