Add tasks to jobs in Databricks Asset Bundles

This page provides information about how to define job tasks in Databricks Asset Bundles. For information about job tasks, see Configure and edit tasks in Lakeflow Jobs.

Important

The job git_source field and task source field set to GIT are not recommended for bundles, because local relative paths may not point to the same content in the Git repository. Bundles expect that a deployed job has the same files as the local copy from where it was deployed.

Instead, clone the repository locally and set up your bundle project within this repository, so that the source for tasks is the workspace.

Configure tasks

Define tasks for a job in a bundle using the tasks key within the job definition. Examples of task configuration for the available task types are in the Task settings section. For information about defining a job in a bundle, see job.

Tip

To quickly generate resource configuration for an existing job using the Databricks CLI, you can use the bundle generate job command. See bundle commands.

To set task values, most job task types have task-specific parameters, but you can also define job parameters that are passed to tasks. Dynamic value references are supported for job parameters, which enables values that are specific to a job run to be passed between tasks. For complete information about how to pass task values by task type, see Details by task type.

You can also override general job task settings with settings for a target workspace. See Override with target settings.
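
For example, the following sketch overrides a task's notebook path only in a dev target; the job key, task key, target name, and paths are placeholders for illustration. Task settings defined under a target are merged with the base job definition by task_key.

targets:
  dev:
    resources:
      jobs:
        my_job:
          tasks:
            - task_key: my_notebook_task
              notebook_task:
                notebook_path: ../src/dev_notebook.ipynb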

The following example configuration defines a job with two notebook tasks, and passes a task value from the first task to the second task.

resources:
  jobs:
    pass_task_values_job:
      name: pass_task_values_job
      tasks:
        # Output task
        - task_key: output_value
          notebook_task:
            notebook_path: ../src/output_notebook.ipynb

        # Input task
        - task_key: input_value
          depends_on:
            - task_key: output_value
          notebook_task:
            notebook_path: ../src/input_notebook.ipynb
            base_parameters:
              received_message: '{{tasks.output_value.values.message}}'

The output_notebook.ipynb contains the following code, which sets a task value for the message key:

# Databricks notebook source
# This first task sets a simple output value.

message = "Hello from the first task"

# Set the message to be used by other tasks
dbutils.jobs.taskValues.set(key="message", value=message)

print(f"Produced message: {message}")

The input_notebook.ipynb retrieves the value of the received_message parameter, which was set in the configuration for the task:

# This notebook receives the message as a parameter.

dbutils.widgets.text("received_message", "")
received_message = dbutils.widgets.get("received_message")

print(f"Received message: {received_message}")

Task settings

This section contains settings and examples for each job task type.

Condition task

The condition_task enables you to add a task with if/else conditional logic to your job. The task evaluates a condition that can be used to control the execution of other tasks. The condition task does not require a cluster to execute and does not support retries or notifications. For more information about the if/else condition task, see Add branching logic to a job with the If/else task.

The following keys are available for a condition task. For the corresponding REST API object definition, see condition_task.

Key Type Description
left String Required. The left operand of the condition. Can be a string value, a job state, or a dynamic value reference such as {{job.repair_count}} or {{tasks.task_key.values.output}}.
op String Required. The operator to use for comparison. Valid values are: EQUAL_TO, NOT_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL.
right String Required. The right operand of the condition. Can be a string value, a job state, or a dynamic value reference.

Examples

The following example contains a condition task and a notebook task, where the notebook task only executes if the number of job repairs is less than 5.

resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: condition_task
          condition_task:
            op: LESS_THAN
            left: '{{job.repair_count}}'
            right: '5'
        - task_key: notebook_task
          depends_on:
            - task_key: condition_task
              outcome: 'true'
          notebook_task:
            notebook_path: ../src/notebook.ipynb

Dashboard task

You use this task to refresh a dashboard and send a snapshot to subscribers. For more information about dashboards in bundles, see dashboard.

The following keys are available for a dashboard task. For the corresponding REST API object definition, see dashboard_task.

Key Type Description
dashboard_id String Required. The identifier of the dashboard to be refreshed. The dashboard must already exist.
subscription Map The subscription configuration for sending the dashboard snapshot. Specifies destination settings for where snapshots are sent after the dashboard refresh completes. See subscription.
warehouse_id String The warehouse ID to execute the dashboard with for the schedule. If not specified, the default warehouse of the dashboard will be used.

Examples

The following example adds a dashboard task to a job. When the job is run, the dashboard with the specified ID is refreshed.

resources:
  jobs:
    my-dashboard-job:
      name: my-dashboard-job
      tasks:
        - task_key: my-dashboard-task
          dashboard_task:
            dashboard_id: 11111111-1111-1111-1111-111111111111
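
To refresh the dashboard with a specific SQL warehouse instead of the dashboard's default warehouse, also set warehouse_id. The following sketch uses placeholder IDs:

resources:
  jobs:
    my-dashboard-job:
      name: my-dashboard-job
      tasks:
        - task_key: my-dashboard-task
          dashboard_task:
            dashboard_id: 11111111-1111-1111-1111-111111111111
            warehouse_id: 1a111111a1111aa1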

dbt task

You use this task to run one or more dbt commands. For more information about dbt, see Connect to dbt Cloud.

The following keys are available for a dbt task. For the corresponding REST API object definition, see dbt_task.

Key Type Description
catalog String The name of the catalog to use. The catalog value can only be specified if a warehouse_id is specified. This field requires dbt-databricks >= 1.1.1.
commands Sequence Required. A list of dbt commands to execute in sequence. Each command must be a complete dbt command (e.g., dbt deps, dbt seed, dbt run, dbt test). A maximum of 10 commands can be provided.
profiles_directory String The path to the directory containing the dbt profiles.yml file. Can only be specified if no warehouse_id is specified. If no warehouse_id is specified and this folder is unset, the root directory is used.
project_directory String The path to the directory containing the dbt project. If not specified, defaults to the root of the repository or workspace directory. For projects stored in the Databricks workspace, the path must be absolute and begin with a slash. For projects in a remote repository, the path must be relative.
schema String The schema to write to. This parameter is only used when a warehouse_id is also provided. If not provided, the default schema is used.
source String The location type of the dbt project. Valid values are WORKSPACE and GIT. When set to WORKSPACE, the project will be retrieved from the Databricks workspace. When set to GIT, the project will be retrieved from a Git repository defined in git_source. If empty, the task uses GIT if git_source is defined and WORKSPACE otherwise.
warehouse_id String The ID of the SQL warehouse to use for running dbt commands. If not specified, the default warehouse will be used.

Examples

The following example adds a dbt task to a job. This dbt task uses the specified SQL warehouse to run the specified dbt commands.

To get a SQL warehouse's ID, open the SQL warehouse's settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

Tip

Databricks Asset Bundles also include a dbt-sql project template that defines a job with a dbt task, as well as dbt profiles for deployed dbt jobs. For information about Databricks Asset Bundles templates, see Default bundle templates.

resources:
  jobs:
    my-dbt-job:
      name: my-dbt-job
      tasks:
        - task_key: my-dbt-task
          dbt_task:
            commands:
              - 'dbt deps'
              - 'dbt seed'
              - 'dbt run'
            project_directory: /Users/someone@example.com/Testing
            warehouse_id: 1a111111a1111aa1
          libraries:
            - pypi:
                package: 'dbt-databricks>=1.0.0,<2.0.0'
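
If a warehouse_id is specified, you can also set catalog and schema to control where dbt writes output. The following sketch uses placeholder catalog and schema names, and pins dbt-databricks to 1.1.1 or later because the catalog field requires it:

resources:
  jobs:
    my-dbt-job:
      name: my-dbt-job
      tasks:
        - task_key: my-dbt-task
          dbt_task:
            commands:
              - 'dbt deps'
              - 'dbt run'
            project_directory: /Users/someone@example.com/Testing
            catalog: main
            schema: default
            warehouse_id: 1a111111a1111aa1
          libraries:
            - pypi:
                package: 'dbt-databricks>=1.1.1,<2.0.0'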

For each task

The for_each_task enables you to add a task with a for each loop to your job. The task executes a nested task for every input provided. For more information about the for_each_task, see Use a For each task to run another task in a loop.

The following keys are available for a for_each_task. For the corresponding REST API object definition, see for_each_task.

Key Type Description
concurrency Integer The maximum number of task iterations that can run concurrently. If not specified, all iterations may run in parallel subject to cluster and workspace limits.
inputs String Required. The input data for the loop. This can be a JSON string or a reference to an array parameter. Each element in the array will be passed to one iteration of the nested task.
task Map Required. The nested task definition to execute for each input. This object contains the complete task specification including task_key and the task type (e.g., notebook_task, python_wheel_task, etc.).

Examples

The following example adds a for_each_task to a job, where it loops over the values of another task and processes them.

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: generate_countries_list
          notebook_task:
            notebook_path: ../src/generate_countries_list.ipynb
        - task_key: process_countries
          depends_on:
            - task_key: generate_countries_list
          for_each_task:
            inputs: '{{tasks.generate_countries_list.values.countries}}'
            task:
              task_key: process_countries_iteration
              notebook_task:
                notebook_path: ../src/process_countries_notebook.ipynb
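
To limit how many iterations run at the same time, set concurrency on the for_each_task. The following sketch is the same loop limited to two concurrent iterations:

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: generate_countries_list
          notebook_task:
            notebook_path: ../src/generate_countries_list.ipynb
        - task_key: process_countries
          depends_on:
            - task_key: generate_countries_list
          for_each_task:
            concurrency: 2
            inputs: '{{tasks.generate_countries_list.values.countries}}'
            task:
              task_key: process_countries_iteration
              notebook_task:
                notebook_path: ../src/process_countries_notebook.ipynb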

JAR task

You use this task to run a JAR. You can reference local JAR libraries or those in a workspace, a Unity Catalog volume, or an external cloud storage location. See JAR file (Java or Scala).

For details on how to compile and deploy Scala JAR files on a Unity Catalog-enabled cluster in standard access mode, see Deploy Scala JARs on Unity Catalog clusters.

The following keys are available for a JAR task. For the corresponding REST API object definition, see spark_jar_task.

Key Type Description
jar_uri String Deprecated. The URI of the JAR to be executed. DBFS and cloud storage paths are supported. This field is deprecated and should not be used. Instead, use the libraries field to specify JAR dependencies.
main_class_name String Required. The full name of the class containing the main method to be executed. This class must be contained in a JAR provided as a library. The code must use SparkContext.getOrCreate to obtain a Spark context; otherwise, runs of the job fail.
parameters Sequence The parameters passed to the main method. Use task parameter variables to set parameters containing information about job runs.

Examples

The following example adds a JAR task to a job. The path for the JAR is to a volume location.

resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar
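
To pass arguments to the main method, add parameters to the spark_jar_task. The following sketch uses placeholder argument values:

resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
            parameters:
              - '--input'
              - '/Volumes/main/default/my-volume/input'
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar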

Notebook task

You use this task to run a notebook. See Notebook task for jobs.

The following keys are available for a notebook task. For the corresponding REST API object definition, see notebook_task.

Key Type Description
base_parameters Map The base parameters to use for each run of this job.
  • If the run is initiated by a call to run-now with parameters specified, the two parameters maps are merged.
  • If the same key is specified in base_parameters and in run-now, the value from run-now is used. Use task parameter variables to set parameters containing information about job runs.
  • If the notebook takes a parameter that is not specified in the job's base_parameters or the run-now override parameters, the default value from the notebook is used. Retrieve these parameters in a notebook using dbutils.widgets.get.
notebook_path String Required. The path of the notebook in the Databricks workspace or remote repository, for example /Users/user.name@databricks.com/notebook_to_run. For notebooks stored in the Databricks workspace, the path must be absolute and begin with a slash. For notebooks stored in a remote repository, the path must be relative.
source String Location type of the notebook. Valid values are WORKSPACE and GIT. When set to WORKSPACE, the notebook will be retrieved from the local Databricks workspace. When set to GIT, the notebook will be retrieved from a Git repository defined in git_source. If the value is empty, the task will use GIT if git_source is defined and WORKSPACE otherwise.
warehouse_id String The ID of the warehouse to run the notebook on. Classic SQL warehouses are not supported; use serverless or pro SQL warehouses instead. Note that SQL warehouses only support SQL cells, so if the notebook contains non-SQL cells, the run fails. To run Python or other non-SQL cells, use serverless compute instead of a SQL warehouse.

Examples

The following example adds a notebook task to a job and sets a job parameter named my_job_run_id. The path for the notebook to deploy is relative to the configuration file in which this task is declared. The task gets the notebook from its deployed location in the Databricks workspace.

resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      parameters:
        - name: my_job_run_id
          default: '{{job.run_id}}'
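
To pass widget values directly to the notebook instead of through job parameters, set base_parameters on the notebook_task. The following sketch uses a placeholder parameter name, which the notebook reads with dbutils.widgets.get:

resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
            base_parameters:
              environment: dev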

Pipeline task

You use this task to run a pipeline. See Lakeflow Declarative Pipelines.

The following keys are available for a pipeline task. For the corresponding REST API object definition, see pipeline_task.

Key Type Description
full_refresh Boolean If true, a full refresh of the pipeline will be triggered, which will completely recompute all datasets in the pipeline. If false or omitted, only incremental data will be processed. For details, see Pipeline refresh semantics.
pipeline_id String Required. The ID of the pipeline to run. The pipeline must already exist.

Examples

The following example adds a pipeline task to a job. This task runs the specified pipeline.

Tip

You can get a pipeline's ID by opening the pipeline in the workspace and copying the Pipeline ID value on the Pipeline details tab of the pipeline's settings page.

resources:
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            pipeline_id: 11111111-1111-1111-1111-111111111111
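
To recompute all datasets in the pipeline instead of processing only incremental data, set full_refresh. The following sketch uses a placeholder pipeline ID:

resources:
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            pipeline_id: 11111111-1111-1111-1111-111111111111
            full_refresh: true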

Python script task

You use this task to run a Python file.

The following keys are available for a Python script task. For the corresponding REST API object definition, see spark_python_task.

Key Type Description
parameters Sequence The parameters to pass to the Python file. Use task parameter variables to set parameters containing information about job runs.
python_file String Required. The URI of the Python file to be executed, for example /Users/someone@example.com/my-script.py. For Python files stored in the Databricks workspace, the path must be absolute and begin with /. For files stored in a remote repository, the path must be relative. This field does not support dynamic value references such as variables.
source String The location type of the Python file. Valid values are WORKSPACE and GIT. When set to WORKSPACE, the file will be retrieved from the local Databricks workspace. When set to GIT, the file will be retrieved from a Git repository defined in git_source. If the value is empty, the task will use GIT if git_source is defined and WORKSPACE otherwise.

Examples

The following example adds a Python script task to a job. The path for the Python file to deploy is relative to the configuration file in which this task is declared. The task gets the Python file from its deployed location in the Databricks workspace.

resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job

      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py
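
To pass arguments to the script, add parameters to the spark_python_task. The following sketch uses placeholder values, which the script can read from sys.argv:

resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job
      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py
            parameters:
              - '--env'
              - 'dev'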

Python wheel task

You use this task to run a Python wheel. See Build a Python wheel file using Databricks Asset Bundles.

The following keys are available for a Python wheel task. For the corresponding REST API object definition, see python_wheel_task.

Key Type Description
entry_point String Required. The named entry point to execute: a function or class. If the entry point does not exist in the package metadata, the function is executed directly from the package using $packageName.$entryPoint().
named_parameters Map The named parameters to pass to the Python wheel task, also known as keyword arguments. A named parameter is a key-value pair with a string key and a string value. parameters and named_parameters cannot both be specified. If named_parameters is specified, the parameters are passed as keyword arguments to the entry point function.
package_name String Required. The name of the Python package to execute. All dependencies must be installed in the environment. This does not check for or install any package dependencies.
parameters Sequence The parameters to pass to the Python wheel task, also known as positional arguments. Each parameter is a string. If specified, named_parameters must not be specified.

Examples

The following example adds a Python wheel task to a job. The path for the Python wheel file to deploy is relative to the configuration file in which this task is declared. See Databricks Asset Bundles library dependencies.

resources:
  jobs:
    my-python-wheel-job:
      name: my-python-wheel-job
      tasks:
        - task_key: my-python-wheel-task
          python_wheel_task:
            entry_point: run
            package_name: my_package
          libraries:
            - whl: ./my_package/dist/my_package-*.whl
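
To pass keyword arguments to the entry point, set named_parameters (or use parameters for positional arguments; the two cannot be combined). The following sketch uses placeholder parameter names and values:

resources:
  jobs:
    my-python-wheel-job:
      name: my-python-wheel-job
      tasks:
        - task_key: my-python-wheel-task
          python_wheel_task:
            entry_point: run
            package_name: my_package
            named_parameters:
              env: dev
              limit: '100'
          libraries:
            - whl: ./my_package/dist/my_package-*.whl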

Run job task

You use this task to run another job.

The following keys are available for a run job task. For the corresponding REST API object definition, see run_job_task.

Key Type Description
job_id Integer Required. The ID of the job to run. The job must already exist in the workspace.
job_parameters Map Job-level parameters to pass to the job being run. These parameters are accessible within the job's tasks.
pipeline_params Map Parameters for the pipeline task. Used only if the job being run contains a pipeline task. Can include full_refresh to trigger a full refresh of the pipeline.

Examples

The following example contains a run job task in the second job that runs the first job.

This example uses a substitution to retrieve the ID of the job to run. To get a job's ID from the UI, open the job in the workspace and copy the ID from the Job ID value in the Job details tab of the job's settings page.

resources:
  jobs:
    my-first-job:
      name: my-first-job
      tasks:
        - task_key: my-first-job-task
          new_cluster:
            spark_version: '13.3.x-scala2.12'
            node_type_id: 'i3.xlarge'
            num_workers: 2
          notebook_task:
            notebook_path: ./src/test.py
    my-second-job:
      name: my-second-job
      tasks:
        - task_key: my-second-job-task
          run_job_task:
            job_id: ${resources.jobs.my-first-job.id}
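
To pass job-level parameters to the job being run, set job_parameters on the run_job_task. The following sketch uses a placeholder parameter name, which the target job can reference as a job parameter:

resources:
  jobs:
    my-second-job:
      name: my-second-job
      tasks:
        - task_key: my-second-job-task
          run_job_task:
            job_id: ${resources.jobs.my-first-job.id}
            job_parameters:
              source_env: dev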

SQL task

You use this task to run a SQL file, query, or alert.

The following keys are available for a SQL task. For the corresponding REST API object definition, see sql_task.

Key Type Description
alert Map Configuration for running a SQL alert. Contains:
  • alert_id (String): Required. The canonical identifier of the SQL alert to run.
  • pause_subscriptions (Boolean): Whether to pause alert subscriptions.
  • subscriptions (Sequence): List of subscription settings.
dashboard Map Configuration for refreshing a SQL dashboard. Contains:
  • dashboard_id (String): Required. The canonical identifier of the SQL dashboard to refresh.
  • custom_subject (String): Custom subject for the email sent to dashboard subscribers.
  • pause_subscriptions (Boolean): Whether to pause dashboard subscriptions.
  • subscriptions (Sequence): List of subscription settings.
file Map Configuration for running a SQL file. Contains:
  • path (String): Required. The path of the SQL file in the workspace or remote repository. For files stored in the Databricks workspace, the path must be absolute and begin with a slash. For files stored in a remote repository, the path must be relative.
  • source (String): The location type of the SQL file. Valid values are WORKSPACE and GIT.
parameters Map Parameters to be used for each run of this task. SQL queries and files can use these parameters by referencing them with the syntax {{parameter_key}}. Use task parameter variables to set parameters containing information about job runs.
query Map Configuration for running a SQL query. Contains:
  • query_id (String): Required. The canonical identifier of the SQL query to run.
warehouse_id String Required. The ID of the SQL warehouse to use to run the SQL task. The SQL warehouse must already exist.

Examples

Tip

To get a SQL warehouse's ID, open the SQL warehouse's settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

The following example adds a SQL file task to a job. This SQL file task uses the specified SQL warehouse to run the specified SQL file.

resources:
  jobs:
    my-sql-file-job:
      name: my-sql-file-job
      tasks:
        - task_key: my-sql-file-task
          sql_task:
            file:
              path: /Users/someone@example.com/hello-world.sql
              source: WORKSPACE
            warehouse_id: 1a111111a1111aa1

The following example adds a SQL alert task to a job. This SQL alert task uses the specified SQL warehouse to refresh the specified SQL alert.

resources:
  jobs:
    my-sql-alert-job:
      name: my-sql-alert-job
      tasks:
        - task_key: my-sql-alert-task
          sql_task:
            warehouse_id: 1a111111a1111aa1
            alert:
              alert_id: 11111111-1111-1111-1111-111111111111

The following example adds a SQL query task to a job. This SQL query task uses the specified SQL warehouse to run the specified SQL query.

resources:
  jobs:
    my-sql-query-job:
      name: my-sql-query-job
      tasks:
        - task_key: my-sql-query-task
          sql_task:
            warehouse_id: 1a111111a1111aa1
            query:
              query_id: 11111111-1111-1111-1111-111111111111
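
SQL files and queries can also receive parameters, which the SQL references with the {{parameter_key}} syntax. The following sketch passes a placeholder parameter to a SQL file task:

resources:
  jobs:
    my-sql-file-job:
      name: my-sql-file-job
      tasks:
        - task_key: my-sql-file-task
          sql_task:
            file:
              path: /Users/someone@example.com/hello-world.sql
              source: WORKSPACE
            parameters:
              environment: dev
            warehouse_id: 1a111111a1111aa1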

Other task settings

The following task settings allow you to configure behaviors for all tasks. For the corresponding REST API object definitions, see tasks.

Key Type Description
compute_key String The key of the compute resource to use for this task. If specified, new_cluster, existing_cluster_id, and job_cluster_key cannot be specified.
depends_on Sequence An optional list of task dependencies. Each item contains:
  • task_key (String): Required. The key of the task this task depends on.
  • outcome (String): Can be specified only for condition_task. If specified, the dependent task will only run if the condition evaluates to the specified outcome (either true or false).
description String An optional description for the task.
disable_auto_optimization Boolean Whether to disable automatic optimization for this task. If true, automatic optimizations like adaptive query execution will be disabled.
email_notifications Map An optional set of email addresses to notify when a run begins, completes, or fails. Each item contains:
  • on_start (Sequence): List of email addresses to notify when a run starts.
  • on_success (Sequence): List of email addresses to notify when a run completes successfully.
  • on_failure (Sequence): List of email addresses to notify when a run fails.
  • on_duration_warning_threshold_exceeded (Sequence): List of email addresses to notify when run duration exceeds the threshold.
  • on_streaming_backlog_exceeded (Sequence): List of email addresses to notify when any streaming backlog thresholds are exceeded for any stream.
environment_key String The key of an environment defined in the job's environments configuration. Used to specify environment-specific settings. This field is required for Python script, Python wheel and dbt tasks when using serverless compute.
existing_cluster_id String The ID of an existing cluster that will be used for all runs of this task.
health Map An optional specification for health monitoring of this task that includes a rules key, which is a list of health rules to evaluate.
job_cluster_key String The key of a job cluster defined in the job's job_clusters configuration.
libraries Sequence An optional list of libraries to be installed on the cluster that will execute the task. Each library is specified as a map with keys like jar, egg, whl, pypi, maven, cran, or requirements.
max_retries Integer An optional maximum number of times to retry the task if it fails. If not specified, the task will not be retried.
min_retry_interval_millis Integer An optional minimal interval in milliseconds between the start of the failed run and the subsequent retry run. If not specified, the default is 0 (immediate retry).
new_cluster Map A specification for a new cluster to be created for each run of this task. See cluster.
notification_settings Map Optional notification settings for this task. Each item contains:
  • no_alert_for_skipped_runs (Boolean): If true, do not send notifications for skipped runs.
  • no_alert_for_canceled_runs (Boolean): If true, do not send notifications for canceled runs.
  • alert_on_last_attempt (Boolean): If true, send notifications only on the last retry attempt.
retry_on_timeout Boolean An optional policy to specify whether to retry the task when it times out. If not specified, defaults to false.
run_if String An optional value indicating the condition under which the task should run. Valid values are:
  • ALL_SUCCESS (default): Run if all dependencies succeed.
  • AT_LEAST_ONE_SUCCESS: Run if at least one dependency succeeds.
  • NONE_FAILED: Run if no dependencies have failed.
  • ALL_DONE: Run when all dependencies complete, regardless of outcome.
  • AT_LEAST_ONE_FAILED: Run if at least one dependency fails.
  • ALL_FAILED: Run if all dependencies fail.
task_key String Required. A unique name for the task. This field is used to refer to this task from other tasks using the depends_on field.
timeout_seconds Integer An optional timeout applied to each run of this task. A value of 0 means no timeout. If not set, no timeout is applied.
webhook_notifications Map An optional set of system destinations to notify when a run begins, completes, or fails. Each item contains:
  • on_start (Sequence): List of notification destinations when a run starts.
  • on_success (Sequence): List of notification destinations when a run completes.
  • on_failure (Sequence): List of notification destinations when a run fails.
  • on_duration_warning_threshold_exceeded (Sequence): List of notification destinations when run duration exceeds threshold.
  • on_streaming_backlog_exceeded (Sequence): List of notification destinations to notify when any streaming backlog thresholds are exceeded for any stream.
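
The following sketch applies several of these settings to the tasks of a single job: a shared job cluster, a dependency, retry and timeout behavior, and an email notification on failure. The cluster settings, paths, and email address are placeholders.

resources:
  jobs:
    my-job:
      name: my-job
      job_clusters:
        - job_cluster_key: my-cluster
          new_cluster:
            spark_version: '13.3.x-scala2.12'
            node_type_id: 'i3.xlarge'
            num_workers: 2
      tasks:
        - task_key: prepare_data
          job_cluster_key: my-cluster
          notebook_task:
            notebook_path: ../src/prepare_data.ipynb
        - task_key: train_model
          depends_on:
            - task_key: prepare_data
          run_if: ALL_SUCCESS
          max_retries: 2
          min_retry_interval_millis: 60000
          timeout_seconds: 3600
          email_notifications:
            on_failure:
              - someone@example.com
          job_cluster_key: my-cluster
          notebook_task:
            notebook_path: ../src/train_model.ipynb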