Use Azure Databricks compute with your jobs

When you run an Azure Databricks job, the tasks configured as part of the job run on Azure Databricks compute, either a cluster or a SQL warehouse, depending on the task type. Selecting the compute type and configuration options is important when operationalizing a job. This article provides recommendations for using Azure Databricks compute resources to run your jobs.

Note

Secrets are not redacted from a cluster's Spark driver log stdout and stderr streams. To protect sensitive data, by default, Spark driver logs are viewable only by users with CAN MANAGE permission on job clusters, single user access mode clusters, and shared access mode clusters. To allow users with CAN ATTACH TO or CAN RESTART permission to view the logs on these clusters, set the following Spark configuration property in the cluster configuration: spark.databricks.acl.needAdminPermissionToViewLogs false.

On No Isolation Shared access mode clusters, the Spark driver logs can be viewed by users with CAN ATTACH TO or CAN MANAGE permission. To limit who can read the logs to only users with the CAN MANAGE permission, set spark.databricks.acl.needAdminPermissionToViewLogs to true.

See Spark configuration to learn how to add Spark properties to a cluster configuration.
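If you define job clusters programmatically, the same property can be passed through the cluster's Spark configuration. The following is a minimal sketch using the Databricks SDK for Python (databricks-sdk); the runtime version and node type are example values, not recommendations.

```python
# Sketch: a job cluster spec whose Spark configuration relaxes driver-log
# visibility so CAN ATTACH TO / CAN RESTART users can read the logs.
# Runtime version and node type are placeholder example values.
from databricks.sdk.service import compute, jobs

log_visible_cluster = jobs.JobCluster(
    job_cluster_key="main",
    new_cluster=compute.ClusterSpec(
        spark_version="13.3.x-scala2.12",   # example Databricks Runtime
        node_type_id="Standard_DS3_v2",     # example Azure VM size
        num_workers=2,
        spark_conf={"spark.databricks.acl.needAdminPermissionToViewLogs": "false"},
    ),
)
```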

Use shared job clusters

To optimize resource usage with jobs that orchestrate multiple tasks, use shared job clusters. A shared job cluster allows multiple tasks in the same job run to reuse the cluster. You can use a single job cluster to run all tasks that are part of the job, or multiple job clusters optimized for specific workloads. To use a shared job cluster:

  1. Select New Job Clusters when you create a task and complete the cluster configuration.
  2. Select the new cluster when adding a task to the job, or create a new job cluster. Any cluster you configure when you select New Job Clusters is available to any task in the job.

A shared job cluster is scoped to a single job run and cannot be used by other jobs or runs of the same job.

Libraries cannot be declared in a shared job cluster configuration. You must add dependent libraries in task settings.
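The following sketch shows these two points together using the Databricks SDK for Python: one shared job cluster (identified by its job_cluster_key) reused by two tasks in the same job, with a dependent library attached at the task level because it cannot be declared on the shared cluster. The job name, notebook paths, library, runtime version, and node type are placeholders.

```python
# Sketch: a shared job cluster reused by multiple tasks in one job run.
# All names, paths, and sizes below are placeholder example values.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

shared_cluster = jobs.JobCluster(
    job_cluster_key="shared",
    new_cluster=compute.ClusterSpec(
        spark_version="13.3.x-scala2.12",
        node_type_id="Standard_DS3_v2",
        num_workers=4,
    ),
)

job = w.jobs.create(
    name="example-multitask-job",
    job_clusters=[shared_cluster],
    tasks=[
        jobs.Task(
            task_key="ingest",
            job_cluster_key="shared",  # runs on the shared job cluster
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/me/ingest"),
            # Dependent libraries are declared on the task, not on the shared cluster.
            libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="requests"))],
        ),
        jobs.Task(
            task_key="transform",
            job_cluster_key="shared",  # same cluster, same job run
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/me/transform"),
        ),
    ],
)
print(f"Created job {job.job_id}")
```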

Choose the correct cluster type for your job

  • New Job Clusters are dedicated clusters for a job or task run. A shared job cluster is created and started when the first task using the cluster starts and terminates after the last task using the cluster completes. The cluster is not terminated when idle, but only after all tasks using it have completed. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created. A cluster scoped to a single task is created and started when the task starts and terminates when the task completes. In production, Databricks recommends using new shared or task-scoped clusters so that each job or task runs in a fully isolated environment.
  • When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to the task workload pricing. When you run a task on an existing all-purpose cluster, the task is treated as a data analytics (all-purpose) workload, subject to all-purpose workload pricing.
  • If you select a terminated existing cluster and the job owner has CAN RESTART permission, Azure Databricks starts the cluster when the job is scheduled to run.
  • Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals (see the configuration sketch after this list).
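As a brief sketch of that last case, a task can reference an existing all-purpose cluster by its cluster ID instead of a new job cluster (Databricks SDK for Python; the cluster ID and notebook path below are hypothetical). Runs on such a cluster are billed as all-purpose workloads.

```python
# Sketch: a task that runs on an existing all-purpose cluster.
# The cluster ID and notebook path are hypothetical placeholders.
from databricks.sdk.service import jobs

dashboard_task = jobs.Task(
    task_key="refresh_dashboard",
    existing_cluster_id="0123-456789-abcde123",  # existing all-purpose cluster
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/me/refresh_dashboard"),
)
```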

Use a pool to reduce cluster start times

To decrease new job cluster start time, create a pool and configure the job's cluster to use the pool.
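As a minimal sketch (Databricks SDK for Python), a job cluster draws its nodes from a pre-created instance pool by setting instance_pool_id in the cluster spec; the pool ID below is a placeholder, and the node type is omitted because the pool determines the instance type.

```python
# Sketch: a job cluster that acquires instances from a pool to reduce start time.
# The pool ID is a hypothetical placeholder.
from databricks.sdk.service import compute, jobs

pooled_cluster = jobs.JobCluster(
    job_cluster_key="pooled",
    new_cluster=compute.ClusterSpec(
        spark_version="13.3.x-scala2.12",
        instance_pool_id="pool-0123456789abcdef",  # pre-created instance pool
        num_workers=4,
    ),
)
```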