Best practices for cost optimization
This article covers best practices supporting principles of cost optimization, organized by principle.
1. Choose optimal resources
Use performance optimized data formats
To get the most out of the Databricks Data Intelligence Platform, you must use Delta Lake as your storage framework. It helps build simpler and more reliable ETL pipelines, and comes with many performance enhancements that can significantly speed up workloads compared to using Parquet, ORC, and JSON. See Optimization recommendations on Azure Databricks. If the workload is also running on a job compute, this directly translates into shorter uptime of compute resources leading to lower costs.
Use job compute
A job is a way to run non-interactive code on a Databricks compute instance. For example, you can run an extract, transform, and load (ETL) workload interactively or on a schedule. Of course, you can also run jobs interactively in the notebook UI. However, on job compute, the non-interactive workloads will cost significantly less than on all-purpose compute. See the pricing overview to compare Jobs Compute and All-Purpose Compute.
An additional benefit for some jobs is that each job or workflow can run on a new compute instance, isolating workloads from each other. However, multitask workflows can also reuse compute resources for all tasks, so the compute startup time occurs only once per workflow. See Configure compute for jobs.
Use SQL warehouse for SQL workloads
For interactive SQL workloads, a Databricks SQL warehouse is the most cost-efficient engine. See the pricing overview. All SQL warehouses come with Photon by default, which accelerates your existing SQL and DataFrame API calls and reduces your overall cost per workload.
In addition, serverless SQL warehouses support intelligent workload management (IWM), a set of features that enhances Databricks SQL serverless ability to process large numbers of queries quickly and cost-effectively.
Use up-to-date runtimes for your workloads
The Azure Databricks platform provides different runtimes that are optimized for data engineering tasks (Databricks Runtime) or machine learning tasks (Databricks Runtime for Machine Learning). The runtimes are built to provide the best selection of libraries for the tasks, and to ensure that all libraries provided are up-to-date and work together optimally. The Databricks Runtimes are released on a regular cadence, providing performance improvements between major releases. These performance improvements often result in cost savings due to more efficient use of compute resources.
Only use GPUs for the right workloads
Virtual machines with GPUs can dramatically speed up computations for deep learning, but are significantly more expensive than CPU-only machines. Use GPU instances only for workloads that have GPU-accelerated libraries.
Most workloads do not use GPU-accelerated libraries, so they do not benefit from GPU-enabled instances. Workspace administrators can restrict GPU machines and compute resources to prevent unnecessary usage. See the blog post "Are GPUs Really Expensive? Benchmarking GPUs for Inference on Databricks Clusters".
Use serverless services for your workloads
BI use cases
BI workloads typically consume data in bursts and generate multiple concurrent queries. For example, someone using a BI tool might update a dashboard or write a query and then simply analyze the results without further interaction with the platform. In this scenario the data platform:
- Terminates idle compute resources to save costs.
- Quickly provides the compute resources when the user requests new or updated data with the BI tool.
Non-serverless Azure Databricks SQL warehouses have a startup time of minutes, so many users tend to accept the higher cost and do not terminate them during idle periods. On the other hand, serverless SQL warehouses start and scale up in seconds, so both instant availability and idle termination can be achieved. This results in a great user experience and overall cost savings.
Additionally, serverless SQL warehouses scale down earlier than non-serverless warehouses, again, resulting in lower costs.
ML and AI model serving
Most models are served as a REST API for integration into your web or client application; the model serving service receives varying loads of requests over time, and a model serving platform should always provide sufficient resources, but only as many as are actually needed (upscaling and downscaling).
Use the right instance type
Using the latest generation of cloud instance types almost always provides performance benefits, as they offer the best performance and the latest features.
Based on your workloads, it is also important to choose the right instance family to get the best performance/price ratio. Some simple rules of thumb are:
- Memory optimized for ML, heavy shuffle and spill workloads
- Compute optimized for structured streaming workloads and maintenance jobs (such as optimize and vacuum)
- Storage optimized for workloads that benefit from caching, such as ad-hoc and interactive data analysis
- GPU optimized for specific ML and DL workloads
- General purpose in the absence of specific requirements
Choose the most efficient compute size
Azure Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Azure Databricks architecture. People often think of cluster size in terms of the number of workers, but there are other important factors to consider:
- Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a compute instance.
- Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.
- Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching.
Additional considerations include worker instance type and size, which also influence the preceding factors. When sizing your compute, consider the following:
- How much data will your workload consume?
- What's the computational complexity of your workload?
- Where are you reading data from?
- How is the data partitioned in external storage?
- How much parallelism do you need?
Details and examples can be found under Compute sizing considerations.
Evaluate performance-optimized query engines
Photon is a high-performance Databricks-native vectorized query engine that speeds up your SQL workloads and DataFrame API calls (for data ingestion, ETL, streaming, data science, and interactive queries). Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on - no code changes and no lock-in.
The observed speedup can lead to significant cost savings, and jobs that run regularly should be evaluated to see whether they are not only faster but also cheaper with Photon.
2. Dynamically allocate resources
Use auto-scaling compute
With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally intensive than others, and Databricks automatically adds additional workers during those phases of your job (and removes them when they're no longer needed). Autoscaling can reduce overall costs compared to a statically sized compute instance.
Compute auto-scaling has limitations when scaling down cluster size for structured streaming workloads. Databricks recommends using Delta Live Tables with enhanced autoscaling for streaming workloads.
Use auto termination
Azure Databricks provides several features to help control costs by reducing idle resources and controlling when compute resources can be deployed.
- Configure auto termination for all interactive compute resources. After a specified idle time, the compute resource shuts down. See Automatic termination.
- For use cases where compute is needed only during business hours, compute resources can be configured with auto termination, and a scheduled process can restart compute (and possibly prewarm data if necessary) in the morning before users are back at their desktops. See CACHE SELECT.
- If compute startup times are too long, consider using cluster pools, see Pool best practices. Azure Databricks pools are a set of idle, ready-to-use instances. When cluster nodes are created using the idle instances, cluster start and auto-scaling times are reduced. If the pools have no idle instances, the pools expand by allocating a new instance from the instance provider in order to accommodate the cluster's request.
Azure Databricks does not charge Databricks Units (DBUs) while instances are idle in the pool, resulting in cost savings. Instance provider billing does apply.
Use compute policies to control costs
Compute policies can enforce many cost-specific restrictions for compute resources. See Operational Excellence - Use compute policies. For example:
- Enable cluster autoscaling with a set minimum number of worker nodes.
- Enable cluster auto termination with a reasonable value (for example, 1 hour) to avoid paying for idle times.
- Ensure that only cost-efficient VM instances can be selected. Follow the best practices for cluster configuration. See Compute configuration recommendations.
- Apply a spot instance strategy.
3. Monitor and control cost
Monitor costs
Use the Azure Cost Manager to analyze Azure Databricks costs. Compute and Workspace tags are also delivered to the Azure Cost Manager. See Tag clusters for cost attribution.
Tag clusters for cost attribution
To monitor costs in general and to accurately attribute Azure Databricks usage to your organization's business units and teams for chargeback purposes, you can tag clusters, SQL warehouses, and pools. These tags propagate to detailed Databricks Units (DBU) and cloud provider VM and blob storage usage for cost analysis.
Ensure that cost control and attribution are considered when setting up workspaces and clusters for teams and use cases. This streamlines tagging and improves the accuracy of cost attribution.
Total costs include the DBU virtual machine, disk, and any associated network costs. For serverless SQL warehouses, the DBU cost already includes the virtual machine and disk costs.
The tags of Azure Databricks resources can be used in the cost analysis tools in the Azure Portal
Implement observability to track and chargeback cost
When working with complex technical ecosystems, proactively understanding the unknowns is key to maintaining platform stability and controlling costs. Observability provides a way to analyze and optimize systems based on the data they generate. This is different from monitoring, which focuses on identifying new patterns rather than tracking known issues.
See Blog: Intelligently Balance Cost Optimization & Reliability on Databricks
Share cost reports regularly
Generate monthly cost reports to track consumption growth and anomalies. Share these reports by use case or team with the teams that own the workloads using cluster tagging. This eliminates surprises and allows teams to proactively adjust their workloads if costs become too high.
Monitor and manage Delta Sharing egress costs
Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. See Monitor and manage Delta Sharing egress costs (for providers) to monitor and manage egress charges.
4. Design cost-effective workloads
Balance always-on and triggered streaming
Traditionally, when people think about streaming, terms such as "real-time", "24/7," or "always on" come to mind. If data ingestion happens in real-time, the underlying compute resources must run 24/7, incurring costs every single hour of the day.
However, not every use case that relies on a continuous stream of events requires those events to be immediately added to the analytics data set. If the business requirement for the use case only requires fresh data every few hours or every day, then that requirement can be met with only a few runs per day, resulting in a significant reduction in workload cost. Databricks recommends using Structured Streaming with the AvailableNow
trigger for incremental workloads that do not have low latency requirements. See Configuring incremental batch processing.