Pool best practices
This article explains what pools are and how best to configure them. For information on creating a pool, see Pool configuration reference.
Pool considerations
Consider the following when creating a pool:
- Create pools using instance types and Azure Databricks runtimes based on target workloads.
- When possible, populate pools with spot instances to reduce costs. Use on-demand instances for your driver node (see the sketch after this list).
- Populate pools with on-demand instances for jobs with short execution times and strict execution time requirements.
- Use pool tags and cluster tags to manage billing.
- Pre-populate pools to make sure instances are available when clusters need them.
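The following is a minimal sketch of the spot-versus-on-demand recommendation, assuming the Instance Pools REST API (`/api/2.0/instance-pools/create`) and its `azure_attributes` fields. The workspace URL, token, pool names, and node type are placeholders.

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder credential
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Pool of spot instances for worker nodes, to reduce cost.
spot_pool = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers=HEADERS,
    json={
        "instance_pool_name": "workers-spot",
        "node_type_id": "Standard_DS3_v2",
        "azure_attributes": {
            "availability": "SPOT_AZURE",
            "spot_bid_max_price": -1,  # bid up to the on-demand price
        },
    },
).json()

# Pool of on-demand instances for driver nodes.
ondemand_pool = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers=HEADERS,
    json={
        "instance_pool_name": "drivers-on-demand",
        "node_type_id": "Standard_DS3_v2",
        "azure_attributes": {"availability": "ON_DEMAND_AZURE"},
    },
).json()

# A cluster can then draw its workers from the spot pool and its driver from
# the on-demand pool (for example, via instance_pool_id and driver_instance_pool_id).
print(spot_pool["instance_pool_id"], ondemand_pool["instance_pool_id"])
```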
Create pools based on workloads
You can minimize instance acquisition time by creating a pool for each instance type and Azure Databricks runtime your organization commonly uses. For example, if most data engineering clusters use instance type A, data science clusters use instance type B, and analytics clusters use instance type C, create a pool for each of the three instance types.
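As an illustration, the following sketch creates one pool per workload, assuming the Instance Pools REST API; the pool names, node types, and Databricks Runtime version string are hypothetical examples, and `HOST` and `TOKEN` are placeholders.

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder credential
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Hypothetical mapping of workload to the instance type it commonly uses.
POOLS = [
    {"name": "data-engineering-pool", "node_type": "Standard_DS4_v2"},
    {"name": "data-science-pool",     "node_type": "Standard_NC6s_v3"},
    {"name": "analytics-pool",        "node_type": "Standard_DS3_v2"},
]

for pool in POOLS:
    resp = requests.post(
        f"{HOST}/api/2.0/instance-pools/create",
        headers=HEADERS,
        json={
            "instance_pool_name": pool["name"],
            "node_type_id": pool["node_type"],
            # Preloading the runtime used by this workload shortens cluster start time.
            "preloaded_spark_versions": ["13.3.x-scala2.12"],  # example runtime version
        },
    )
    resp.raise_for_status()
    print(pool["name"], resp.json()["instance_pool_id"])
```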
Tag pools to manage cost and billing
Tagging pools with the correct cost center allows you to manage cost and usage chargeback. You can apply multiple custom tags to associate multiple cost centers with a pool. However, it's important to understand how tags are propagated when a cluster is created from a pool. Pool tags propagate to the underlying cloud provider instances, but the cluster's own tags do not. Apply all custom tags required for managing chargeback of the cloud provider compute cost to the pool.
Pool tags and cluster tags both propagate to Azure Databricks billing. You can use the combination of cluster and pool tags to manage chargeback of Azure Databricks Units.
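The sketch below illustrates this split, assuming the Instance Pools and Clusters REST APIs; the tag keys, pool name, cluster name, and runtime version are illustrative placeholders.

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder credential
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Pool tags propagate to the underlying cloud provider instances, so put the
# cost-center tags used for compute chargeback on the pool.
pool = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers=HEADERS,
    json={
        "instance_pool_name": "finance-etl-pool",
        "node_type_id": "Standard_DS3_v2",
        "custom_tags": {"CostCenter": "1234", "Team": "finance"},
    },
).json()

# Cluster tags do not reach the cloud provider instances when the cluster
# comes from a pool, but both pool and cluster tags appear in Azure Databricks
# billing, so they can still be combined for DBU chargeback.
cluster = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers=HEADERS,
    json={
        "cluster_name": "finance-etl",
        "spark_version": "13.3.x-scala2.12",  # example runtime version
        "instance_pool_id": pool["instance_pool_id"],
        "num_workers": 2,
        "custom_tags": {"Project": "monthly-close"},
    },
).json()
print(cluster["cluster_id"])
```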
To learn more, see Monitor usage using tags.
Configure pools to control cost
You can use the following configuration options to help control the cost of pools (a configuration sketch follows the list):
- Set the Min Idle instances to 0 to avoid paying for running instances that aren't doing work. The tradeoff is a possible increase in time when a cluster needs to acquire a new instance.
- Set the Max Capacity based on anticipated usage. This sets the ceiling for the maximum number of used and idle instances in the pool. If a job or cluster requests an instance from a pool at its maximum capacity, the request fails, and the cluster doesn't acquire more instances. Therefore, Databricks recommends that you set the maximum capacity only if there is a strict instance quota or budget constraint.
- Set the Idle Instance Auto Termination time to provide a buffer between when the instance is released from the cluster and when it's dropped from the pool. Set this to a period that allows you to minimize cost while ensuring the availability of instances for scheduled jobs. For example, job A is scheduled to run at 8:00 AM and takes 40 minutes to complete. Job B is scheduled to run at 9:00 AM and takes 30 minutes to complete. Set the Idle Instance Auto Termination value to 20 minutes to ensure that instances returned to the pool when job A completes are available when job B starts. Unless they are claimed by another cluster, those instances are terminated 20 minutes after job B ends.
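Taken together, these options map to three pool configuration fields. The following is a minimal sketch assuming the Instance Pools REST API; the pool name, node type, and the specific values of 20 instances and 20 minutes are illustrative.

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder credential

resp = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "instance_pool_name": "cost-controlled-pool",
        "node_type_id": "Standard_DS3_v2",
        # No always-on idle instances: pay only while instances are in use.
        "min_idle_instances": 0,
        # Ceiling on used + idle instances; set only under a strict quota or budget.
        "max_capacity": 20,
        # Buffer between an instance being released and being dropped from the pool,
        # sized to bridge the gap between job A (ends ~8:40 AM) and job B (starts 9:00 AM).
        "idle_instance_autotermination_minutes": 20,
    },
)
resp.raise_for_status()
print(resp.json()["instance_pool_id"])
```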
Pre-populate pools
To benefit fully from pools, you can pre-populate newly created pools. Set Min Idle instances to a value greater than zero in the pool configuration. Alternatively, if you're following the recommendation to set this value to zero, use a starter job to ensure that newly created pools have available instances for clusters to access.
With the starter job approach, schedule a job with flexible execution time requirements to run before jobs with stricter performance requirements or before users start using interactive clusters. After the job finishes, the instances used for the job are released back to the pool. Keep the Min Idle instances setting at 0 and set the Idle Instance Auto Termination time high enough to ensure that idle instances remain available for subsequent jobs.
Using a starter job allows the pool instances to spin up, populate the pool, and remain available for downstream job clusters or interactive clusters.
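The following is a minimal starter-job sketch, assuming the Jobs REST API 2.1 and an existing pool ID. The workspace URL, token, pool ID, notebook path, runtime version, worker count, and schedule are all placeholders chosen to match the 8:00 AM example above.

```python
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder credential
POOL_ID = "<instance-pool-id>"                          # placeholder pool ID

# Schedule the starter job shortly before the time-sensitive jobs so that its
# instances are back in the pool, and still idle, when those jobs begin.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "pool-starter",
        "tasks": [
            {
                "task_key": "warm_pool",
                "notebook_task": {"notebook_path": "/Jobs/noop"},  # hypothetical lightweight notebook
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # example runtime version
                    "instance_pool_id": POOL_ID,
                    "num_workers": 4,
                },
            }
        ],
        # Runs daily at 7:30 AM, ahead of the 8:00 AM job in the example above.
        "schedule": {
            "quartz_cron_expression": "0 30 7 * * ?",
            "timezone_id": "UTC",
        },
    },
)
resp.raise_for_status()
print(resp.json()["job_id"])
```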