Remove unused data files with vacuum
You can remove data files that are no longer referenced by a Delta table and are older than the retention threshold by running the VACUUM command on the table. Running VACUUM regularly is important for cost and compliance because of the following considerations:
- Deleting unused data files reduces cloud storage costs.
- Data files removed by VACUUM might contain records that have been modified or deleted. Permanently removing these files from cloud storage ensures these records are no longer accessible.
Caveats for vacuum
The default retention threshold for data files after running VACUUM is 7 days. To change this behavior, see Configure data retention for time travel queries.
VACUUM might leave behind empty directories after removing all files from within them. Subsequent VACUUM operations delete these empty directories.
Some Delta Lake features use metadata files to mark data as deleted rather than rewriting data files. You can use REORG TABLE ... APPLY (PURGE) to commit these deletions and rewrite data files. See Purge metadata-only deletes to force data rewrite.
Important
- In Databricks Runtime 13.3 LTS and above, VACUUM semantics for shallow clones with Unity Catalog managed tables differ from other Delta tables. See Vacuum and Unity Catalog shallow clones.
- VACUUM removes all files from directories not managed by Delta Lake, ignoring directories beginning with _ or . (an underscore or a period). If you are storing additional metadata like Structured Streaming checkpoints within a Delta table directory, use a directory name such as _checkpoints.
  - Data for change data feed is managed by Delta Lake in the _change_data directory and removed with VACUUM. See Use Delta Lake change data feed on Azure Databricks.
  - Bloom filter indexes use the _delta_index directory managed by Delta Lake. VACUUM cleans up files in this directory. See Bloom filter indexes.
- The ability to query table versions older than the retention period is lost after running VACUUM.
- Log files are deleted automatically and asynchronously after checkpoint operations and are not governed by VACUUM. While the default retention period of log files is 30 days, running VACUUM on a table removes the data files necessary for time travel.
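The directory rule above can be modeled in a few lines of Python. This is a simplified sketch, not Delta Lake's implementation: the helper name `is_skipped_by_vacuum` and the example paths are hypothetical, and it encodes only the rules stated in this section (directories beginning with `_` or `.` are ignored, except the Delta-managed `_change_data` and `_delta_index` directories).

```python
from pathlib import PurePosixPath

# Directories that Delta Lake manages itself and that VACUUM still cleans up,
# according to the notes in this section.
DELTA_MANAGED_DIRS = {"_change_data", "_delta_index"}


def is_skipped_by_vacuum(relative_path: str) -> bool:
    """Return True if a file (path relative to the table root) sits under a
    directory beginning with '_' or '.', which VACUUM ignores.

    Hypothetical helper: a simplified model of the documented rule only.
    """
    directories = PurePosixPath(relative_path).parts[:-1]  # drop the file name
    return any(
        name.startswith(("_", ".")) and name not in DELTA_MANAGED_DIRS
        for name in directories
    )


# Made-up example paths:
print(is_skipped_by_vacuum("_checkpoints/offsets/0"))  # ignored by VACUUM
print(is_skipped_by_vacuum("checkpoints/offsets/0"))   # not protected
```

Under this model, a streaming checkpoint stored in `checkpoints/` would be treated like any other unreferenced file, which is why the underscore-prefixed name matters.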
Note
When disk caching is enabled, a cluster might contain data from Parquet files that have been deleted with VACUUM. Therefore, it may be possible to query the data of previous table versions whose files have been deleted. Restarting the cluster will remove the cached data. See Configure the disk cache.
Example syntax for vacuum
VACUUM table_name -- vacuum files not required by versions older than the default retention period
VACUUM table_name RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old
VACUUM table_name DRY RUN -- do dry run to get the list of files to be deleted
For Spark SQL syntax details, see VACUUM.
See the Delta Lake API documentation for Scala, Java, and Python syntax details.
Note
Use the RETAIN keyword to specify the threshold used to determine if a data file should be removed. The VACUUM command looks back by the specified amount of time and identifies the most recent table version at that moment. Delta retains all data files required to query that table version and all newer table versions. This setting interacts with other table properties. See Configure data retention for time travel queries.
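The look-back rule in this note can be illustrated with a small pure-Python model. This is a sketch of the rule as described here, not of Delta Lake's internals: the function `files_eligible_for_removal`, the `(commit_time, files)` representation of table versions, and the file names are all hypothetical.

```python
from datetime import datetime, timedelta


def files_eligible_for_removal(versions, all_files, now, retain):
    """Model of the RETAIN rule described above.

    versions: list of (commit_time, set_of_file_names), oldest to newest.
    Returns the files not needed by the anchor version or any newer version.
    """
    cutoff = now - retain
    # Find the most recent version committed at or before the cutoff.
    # If no version is that old, every version is kept (anchor stays 0).
    anchor = 0
    for i, (commit_time, _) in enumerate(versions):
        if commit_time <= cutoff:
            anchor = i
    retained = set()
    for _, files in versions[anchor:]:
        retained |= files
    return set(all_files) - retained


# Made-up history: each version rewrites the table into a new file.
versions = [
    (datetime(2024, 1, 1), {"a.parquet"}),
    (datetime(2024, 1, 2), {"b.parquet"}),
    (datetime(2024, 1, 8), {"c.parquet"}),
]
all_files = {"a.parquet", "b.parquet", "c.parquet"}
now = datetime(2024, 1, 10)

print(files_eligible_for_removal(versions, all_files, now, timedelta(days=7)))
print(files_eligible_for_removal(versions, all_files, now, timedelta(hours=24)))
```

With a 7-day threshold, the anchor is the January 2 version, so only `a.parquet` is eligible. Shortening the threshold to 24 hours moves the anchor to the January 8 version and makes `b.parquet` eligible as well.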
Purge metadata-only deletes to force data rewrite
The REORG TABLE command provides the APPLY (PURGE) syntax to rewrite data to apply soft-deletes. Soft-deletes do not rewrite data or delete data files, but rather use metadata files to indicate that some data values have changed. See REORG TABLE.
Operations that create soft-deletes in Delta Lake include the following:
- Dropping columns with column mapping enabled.
- Deleting rows with deletion vectors enabled.
- Any data modifications on Photon-enabled clusters when deletion vectors are enabled.
With soft-deletes enabled, old data may remain physically present in the table's current files even after the data has been deleted or updated. To remove this data physically from the table, complete the following steps:
- Run REORG TABLE ... APPLY (PURGE). After doing this, the old data is no longer present in the table's current files, but it is still present in the older files that are used for time travel.
- Run VACUUM to delete these older files.
REORG TABLE creates a new version of the table as the operation completes. All table versions in the history prior to this transaction refer to older data files. Conceptually, this is similar to the OPTIMIZE command, where data files are rewritten even though data in the current table version stays consistent.
Important
Data files are only deleted when the files have expired according to the VACUUM retention period. This means that VACUUM must run with a delay after REORG so that the older files have expired. The retention period of VACUUM can be reduced to shorten the required waiting time, at the cost of reducing the maximum history that is retained.
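The ordering and the waiting time can be sketched as a small planning helper. This is an illustration under stated assumptions: `purge_plan` and the table name `events` are hypothetical, the 7-day default comes from this article, and the helper only builds statement strings and timestamps rather than running anything.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(days=7)  # default threshold stated in this article


def purge_plan(table_name, reorg_time, retention=DEFAULT_RETENTION):
    """Return (statement, earliest_run_time) pairs for a physical purge.

    Hypothetical helper: VACUUM is scheduled one retention period after
    REORG so that the files REORG replaced have expired by then.
    """
    return [
        (f"REORG TABLE {table_name} APPLY (PURGE)", reorg_time),
        (f"VACUUM {table_name}", reorg_time + retention),
    ]


for statement, run_at in purge_plan("events", datetime(2024, 1, 1, 12, 0)):
    print(run_at.isoformat(), statement)
```

Passing a shorter `retention` moves the second step earlier, which mirrors the trade-off above: less waiting, but less history available for time travel.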
What size cluster does vacuum need?
To select the correct cluster size for VACUUM, it helps to understand that the operation occurs in two phases:
- The job begins by using all available executor nodes to list files in the source directory in parallel. This list is compared to all files currently referenced in the Delta transaction log to identify files to be deleted. The driver sits idle during this time.
- The driver then issues deletion commands for each file to be deleted. File deletion is a driver-only operation, meaning that all operations occur in a single node while the worker nodes sit idle.
To optimize cost and performance, Databricks recommends the following, especially for long-running vacuum jobs:
- Run vacuum on a cluster with auto-scaling set for 1-4 workers, where each worker has 8 cores.
- Select a driver with between 8 and 32 cores. Increase the size of the driver to avoid out-of-memory (OOM) errors.
If VACUUM operations are regularly deleting more than 10 thousand files or taking over 30 minutes of processing time, you might want to increase either the size of the driver or the number of workers.
If you find that the slowdown occurs while identifying files to be removed, add more worker nodes. If the slowdown occurs while delete commands are running, try increasing the size of the driver.
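The sizing guidance above amounts to a simple decision rule, sketched here in Python. The function `suggest_scaling` and its inputs are hypothetical: it assumes you have already measured how long each phase took (for example, from job timings), which this article does not describe how to do.

```python
def suggest_scaling(files_deleted, listing_minutes, deleting_minutes):
    """Apply the rule of thumb from this section to one VACUUM run.

    Hypothetical helper; the thresholds (10,000 files, 30 minutes) are the
    ones quoted above, and the per-phase timings are assumed inputs.
    """
    total_minutes = listing_minutes + deleting_minutes
    if files_deleted <= 10_000 and total_minutes <= 30:
        return "keep current cluster"
    if listing_minutes > deleting_minutes:
        # Listing runs in parallel on the executors.
        return "add worker nodes"
    # Deletion is driver-only (ties also land here in this sketch).
    return "increase driver size"


print(suggest_scaling(files_deleted=50_000, listing_minutes=40, deleting_minutes=10))
print(suggest_scaling(files_deleted=50_000, listing_minutes=5, deleting_minutes=45))
```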
How frequently should you run vacuum?
Databricks recommends regularly running VACUUM on all tables to reduce excess cloud data storage costs. The default retention threshold for vacuum is 7 days. Setting a higher threshold gives you access to a greater history for your table, but increases the number of data files stored and, as a result, incurs greater storage costs from your cloud provider.
Why can't you vacuum a Delta table with a low retention threshold?
Warning
It is recommended that you set a retention interval of at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
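The effect of the safety check can be modeled with a short Python function. This is a sketch only, not Delta Lake's code: `validate_retention` is a hypothetical name, the `check_enabled` argument stands in for the spark.databricks.delta.retentionDurationCheck.enabled property, and the sketch assumes the check compares the requested interval against the 7-day default.

```python
from datetime import timedelta

SAFE_MINIMUM = timedelta(days=7)  # assumed comparison point for this sketch


def validate_retention(retain, check_enabled=True, minimum=SAFE_MINIMUM):
    """Reject retention intervals below the minimum unless the check is off.

    Hypothetical model of the safety check described above.
    """
    if check_enabled and retain < minimum:
        raise ValueError(
            f"Retention of {retain} is below {minimum}; concurrent readers, "
            "writers, or lagging streams might still need these files."
        )
    return retain


validate_retention(timedelta(days=7))  # accepted
validate_retention(timedelta(hours=100), check_enabled=False)  # accepted, check off
try:
    validate_retention(timedelta(hours=100))
except ValueError as err:
    print("rejected:", err)
```

Turning the check off does not make a short interval safe. It only moves the responsibility for verifying that no long-running operation depends on the files to the person running the command.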
Audit information
VACUUM commits to the Delta transaction log contain audit information. You can query the audit events using DESCRIBE HISTORY.