Transactional writes to cloud storage with DBIO
Important
This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported. See What are ACID guarantees on Azure Databricks?.
The Databricks DBIO package provides transactional writes to cloud storage for Apache Spark jobs. This solves a number of performance and correctness issues that occur when Spark is used in a cloud-native setting (for example, writing directly to storage services).
Important
The commit protocol is not respected when you access data using paths ending in *
. For example, reading dbfs:/my/path
will only return committed changes, while reading dbfs:/my/path/*
will return the content of all the data files in the directory, irrespective of whether their content was committed or not. This is an expected behavior.
With DBIO transactional commit, metadata files starting with _started_<id>
and _committed_<id>
accompany data files created by Spark jobs. Generally you shouldn't alter these files directly. Rather, you should use the VACUUM
command to clean them up.
Clean up uncommitted files
To clean up uncommitted files left over from Spark jobs, use the VACUUM
command to remove them. Normally VACUUM
happens automatically after Spark jobs complete, but you can also run it manually if a job is aborted.
For example, VACUUM ... RETAIN 1 HOUR
removes uncommitted files older than one hour.
Important
- Avoid vacuuming with a horizon of less than one hour. It can cause data inconsistency.
Also see VACUUM.
SQL
-- recursively vacuum an output path
VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]
-- vacuum all partitions of a catalog table
VACUUM tableName [RETAIN <N> HOURS]
Scala
// recursively vacuum an output path
spark.sql("VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]")
// vacuum all partitions of a catalog table
spark.sql("VACUUM tableName [RETAIN <N> HOURS]")