Recommendations for files in volumes and workspace files
When you upload or save data or files to Azure Databricks, you can choose to store these files using Unity Catalog volumes or workspace files. This article contains recommendations and requirements for using these locations. For more details on volumes and workspace files, see What are Unity Catalog volumes? and What are workspace files?.
Databricks recommends using Unity Catalog volumes to store data, libraries, and build artifacts. Store notebooks, SQL queries, and code files as workspace files. You can configure workspace file directories as Git folders to sync with remote Git repositories. See Git integration for Databricks Git folders. Small data files used for test scenarios can also be stored as workspace files.
The tables below provide specific recommendations, depending on your file type or feature needs.
Important
The Databricks File System (DBFS) is also available for file storage, but is not recommended, as all workspace users have access to files in DBFS. See DBFS.
File types
The following table provides storage recommendations by file type. Databricks supports many file formats beyond those provided as examples in this table.
| File type | Recommendation |
|---|---|
| Databricks objects, such as notebooks and queries | Store as workspace files |
| Structured data files, such as Parquet files and ORC files | Store in Unity Catalog volumes |
| Semi-structured data files, such as text files (.csv, .txt) and JSON files (.json) | Store in Unity Catalog volumes |
| Unstructured data files, such as image files (.png, .svg), audio files (.mp3), and document files (.pdf, .docx) | Store in Unity Catalog volumes |
| Raw data files used for ad hoc or early data exploration | Store in Unity Catalog volumes |
| Operational data, such as log files | Store in Unity Catalog volumes |
| Large archive files, such as ZIP files (.zip) | Store in Unity Catalog volumes |
| Source code files, such as Python files (.py), Java files (.java), and Scala files (.scala) | Store as workspace files, if applicable, with other related objects such as notebooks and queries. Databricks recommends managing these files in a Git folder for version control and change tracking. |
| Build artifacts and libraries, such as Python wheels (.whl) and JAR files (.jar) | Store in Unity Catalog volumes |
| Configuration files | Store configuration files needed across workspaces in Unity Catalog volumes; store them as workspace files if they are project files in a Git folder. |
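The two storage locations above use distinct path schemes: volume files live under `/Volumes/<catalog>/<schema>/<volume>/`, and workspace files live under `/Workspace/`. A minimal sketch of how these paths are constructed; the catalog, schema, volume, and user names are hypothetical examples, and the helper functions are illustrative, not a Databricks API:

```python
# Sketch: building the two path styles used by Azure Databricks.
# The names "main", "default", "raw_data", and the user email below
# are hypothetical placeholders.

def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a Unity Catalog volume path: /Volumes/<catalog>/<schema>/<volume>/..."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])

def workspace_path(user: str, *parts: str) -> str:
    """Build a workspace file path under a user's home directory."""
    return "/".join(["/Workspace/Users", user, *parts])

print(volume_path("main", "default", "raw_data", "logs", "app.log"))
# /Volumes/main/default/raw_data/logs/app.log
print(workspace_path("someone@example.com", "project", "etl.py"))
# /Workspace/Users/someone@example.com/project/etl.py
```

Because volumes also expose a FUSE mount at these paths, the same string can be passed to Spark readers or to plain Python file APIs such as `open()` when running on Databricks.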
Feature comparison
The following table compares the feature offerings of workspace files and Unity Catalog volumes.
| Feature | Workspace files | Unity Catalog volumes |
|---|---|---|
| File access | Workspace files are accessible only within the workspace in which they are stored. | Files are globally accessible across workspaces. |
| Programmatic access | Files can be accessed using: Spark APIs, FUSE, dbutils, REST API, Databricks SDKs, Databricks CLI | Files can be accessed using: Spark APIs, FUSE, dbutils, REST API, Databricks SDKs, Databricks SQL Connectors, Databricks CLI, Databricks Terraform Provider |
| Databricks Asset Bundles | By default, all files in a bundle, including libraries and Databricks objects such as notebooks and queries, are deployed securely as workspace files. Permissions are defined in the bundle configuration. | Bundles can be customized to include libraries already in volumes when the libraries exceed the size limit of workspace files. See Databricks Asset Bundles library dependencies. |
| File permission level | Permissions are set at the Git-folder level if the file is in a Git folder; otherwise, permissions are set at the file level. | Permissions are set at the volume level. |
| Permissions management | Permissions are managed by workspace ACLs and are limited to the containing workspace. | Metadata and permissions are managed by Unity Catalog. These permissions apply across all workspaces that have access to the catalog. |
| External storage mount | Does not support mounting external storage. | Provides the option to point to pre-existing datasets on external storage by creating an external volume. See What are Unity Catalog volumes?. |
| UDF support | Not supported. | Writing from UDFs is supported using Volumes FUSE. |
| File size | Store smaller files (less than 500 MB), such as source code files (.py, .md, .yml) needed alongside notebooks. | Store very large data files, subject to limits determined by cloud service providers. |
| Upload & download | Supports upload and download up to 10 MB. | Supports upload and download up to 5 GB. |
| Table creation support | Tables cannot be created with workspace files as the location. | Tables can be created from files in a volume by running COPY INTO, Auto Loader, or other options described in Ingest data into a Databricks lakehouse. |
| Directory structure & file paths | Files are organized in nested directories, each with its own permission model: user home directories (one for each user and service principal in the workspace), Git folders, and Shared. | Files are organized in nested directories inside a volume. See How can you access data in Unity Catalog?. |
| File history | Use Git folders within workspaces to track file changes. | Audit logs are available. |
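The upload limits in the table above (10 MB for workspace files, 5 GB for volume uploads) can guide where an upload should land. A minimal sketch, assuming only the documented limits; the helper function and its return labels are illustrative, not a Databricks API:

```python
# Sketch: routing an upload based on the documented size limits.
# The function name and return strings are hypothetical examples.

WORKSPACE_UPLOAD_LIMIT = 10 * 1024 ** 2   # 10 MB for workspace files
VOLUME_UPLOAD_LIMIT = 5 * 1024 ** 3       # 5 GB for volume uploads

def choose_upload_target(size_bytes: int) -> str:
    """Suggest an upload destination for a file of the given size."""
    if size_bytes <= WORKSPACE_UPLOAD_LIMIT:
        return "workspace file or volume"
    if size_bytes <= VOLUME_UPLOAD_LIMIT:
        return "volume"
    return "cloud storage tooling (exceeds upload limit)"

print(choose_upload_target(2 * 1024 ** 2))    # small config file
print(choose_upload_target(800 * 1024 ** 2))  # large Parquet file
```

For files below both limits, the file-type recommendations earlier in this article, rather than size, should determine the destination.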