Can you use pandas on Azure Databricks?
Databricks Runtime includes pandas as one of the standard Python packages, allowing you to create and leverage pandas DataFrames in Databricks notebooks and jobs.
In Databricks Runtime 10.4 LTS and above, Pandas API on Spark provides familiar pandas commands on top of PySpark DataFrames. You can also convert DataFrames between pandas and PySpark.
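The conversions mentioned above can be sketched as follows. This is a minimal illustration, not the only way to do it: the helper name round_trip_through_spark is hypothetical, and it assumes you pass in an active SparkSession (in a Databricks notebook, the predefined spark variable).

```python
import pandas as pd


def round_trip_through_spark(spark, pdf: pd.DataFrame) -> pd.DataFrame:
    """Convert pandas -> PySpark -> pandas API on Spark -> pandas.

    `spark` is assumed to be an active SparkSession.
    """
    # pandas DataFrame -> PySpark DataFrame (distributed across the cluster)
    sdf = spark.createDataFrame(pdf)

    # PySpark DataFrame -> pandas-on-Spark DataFrame (pandas-style API, still distributed)
    psdf = sdf.pandas_api()

    # pandas-on-Spark DataFrame -> plain pandas; this collects every row to the driver
    return psdf.to_pandas()
```

In a notebook you might call round_trip_through_spark(spark, df). Because the final step collects all rows to the driver, it is only suitable for data that fits in driver memory.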
Apache Spark includes Arrow-optimized execution of Python logic in the form of pandas function APIs, which allow users to apply pandas transformations directly to PySpark DataFrames. Apache Spark also supports pandas UDFs, which use similar Arrow optimizations for arbitrary user functions defined in Python.
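As a rough sketch of one pandas function API, the following uses mapInPandas to run ordinary pandas code over batches of a PySpark DataFrame. The function names and the label and value columns are assumptions made for illustration; adjust the schema string to match your own data.

```python
from typing import Iterator

import pandas as pd


def add_doubled_value(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Plain pandas logic; Spark passes in Arrow-backed batches of rows."""
    for batch in batches:
        # "value" is a hypothetical column name used for this example
        yield batch.assign(doubled=batch["value"] * 2)


def apply_with_map_in_pandas(sdf):
    """Apply the pandas function to a PySpark DataFrame.

    Assumes `sdf` has a string column "label" and a long column "value".
    """
    return sdf.mapInPandas(
        add_doubled_value,
        schema="label string, value long, doubled long",
    )
```

Because add_doubled_value only depends on pandas, you can test it locally on a list of small DataFrames before running it on a cluster.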
Where does pandas store data on Azure Databricks?
You can use pandas to store data in many different locations on Azure Databricks. Your ability to store and load data from some locations depends on configurations set by workspace administrators.
Note
Databricks recommends storing production data on cloud object storage.
For quick exploration and data without sensitive information, you can safely save data using either relative paths or the DBFS, as in the following examples:
import pandas as pd
df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]])
df.to_csv("./relative_path_test.csv")
df.to_csv("/dbfs/dbfs_test.csv")
You can explore files written to the DBFS with the %fs magic command, as in the following example. Note that the /dbfs directory is the root path for these commands.
%fs ls
When you save to a relative path, the location of your file depends on where you execute your code. If you're using a Databricks notebook, your data file saves to the volume storage attached to the driver of your cluster. Data stored in this location is permanently deleted when the cluster terminates. If you're using Databricks Git folders with arbitrary file support enabled, your data saves to the root of your current project. In either case, you can explore the files written using the %sh magic command, which allows simple bash operations relative to your current root directory, as in the following example:
%sh ls
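If you want to confirm from Python where a relative path actually resolves, a small check like the following may help. It is only a sketch: it reuses the file name from the earlier example and relies on the standard os module, and the directory it prints depends on where your code runs.

```python
import os

import pandas as pd

df = pd.DataFrame([["a", 1], ["b", 2], ["c", 3]])
relative_path = "./relative_path_test.csv"
df.to_csv(relative_path)

# Resolve the relative path against the current working directory
absolute_path = os.path.abspath(relative_path)
print(f"Working directory: {os.getcwd()}")
print(f"File saved to:     {absolute_path}")
```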
For more information on how Azure Databricks stores various files, see Work with files on Azure Databricks.
How do you load data with pandas on Azure Databricks?
Azure Databricks provides a number of options to facilitate uploading data to the workspace for exploration. The preferred method to load data with pandas varies depending on how you load your data to the workspace.
If you have small data files stored alongside notebooks on your local machine, you can upload your data and code together with Git folders. You can then use relative paths to load data files.
Azure Databricks provides extensive UI-based options for data loading. Most of these options store your data as Delta tables. You can read a Delta table to a Spark DataFrame, and then convert that to a pandas DataFrame.
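Reading a table into Spark and converting it to pandas could look like the sketch below. The helper name table_to_pandas and the max_rows cap are assumptions added for illustration, and spark is assumed to be an active SparkSession (predefined in Databricks notebooks).

```python
import pandas as pd


def table_to_pandas(spark, table_name: str, max_rows: int = 10_000) -> pd.DataFrame:
    """Read a table as a Spark DataFrame and convert it to pandas.

    `max_rows` is a hypothetical safety cap: toPandas() collects all rows
    to the driver, so limiting first avoids exhausting driver memory.
    """
    sdf = spark.read.table(table_name)
    return sdf.limit(max_rows).toPandas()
```

In a notebook you might call table_to_pandas(spark, "my_table"), where my_table is a placeholder for one of your own table names.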
If you have saved data files using DBFS or relative paths, you can use DBFS or relative paths to reload those data files. The following code provides an example:
import pandas as pd
df = pd.read_csv("./relative_path_test.csv")
df = pd.read_csv("/dbfs/dbfs_test.csv")
Databricks recommends storing production data on cloud object storage. See Connect to Azure Data Lake Storage Gen2 and Blob Storage.
If you're in a Unity Catalog-enabled workspace, you can access cloud storage with external locations. See Create an external location to connect cloud storage to Azure Databricks.
You can load data directly from Azure Data Lake Storage Gen2 using pandas and a fully qualified URL. You need to provide cloud credentials to access cloud data.
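import pandas as pd
# container, storage_account, file_path, and sas_token_value are placeholders that you must define first
df = pd.read_csv(
    f"abfss://{container}@{storage_account}.dfs.core.chinacloudapi.cn/{file_path}",
    storage_options={
        "sas_token": sas_token_value
    }
)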