Tutorial: Work with SparkR SparkDataFrames on Azure Databricks
This tutorial shows you how to load and transform data using the SparkDataFrame API for SparkR in Azure Databricks.
By the end of this tutorial, you will understand what a DataFrame is and be familiar with the following tasks:
- Create a DataFrame with SparkR
- Load data into a DataFrame from files
- View and interact with a DataFrame
- Save a DataFrame
- Run SQL queries in SparkR
See also Apache SparkR API reference.
What is a DataFrame?
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects. Apache Spark DataFrames provide a rich set of functions (select columns, filter, join, aggregate) that allow you to solve common data analysis problems efficiently.
Apache Spark DataFrames are an abstraction built on top of Resilient Distributed Datasets (RDDs). Spark DataFrames and Spark SQL use a unified planning and optimization engine, allowing you to get nearly identical performance across all supported languages on Azure Databricks (Python, SQL, Scala, and R).
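For example, the same filter expressed through the DataFrame API and through SQL runs through the same planning and optimization engine. The following sketch illustrates this, assuming a running SparkR session (for example, in a Databricks notebook); the city names and populations are hypothetical values for illustration:

```r
library(SparkR)

# A small DataFrame of city data (hypothetical values)
df <- createDataFrame(data.frame(
  city = c("Boise", "Reno"),
  population = c(235684, 264165)
))

# The same filter, expressed two ways
via_api <- filter(df, df$population > 250000)   # DataFrame API
createOrReplaceTempView(df, "cities")
via_sql <- sql("SELECT * FROM cities WHERE population > 250000")  # Spark SQL

# Both go through the same optimizer; explain() prints the
# physical plan Spark will execute for each.
explain(via_api)
explain(via_sql)
```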
Requirements
To complete the following tutorial, you must meet the following requirements:
- You are logged into an Azure Databricks workspace.
- You have permission to create compute enabled with Unity Catalog.
Note
If you do not have cluster control privileges, you can still complete most of the following steps as long as you have access to a cluster.
From the sidebar on the homepage, you access Azure Databricks entities: the workspace browser, Catalog Explorer, workflows, and compute. Workspace is the root folder that stores your Azure Databricks assets, like notebooks and libraries.
Step 1: Create a DataFrame with SparkR
To learn how to navigate Azure Databricks notebooks, see Databricks notebook interface and controls.
Open a new notebook by clicking the icon.
Copy and paste the following code into the empty notebook cell, then press Shift+Enter to run the cell. The following code example creates a DataFrame named df1 with city population data and displays its contents.

# Load the SparkR package that is already preinstalled on the cluster.
library(SparkR)

# Define the data
data <- data.frame(
  rank = c(295),
  city = c("South Bend"),
  state = c("Indiana"),
  code = c("IN"),
  population = c(101190),
  price = c(112.9)
)

# Define the schema
schema <- structType(
  structField("rank", "double"),
  structField("city", "string"),
  structField("state", "string"),
  structField("code", "string"),
  structField("population", "double"),
  structField("price", "double")
)

# Create the DataFrame
df1 <- createDataFrame(data = data, schema = schema)

# Display the contents of the DataFrame
display(df1)
Step 2: Load data into a DataFrame from files
Add more city population data from the /databricks-datasets directory into df2.

To load data into DataFrame df2 from the data_geo.csv file, copy and paste the following code into the new empty notebook cell, then press Shift+Enter to run the cell.

You can load data from many supported file formats. The following example uses a dataset available in the /databricks-datasets directory, accessible from most workspaces. See Sample datasets.
# Create second dataframe from a file
df2 <- read.df("/databricks-datasets/samples/population-vs-price/data_geo.csv",
source = "csv",
header = "true",
schema = schema
)
Step 3: View and interact with your DataFrame
View and interact with your city population DataFrames using the following methods.
Combine DataFrames
Combine the contents of your first DataFrame df1 with DataFrame df2, which contains the contents of data_geo.csv.
In the notebook, use the following example code to create a new DataFrame that adds the rows of one DataFrame to another using the union operation:
# Returns a DataFrame that combines the rows of df1 and df2
df <- union(df1, df2)
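To confirm that the union worked, you can compare row counts with count(). This is a quick sketch; the counts for df2 and df depend on the contents of data_geo.csv:

```r
# Number of rows in each input and in the combined DataFrame
count(df1)  # 1 row, from Step 1
count(df2)  # rows loaded from data_geo.csv
count(df)   # sum of the two counts above
```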
View the DataFrame
To view the U.S. city data in a tabular format, use the Azure Databricks display() command in a notebook cell.
display(df)
Print the DataFrame schema
Spark uses the term schema to refer to the names and data types of the columns in the DataFrame.
Print the schema of your DataFrame by calling the printSchema() function in your notebook. Use the resulting metadata to interact with the contents of your DataFrame.
printSchema(df)
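Given the schema defined in Step 1, the output should look similar to the following:

```
root
 |-- rank: double (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- code: string (nullable = true)
 |-- population: double (nullable = true)
 |-- price: double (nullable = true)
```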
Filter rows in a DataFrame
Discover the five most populous cities in your data set by filtering rows, using filter() or where(). Use filtering to select a subset of rows to return or modify in a DataFrame. There is no difference in performance or syntax, as seen in the following example.
To select the most populous cities, continue to add new cells to your notebook and add the following code example:
# Filter rows using filter()
filtered_df <- filter(df, df$rank < 6)
display(filtered_df)
# Filter rows using where()
filtered_df <- where(df, df$rank < 6)
display(filtered_df)
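Filtering on rank works here because rank in this dataset already orders cities by population. If it did not, you could sort on population directly; the following is a sketch using arrange() and limit():

```r
# Sort by population (descending) and keep the top five rows
top5_df <- limit(arrange(df, desc(df$population)), 5)
display(top5_df)
```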
Select columns from a DataFrame
Learn which state a city is located in using the select() function. Select columns by passing one or more column names to select(), as in the following example:
# Select columns from a DataFrame
select_df <- select(df, "city", "state")
display(select_df)
Create a subset DataFrame
Create a subset DataFrame with the ten cities with the highest population and display the resulting data. Combine select and filter queries to limit rows and columns returned, using the following code in your notebook:
# Create a subset Dataframe
subset_df <- filter(select(df, "city", "rank"), df$rank < 11)
display(subset_df)
Step 4: Save the DataFrame
You can either save your DataFrame to a table or write the DataFrame to a file or multiple files.
Save the DataFrame to a table
Azure Databricks uses the Delta Lake format for all tables by default. To save your DataFrame, you must have CREATE TABLE privileges on the catalog and schema. The following example saves the contents of the DataFrame to a table named us_cities:
# Save DataFrame to a table
saveAsTable(df, "us_cities", mode = "overwrite")
Most Spark applications work on large data sets and in a distributed fashion. Spark writes out a directory of files rather than a single file. Delta Lake splits the Parquet folders and files. Many data systems can read these directories of files. Azure Databricks recommends using tables over file paths for most applications.
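Once saved, a table can be read back into a DataFrame by name, which is one reason tables are usually preferable to file paths. A minimal sketch, assuming the us_cities table created above:

```r
# Read the saved table back into a SparkDataFrame by name
df_from_table <- tableToDF("us_cities")
display(df_from_table)
```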
Save the DataFrame to JSON files
The following example saves a directory of JSON files:
# Write a DataFrame to a directory of files
write.df(df, path = "/tmp/json_data", source = "json", mode = "overwrite")
Read the DataFrame from a JSON file
# Read a DataFrame from a JSON file
df3 <- read.json("/tmp/json_data")
display(df3)
Additional tasks: Run SQL queries in SparkR
Spark DataFrames provide the following options to combine SQL with SparkR. You can run the following code in the same notebook that you created for this tutorial.
Specify a column as a SQL query
The selectExpr()
method allows you to specify each column as a SQL query, such as in the following example:
df_selected <- selectExpr(df, "`rank`", "upper(city) as big_name")
display(df_selected)
Use expr()
The expr() function allows you to use SQL syntax anywhere a column would be specified, as in the following example:
df_selected <- select(df, expr("`rank`"), expr("lower(city) as little_name"))
display(df_selected)
Run an arbitrary SQL query
You can use spark.sql()
to run arbitrary SQL queries, as in the following example:
query_df <- sql("SELECT * FROM us_cities")
display(query_df)
Parameterize SQL queries
You can use base R string functions such as paste() to parameterize SQL queries, as in the following example:
# Define the table name
table_name <- "us_cities"
# Construct the SQL query using paste()
query <- paste("SELECT * FROM", table_name)
# Run the constructed SQL query
query_df <- sql(query)
display(query_df)
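You can build more elaborate query strings the same way with functions such as sprintf(). Note that plain string concatenation does not escape values, so only interpolate trusted identifiers and values. A sketch, where the rank threshold of 6 is an arbitrary example value:

```r
# Interpolate a table name and a numeric threshold into a query string
table_name <- "us_cities"
max_rank <- 6
query <- sprintf("SELECT city, population FROM %s WHERE rank < %d",
                 table_name, max_rank)
query_df <- sql(query)
display(query_df)
```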