Comparing SparkR and sparklyr
R users can choose between two APIs for Apache Spark: SparkR and sparklyr. This article compares the two. Databricks recommends that you choose one of these APIs when you develop a Spark application in R, because combining code from both APIs in a single script, Azure Databricks notebook, or job can make your code more difficult to read and maintain.
API origins
SparkR is built by the Spark community and developers from Databricks. Because of this, SparkR closely follows the Spark Scala classes and DataFrame API.
sparklyr started with RStudio and has since been donated to the Linux Foundation. sparklyr is tightly integrated into the tidyverse in both its programming style and through API interoperability with dplyr.
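For example, a simple filter-and-select reads quite differently in each API: SparkR mirrors the Spark DataFrame API, while sparklyr pipes dplyr verbs over a remote table. The following is a minimal sketch, not a definitive pattern; it assumes the airlinesDF and airlines_sdf objects that are created in the code examples later in this article.
# SparkR style: functions that mirror the Spark Scala DataFrame API.
aaDF <- SparkR::select(
  SparkR::filter(airlinesDF, airlinesDF$UniqueCarrier == "AA"),
  "UniqueCarrier", "DepDelay"
)
# sparklyr style: dplyr verbs piped over the remote table.
library(dplyr)
aa_sdf <- airlines_sdf %>%
  filter(UniqueCarrier == "AA") %>%
  select(UniqueCarrier, DepDelay)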
Both SparkR and sparklyr are capable of working with big data in R, and over the past few years their feature sets have come closer to parity.
API differences
The following code example shows how to use SparkR and sparklyr from an Azure Databricks notebook to read a CSV file from the Sample datasets into Spark.
# #############################################################################
# SparkR usage
# Note: To load SparkR into a Databricks notebook, run the following:
# library(SparkR)
# You can then remove "SparkR::" from the following function call.
# #############################################################################
# Use SparkR to read the airlines dataset from 2008.
airlinesDF <- SparkR::read.df(path = "/databricks-datasets/asa/airlines/2008.csv",
                              source = "csv",
                              inferSchema = "true",
                              header = "true")
# Print the loaded dataset's class name.
cat("Class of SparkR object: ", class(airlinesDF), "\n")
# Output:
#
# Class of SparkR object: SparkDataFrame
# #############################################################################
# sparklyr usage
# Note: To install, load, and connect with sparklyr in a Databricks notebook,
# run the following:
# install.packages("sparklyr")
# library(sparklyr)
# sc <- sparklyr::spark_connect(method = "databricks")
# If you run "library(sparklyr)", you can then remove "sparklyr::" from the
# preceding "spark_connect" and from the following function call.
# #############################################################################
# Use sparklyr to read the airlines dataset from 2007.
airlines_sdf <- sparklyr::spark_read_csv(sc = sc,
                                         name = "airlines",
                                         path = "/databricks-datasets/asa/airlines/2007.csv")
# Print the loaded dataset's class name.
cat("Class of sparklyr object: ", class(airlines_sdf))
# Output:
#
# Class of sparklyr object: tbl_spark tbl_sql tbl_lazy tbl
However, if you try to run a sparklyr function on a SparkDataFrame object from SparkR, or a SparkR function on a tbl_spark object from sparklyr, the call fails, as shown in the following code example.
# Try to call a sparklyr function on a SparkR SparkDataFrame object. It will not work.
sparklyr::sdf_pivot(airlinesDF, DepDelay ~ UniqueCarrier)
# Output:
#
# Error : Unable to retrieve a Spark DataFrame from object of class SparkDataFrame
# Now try to call a SparkR function on a sparklyr tbl_spark object. It also will not work.
SparkR::arrange(airlines_sdf, "DepDelay")
# Output:
#
# Error in (function (classes, fdef, mtable) :
# unable to find an inherited method for function 'arrange' for signature '"tbl_spark", "character"'
This is because sparklyr translates dplyr functions such as arrange into a SQL query plan that is run by Spark SQL. That is not the case with SparkR, which provides its own functions for Spark SQL tables and Spark DataFrames. These incompatibilities are why Databricks does not recommend combining the SparkR and sparklyr APIs in the same script, notebook, or job.
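You can see this translation directly by calling dplyr's show_query function on a sparklyr pipeline, which prints the SQL that sparklyr will send to Spark SQL. The following minimal sketch reuses the airlines_sdf object from the earlier example; the exact SQL text can vary by sparklyr and dbplyr version.
# Print the SQL that sparklyr generates for a dplyr pipeline.
library(dplyr)
airlines_sdf %>%
  arrange(DepDelay) %>%
  show_query()
# Output (approximate):
#
# <SQL>
# SELECT *
# FROM `airlines`
# ORDER BY `DepDelay`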
API interoperability
In rare cases where you cannot avoid combining the SparkR and sparklyr APIs, you can use Spark SQL as a bridge. For instance, in this article's first example, sparklyr loaded the airlines dataset from 2007 into a table named airlines. You can use the SparkR sql function to query this table, for example:
top10delaysDF <- SparkR::sql("SELECT
UniqueCarrier,
DepDelay,
Origin
FROM
airlines
WHERE
DepDelay NOT LIKE 'NA'
ORDER BY DepDelay
DESC LIMIT 10")
# Print the class name of the query result.
cat("Class of top10delaysDF: ", class(top10delaysDF), "\n\n")
# Show the query result.
cat("Top 10 airline delays for 2007:\n\n")
head(top10delaysDF, 10)
# Output:
#
# Class of top10delaysDF: SparkDataFrame
#
# Top 10 airline delays for 2007:
#
# UniqueCarrier DepDelay Origin
# 1 AA 999 RNO
# 2 NW 999 EWR
# 3 AA 999 PHL
# 4 MQ 998 RST
# 5 9E 997 SWF
# 6 AA 996 DFW
# 7 NW 996 DEN
# 8 MQ 995 IND
# 9 MQ 994 SJT
# 10 AA 993 MSY
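The bridge also works in the other direction. One option, sketched below under the assumption that the airlinesDF object and the sc connection from the earlier examples are available, is to register a SparkR SparkDataFrame as a temporary view and then read that view into sparklyr as a tbl_spark. The view name airlines_2008 here is just an illustrative choice.
# Register the SparkR SparkDataFrame as a temporary view.
SparkR::createOrReplaceTempView(airlinesDF, "airlines_2008")
# Read the temporary view back into sparklyr as a tbl_spark.
airlines_2008_sdf <- dplyr::tbl(sc, "airlines_2008")
cat("Class of airlines_2008_sdf: ", class(airlines_2008_sdf))
# Output:
#
# Class of airlines_2008_sdf: tbl_spark tbl_sql tbl_lazy tbl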
For additional examples, see Work with DataFrames and tables in R.