Migrate from SparkR to sparklyr

SparkR is developed as part of Apache Spark, and its design is familiar to Scala and Python users but can feel less intuitive to R practitioners. In addition, SparkR is deprecated in Spark 4.0.

In contrast, sparklyr focuses on providing a friendlier R experience. It uses dplyr syntax for DataFrame operations, matching patterns familiar to tidyverse users such as select(), filter(), and mutate().

sparklyr is the recommended R package for working with Apache Spark. This page describes the differences between the SparkR and sparklyr APIs and provides information about migrating code.

Environment setup

Installation

If you are working in an Azure Databricks workspace, no installation is needed. Load sparklyr with library(sparklyr). To install sparklyr locally outside Azure Databricks, see Get Started.
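
A minimal sketch of a local installation, assuming you are working outside an Azure Databricks workspace (the Spark version passed to spark_install() is illustrative):

# Install sparklyr from CRAN
install.packages("sparklyr")

# Optionally download a local Spark distribution for standalone use
library(sparklyr)
spark_install(version = "3.5")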

Connect to Spark

Use sparklyr to connect to Spark from within a Databricks workspace, or locally through Databricks Connect:

Workspace

library(sparklyr)
sc <- spark_connect(method = "databricks")

Databricks Connect

sc <- spark_connect(method = "databricks_connect")

For more details and an extended tutorial on using Databricks Connect with sparklyr, see Get started.

Read and write data

sparklyr has a family of spark_read_*() and spark_write_*() functions for loading and saving data, unlike SparkR's generic read.df() and write.df() functions. There are also dedicated functions for creating a Spark DataFrame or a Spark SQL temporary view from an in-memory R data frame.

Task | SparkR | sparklyr
Copy data to Spark | createDataFrame() | copy_to()
Create a temporary view | createOrReplaceTempView() | Use invoke() with the method directly
Write data to a table | saveAsTable() | spark_write_table()
Write data in a specified format | write.df() | spark_write_<format>()
Read data from a table | tableToDF() | tbl() or spark_read_table()
Read data in a specified format | read.df() | spark_read_<format>()

Load data

To convert an R data frame into a Spark DataFrame, or to create a temporary view from it so that SQL can be applied to it:

SparkR

mtcars_df <- createDataFrame(mtcars)

sparklyr

mtcars_tbl <- copy_to(
  sc,
  df = mtcars,
  name = "mtcars_tmp",
  overwrite = TRUE,
  memory = FALSE
)

copy_to() creates a temporary view with the specified name. When using SQL directly (for example, with sdf_sql()), you can refer to the data by that name. You can also cache the data by setting the memory argument of copy_to() to TRUE.
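
A brief sketch of both behaviors, using the mtcars_tmp view created above (the query and the mtcars_cached name are illustrative):

# Reference the temporary view by name from SQL
sdf_sql(sc, "SELECT mpg, cyl, hp FROM mtcars_tmp WHERE cyl = 6")

# Cache the data in Spark memory at copy time by setting memory = TRUE
mtcars_cached_tbl <- copy_to(
  sc,
  df = mtcars,
  name = "mtcars_cached",
  overwrite = TRUE,
  memory = TRUE
)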

Create a view

The following code examples show how to create a temporary view:

SparkR

createOrReplaceTempView(mtcars_df, "mtcars_tmp_view")

sparklyr

spark_dataframe(mtcars_tbl) |>
  invoke("createOrReplaceTempView", "mtcars_tmp_view")

Write data

The following code examples show how to write data:

SparkR

# Save a DataFrame to Unity Catalog
saveAsTable(
  mtcars_df,
  tableName = "<catalog>.<schema>.<table>",
  mode = "overwrite"
)

# Save a DataFrame to local filesystem using Delta format
write.df(
  mtcars_df,
  path = "file:/<path/to/save/delta/mtcars>",
  source = "delta",
  mode = "overwrite"
)

sparklyr

# Save tbl_spark to Unity Catalog
spark_write_table(
  mtcars_tbl,
  name = "<catalog>.<schema>.<table>",
  mode = "overwrite"
)

# Save tbl_spark to local filesystem using Delta format
spark_write_delta(
  mtcars_tbl,
  path = "file:/<path/to/save/delta/mtcars>",
  mode = "overwrite"
)

# Use DBI
library(DBI)
dbWriteTable(
  sc,
  value = mtcars_tbl,
  name = "<catalog>.<schema>.<table>",
  overwrite = TRUE
)

Read data

The following code examples show how to read data:

SparkR

# Load a Unity Catalog table as a DataFrame
tableToDF("<catalog>.<schema>.<table>")

# Load csv file into a DataFrame
read.df(
  path = "file:/<path/to/read/csv/data.csv>",
  source = "csv",
  header = TRUE,
  inferSchema = TRUE
)

# Load Delta from local filesystem as a DataFrame
read.df(
  path = "file:/<path/to/read/delta/mtcars>",
  source = "delta"
)

# Load data from a table using SQL - Databricks recommends using `tableToDF`
sql("SELECT * FROM <catalog>.<schema>.<table>")

sparklyr

# Load table from Unity Catalog with dplyr
tbl(sc, "<catalog>.<schema>.<table>")

# or using `in_catalog`
tbl(sc, in_catalog("<catalog>", "<schema>", "<table>"))

# Load csv from local filesystem as tbl_spark
spark_read_csv(
  sc,
  name = "mtcars_csv",
  path = "file:/<path/to/csv/mtcars>",
  header = TRUE,
  infer_schema = TRUE
)

# Load delta from local filesystem as tbl_spark
spark_read_delta(
  sc,
  name = "mtcars_delta",
  path = "file:/tmp/test/sparklyr1"
)

# Load data using SQL
sdf_sql(sc, "SELECT * FROM <catalog>.<schema>.<table>")

Work with data

Select and filter

SparkR

# Select specific columns
select(mtcars_df, "mpg", "cyl", "hp")

# Filter rows where mpg > 20
filter(mtcars_df, mtcars_df$mpg > 20)

sparklyr

# Select specific columns
mtcars_tbl |>
  select(mpg, cyl, hp)

# Filter rows where mpg > 20
mtcars_tbl |>
  filter(mpg > 20)

Add columns

SparkR

# Add a new column 'power_to_weight' (hp divided by wt)
withColumn(mtcars_df, "power_to_weight", mtcars_df$hp / mtcars_df$wt)

sparklyr

# Add a new column 'power_to_weight' (hp divided by wt)
mtcars_tbl |>
  mutate(power_to_weight = hp / wt)

Group and aggregate

SparkR

# Calculate average mpg and hp by number of cylinders
mtcars_df |>
  groupBy("cyl") |>
  summarize(
    avg_mpg = avg(mtcars_df$mpg),
    avg_hp = avg(mtcars_df$hp)
  )

sparklyr

# Calculate average mpg and hp by number of cylinders
mtcars_tbl |>
  group_by(cyl) |>
  summarize(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp)
  )

Joins

Suppose we have another dataset that contains cylinder labels to join to mtcars.

SparkR

# Create another DataFrame with cylinder labels
cylinders <- data.frame(
  cyl = c(4, 6, 8),
  cyl_label = c("Four", "Six", "Eight")
)
cylinders_df <- createDataFrame(cylinders)

# Join mtcars_df with cylinders_df
join(
  x = mtcars_df,
  y = cylinders_df,
  mtcars_df$cyl == cylinders_df$cyl,
  joinType = "inner"
)

sparklyr

# Create another SparkDataFrame with cylinder labels
cylinders <- data.frame(
  cyl = c(4, 6, 8),
  cyl_label = c("Four", "Six", "Eight")
)
cylinders_tbl <- copy_to(sc, cylinders, "cylinders", overwrite = TRUE)

# join mtcars_df with cylinders_tbl
mtcars_tbl |>
  inner_join(cylinders_tbl, by = join_by(cyl))

User-defined functions (UDFs)

To create a custom function for categorization:

# Define the custom function
categorize_hp <- function(df) {
  df$hp_category <- ifelse(df$hp > 150, "High", "Low") # a real-world example would use case_when() with mutate()
  df
}

SparkR

SparkR requires the output schema to be defined explicitly before applying the function:

# Define the schema for the output DataFrame
schema <- structType(
  structField("mpg", "double"),
  structField("cyl", "double"),
  structField("disp", "double"),
  structField("hp", "double"),
  structField("drat", "double"),
  structField("wt", "double"),
  structField("qsec", "double"),
  structField("vs", "double"),
  structField("am", "double"),
  structField("gear", "double"),
  structField("carb", "double"),
  structField("hp_category", "string")
)

# Apply the function across partitions
dapply(
  mtcars_df,
  func = categorize_hp,
  schema = schema
)

# Apply the same function to each group of a DataFrame. Note that the schema is still required.
gapply(
  mtcars_df,
  cols = "hp",
  func = categorize_hp,
  schema = schema
)

sparklyr

# Load Arrow to avoid cryptic errors
library(arrow)

# Apply the function over data.
# By default this applies to each partition.
mtcars_tbl |>
  spark_apply(f = categorize_hp)

# Apply the function over data
# Use `group_by` to apply data over groups
mtcars_tbl |>
  spark_apply(
    f = summary,
    group_by = "hp" # This isn't changing the resulting output as the functions behavior is applied to rows independently.
  )

spark.lapply() vs spark_apply()

In SparkR, spark.lapply() operates on R lists rather than DataFrames. sparklyr has no direct equivalent, but you can achieve similar behavior with spark_apply() by using a DataFrame that contains unique identifiers and grouping by those IDs. In some cases, row-wise operations can also provide similar functionality. For more information about spark_apply(), see Distributed R computations.

SparkR

# Define a list of integers
numbers <- list(1, 2, 3, 4, 5)

# Define a function to apply
square <- function(x)
  x * x

# Apply the function over list using Spark
spark.lapply(numbers, square)

sparklyr

# Create a DataFrame of given length
sdf <- sdf_len(sc, 5, repartition = 1)

# Apply function to each partition of the DataFrame
# spark_apply() defaults to processing data based on number of partitions.
# In this case it will return a single row due to repartition = 1.
spark_apply(sdf, f = nrow)

# Apply function to each row (option 1)
# To force behaviour like spark.lapply(), create a DataFrame with N rows and force grouping with group_by set to a unique row identifier. In this case it's the id column automatically generated by sdf_len(). This will return N rows.
spark_apply(sdf, f = nrow, group_by = "id")

# Apply function to each row (option 2)
# This requires writing a function that operates across rows of a data.frame; in some cases this may be faster than option 1. Specifying group_by is optional for this example. The example does not require rowwise(), but illustrates one way to force computations to run for every row.
row_func <- function(df)
  df |>
    dplyr::rowwise() |>
    dplyr::mutate(x = id * 2)

spark_apply(sdf, f = row_func)

Machine learning

Complete SparkR and sparklyr examples for machine learning are available in the Spark ML guide and the sparklyr reference.

Note

If you are not using Spark MLlib, Databricks recommends training with the library of your choice (for example, xgboost) inside a UDF.
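
As a hedged sketch of this pattern (not from the original guide), the UDF below fits a separate model per cylinder group with spark_apply(); lm() stands in for a library such as xgboost, and the fit_per_group helper name is illustrative:

# Train one model per group inside a UDF and return its coefficients
fit_per_group <- function(df) {
  # df is the local R data.frame for a single group
  model <- lm(mpg ~ hp + wt, data = df)
  data.frame(t(coef(model)))
}

mtcars_tbl |>
  spark_apply(
    f = fit_per_group,
    group_by = "cyl"
  )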

Linear regression

SparkR

# Select features
training_df <- select(mtcars_df, "mpg", "hp", "wt")

# Fit the model using Generalized Linear Model (GLM)
linear_model <- spark.glm(training_df, mpg ~ hp + wt, family = "gaussian")

# View model summary
summary(linear_model)

sparklyr

# Select features
training_tbl <- mtcars_tbl |>
  select(mpg, hp, wt)

# Fit the model using Generalized Linear Model (GLM)
linear_model <- training_tbl |>
  ml_linear_regression(response = "mpg", features = c("hp", "wt"))

# View model summary
summary(linear_model)

K-Means clustering

SparkR

# Apply KMeans clustering with 3 clusters using mpg and hp as features
kmeans_model <- spark.kmeans(mtcars_df, mpg ~ hp, k = 3)

# Get cluster predictions
predict(kmeans_model, mtcars_df)

sparklyr

# Use mpg and hp as features
features_tbl <- mtcars_tbl |>
  select(mpg, hp)

# Assemble features into a vector column
features_vector_tbl <- features_tbl |>
  ft_vector_assembler(
    input_cols = c("mpg", "hp"),
    output_col = "features"
  )

# Apply K-Means clustering
kmeans_model <- features_vector_tbl |>
  ml_kmeans(features_col = "features", k = 3)

# Get cluster predictions
ml_predict(kmeans_model, features_vector_tbl)

Performance and optimization

Collect

Both SparkR and sparklyr use collect() to convert a Spark DataFrame into an R data frame. Collect only small amounts of data back into an R data frame; otherwise the Spark driver may run out of memory.
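
For example, a small filtered subset can be collected safely in either API (a sketch; the cyl filter is illustrative):

# SparkR: collect a filtered SparkDataFrame into an R data.frame
local_df <- collect(filter(mtcars_df, mtcars_df$cyl == 6))

# sparklyr: collect a filtered tbl_spark into an R data.frame
local_df <- mtcars_tbl |>
  filter(cyl == 6) |>
  collect()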

To help prevent out-of-memory errors, SparkR has optimizations built into Databricks Runtime that assist with collecting data and executing user-defined functions.

To ensure the best performance when sparklyr collects data or runs UDFs on Databricks Runtime versions below 14.3 LTS, load the arrow package:

library(arrow)

Partitioning in memory

SparkR

# Repartition the SparkDataFrame based on 'cyl' column
repartition(mtcars_df, col = mtcars_df$cyl)

# Repartition the SparkDataFrame to number of partitions
repartition(mtcars_df, numPartitions = 10)

# Coalesce the DataFrame to number of partitions
coalesce(mtcars_df, numPartitions = 1)

# Get number of partitions
getNumPartitions(mtcars_df)

sparklyr

# Repartition the tbl_spark based on 'cyl' column
sdf_repartition(mtcars_tbl, partition_by = "cyl")

# Repartition the tbl_spark to number of partitions
sdf_repartition(mtcars_tbl, partitions = 10)

# Coalesce the tbl_spark to number of partitions
sdf_coalesce(mtcars_tbl, partitions = 1)

# Get number of partitions
sdf_num_partitions(mtcars_tbl)

Caching

SparkR

# Cache the DataFrame in memory
cache(mtcars_df)

sparklyr

# Cache the tbl_spark in memory
tbl_cache(sc, name = "mtcars_tmp")