sparklyr

Azure Databricks supports sparklyr in notebooks, jobs, and RStudio Desktop.

Requirements

Azure Databricks distributes the latest stable version of sparklyr with every runtime release. You can use sparklyr in Azure Databricks R notebooks or inside RStudio Server hosted on Azure Databricks by importing the installed version of sparklyr.

In RStudio Desktop, Databricks Connect allows you to connect sparklyr from your local machine to Azure Databricks clusters and run Apache Spark code. See Use sparklyr and RStudio Desktop with Databricks Connect.

Connect sparklyr to Azure Databricks clusters

To establish a sparklyr connection, you can use "databricks" as the connection method in spark_connect(). No additional parameters to spark_connect() are needed, and you do not need to call spark_install(), because Spark is already installed on the Azure Databricks cluster.

library(sparklyr)

# create a sparklyr connection
sc <- spark_connect(method = "databricks")

Progress bars and Spark UI with sparklyr

If you assign the sparklyr connection object to a variable named sc as in the above example, you will see Spark progress bars in the notebook after each command that triggers Spark jobs. In addition, you can click the link next to the progress bar to view the Spark UI associated with the given Spark job.

Sparklyr progress

Use sparklyr

After you install sparklyr and establish the connection, all other sparklyr APIs work as they normally do. See the example notebook for some examples.

sparklyr is usually used along with other tidyverse packages such as dplyr. Most of these packages are preinstalled on Databricks for your convenience. You can simply import them and start using the APIs.
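For instance, here is a minimal sketch of the typical sparklyr-plus-dplyr workflow. It assumes the connection sc from the earlier example and uses mtcars, one of R's built-in datasets.

library(dplyr)

# Copy a local R data frame into Spark, then summarize it with
# ordinary dplyr verbs; sparklyr translates them to Spark SQL.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()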

Use sparklyr and SparkR together

SparkR and sparklyr can be used together in a single notebook or job. You can import SparkR along with sparklyr and use its functionality. In Azure Databricks notebooks, the SparkR connection is pre-configured.
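A minimal sketch of the two APIs working side by side; it assumes a notebook where the SparkR session is pre-configured and the cluster provides the sparklyr connection, as described above.

library(SparkR)
library(sparklyr)

# sparklyr connection to the cluster's preinstalled Spark
sc <- spark_connect(method = "databricks")

# A small local data frame used by both APIs
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# sparklyr: copy the data frame to Spark
df_tbl <- sdf_copy_to(sc, df, overwrite = TRUE)

# SparkR: the session is pre-configured, so createDataFrame works directly;
# qualified calls sidestep the masking described below
df_sdf <- SparkR::createDataFrame(df)
SparkR::head(df_sdf)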

Some of the functions in SparkR mask a number of functions in dplyr:

> library(SparkR)
The following objects are masked from ‘package:dplyr’:

arrange, between, coalesce, collect, contains, count, cume_dist,
dense_rank, desc, distinct, explain, filter, first, group_by,
intersect, lag, last, lead, mutate, n, n_distinct, ntile,
percent_rank, rename, row_number, sample_frac, select, sql,
summarize, union

If you import SparkR after you import dplyr, you can reference the functions in dplyr by using fully qualified names, for example, dplyr::arrange(). Similarly, if you import dplyr after SparkR, the functions in SparkR are masked by dplyr.
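For example, in this hypothetical snippet dplyr is attached first and SparkR second, so an unqualified call dispatches to SparkR and the dplyr verb must be qualified:

library(dplyr)
library(SparkR)  # attached after dplyr, so SparkR::filter() now masks dplyr::filter()

# An unqualified filter() would dispatch to SparkR; qualify the call
# to use the dplyr verb on a local data frame.
mtcars %>% dplyr::filter(cyl == 4)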

Alternatively, you can selectively detach one of the two packages when you do not need it.

detach("package:dplyr")

Use sparklyr in spark-submit jobs

You can run scripts that use sparklyr on Azure Databricks as spark-submit jobs, with minor code modifications. Some of the instructions above do not apply to using sparklyr in spark-submit jobs on Azure Databricks. In particular, you must provide the Spark master URL to spark_connect. For an example, see Create and run a spark-submit job for R scripts.
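As a sketch, a script run via spark-submit might establish its connection like this; "<master-url>" is a placeholder, not a literal value, and stands for the master URL available to your job.

library(sparklyr)

# In a spark-submit job, pass the Spark master URL explicitly
# instead of relying on method = "databricks".
sc <- spark_connect(master = "<master-url>")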

Unsupported features

Azure Databricks does not support sparklyr methods such as spark_web() and spark_log() that require a local browser. However, since the Spark UI is built into Azure Databricks, you can inspect Spark jobs and logs easily. See Cluster driver and worker logs.

Sparklyr notebook

Get notebook