RStudio on Azure Databricks

Azure Databricks integrates with RStudio Server, the popular integrated development environment (IDE) for R.

You can use either the Open Source or the Pro edition of RStudio Server on Azure Databricks. If you want to use RStudio Server Pro, you must transfer your existing RStudio Pro license to Azure Databricks (see Get started with RStudio Server Pro).

Databricks Runtime for Machine Learning includes an unmodified version of the RStudio Server Version 1.2 Open Source package, the source code for which is available on GitHub. Databricks Runtime ML does not support RStudio Server Version 1.3.

RStudio integration architecture

When you use RStudio Server on Azure Databricks, the RStudio Server daemon runs on the driver (or master) node of an Azure Databricks cluster. The RStudio web UI is proxied through the Azure Databricks webapp, so you do not need to change your cluster network configuration. The following diagram shows the RStudio integration component architecture.

Architecture of RStudio on Databricks

Warning

Azure Databricks proxies the RStudio web service from port 8787 on the cluster's Spark driver. This web proxy is intended for use only with RStudio. If you launch other web services on port 8787, you might expose your users to potential security exploits. Neither Databricks nor Microsoft is responsible for any issues that result from the installation of unsupported software on a cluster.

Requirements

  • The cluster must not have table access control or automatic termination enabled.
  • You must have Can Attach To permission for that cluster. The cluster admin can grant you this permission. See Cluster access control.
  • If you want to use the Pro edition, you need an RStudio Server floating Pro license.
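
The first requirement can be checked against a cluster's JSON description before RStudio is set up. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the cluster description exposes an autotermination_minutes field (with 0 or a missing field meaning disabled) and that table access control appears as the Spark configuration key spark.databricks.acl.dfAclsEnabled. Verify both assumptions against your own workspace before relying on it.

```python
from typing import Any, Dict, List


def rstudio_requirement_problems(cluster: Dict[str, Any]) -> List[str]:
    """Return the reasons a cluster description fails the requirements above."""
    problems = []

    # Assumption: 0 (or a missing field) means automatic termination is off.
    if cluster.get("autotermination_minutes", 0) != 0:
        problems.append("automatic termination is enabled")

    # Assumption: this Spark conf key is how table access control is enabled.
    spark_conf = cluster.get("spark_conf", {})
    acl_flag = str(spark_conf.get("spark.databricks.acl.dfAclsEnabled", "false"))
    if acl_flag.lower() == "true":
        problems.append("table access control is enabled")

    return problems
```

An empty list means neither setting was detected. The check does not cover the Can Attach To permission or the Pro license.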

Get started with RStudio Server Open Source

Important

RStudio Server Open Source is installed on Databricks Runtime 7.0 ML. If you are using Databricks Runtime 7.0 ML or above, you can skip the section on installing RStudio Server.

To get started with RStudio Server Open Source on Azure Databricks, you must install RStudio on an Azure Databricks cluster. You need to perform this installation only once. Installation is usually performed by an administrator.

Install RStudio Server Open Source

To set up RStudio Server Open Source on an Azure Databricks cluster, you must create an init script that installs the RStudio Server Open Source binary package. See Cluster-scoped init scripts for more details. The following example notebook cell writes such an init script to a location on DBFS.

script = """#!/bin/bash

set -euxo pipefail
RSTUDIO_BIN="/usr/sbin/rstudio-server"

if [[ ! -f "$RSTUDIO_BIN" && $DB_IS_DRIVER = "TRUE" ]]; then
  apt-get update
  apt-get install -y gdebi-core
  cd /tmp
  # You can find new releases at https://rstudio.com/products/rstudio/download-server/debian-ubuntu/.
  wget https://download2.rstudio.org/server/trusty/amd64/rstudio-server-1.2.5001-amd64.deb
  sudo gdebi -n rstudio-server-1.2.5001-amd64.deb
  rstudio-server restart || true
fi
"""

dbutils.fs.mkdirs("/databricks/rstudio")
dbutils.fs.put("/databricks/rstudio/rstudio-install.sh", script, True)
  1. Run the code in a notebook to install the script at dbfs:/databricks/rstudio/rstudio-install.sh.
  2. Before launching a cluster, add dbfs:/databricks/rstudio/rstudio-install.sh as an init script. See Diagnostic logs for details.
  3. Launch the cluster.
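
If you create clusters from a JSON specification rather than through the UI, step 2 amounts to listing the script under the specification's init scripts. The following is a minimal sketch, assuming the specification accepts an init_scripts list whose entries have the form {"dbfs": {"destination": ...}}. The helper name is hypothetical.

```python
import copy
from typing import Any, Dict

INIT_SCRIPT_PATH = "dbfs:/databricks/rstudio/rstudio-install.sh"


def with_init_script(cluster_spec: Dict[str, Any],
                     path: str = INIT_SCRIPT_PATH) -> Dict[str, Any]:
    """Return a copy of cluster_spec that lists `path` as a DBFS init script."""
    spec = copy.deepcopy(cluster_spec)  # leave the caller's dict untouched
    scripts = spec.setdefault("init_scripts", [])
    entry = {"dbfs": {"destination": path}}
    if entry not in scripts:  # avoid listing the same script twice
        scripts.append(entry)
    return spec
```

Because the helper returns a copy and skips duplicates, applying it more than once to the same specification has no further effect.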

Use RStudio Server Open Source

  1. Display the details of the cluster on which you installed RStudio and click the Apps tab:

    Cluster Apps tab

  2. In the Apps tab, click the Set up RStudio button. This generates a one-time password for you. Click the show link to display the password, then copy it.

    RStudio one-time password

  3. Click the Open RStudio UI link to open the UI in a new tab. Enter your username and password in the login form and sign in.

    RStudio login form

  4. From the RStudio UI, you can import the SparkR package and set up a SparkR session to launch Spark jobs on your cluster.

    library(SparkR)
    sparkR.session()
    

    RStudio session

  5. You can also attach the sparklyr package and set up a Spark connection.

    SparkR::sparkR.session()
    library(sparklyr)
    sc <- spark_connect(method = "databricks")
    

    RStudio session with sparklyr

Get started with RStudio Server Pro

Set up RStudio license server

To use RStudio Server Pro on Azure Databricks, you need to convert your Pro license to a floating license. For assistance, contact help@rstudio.com. After your license is converted, you must set up a license server for RStudio Server Pro.

To set up a license server:

  1. Launch a small instance on your cloud provider network; the license server daemon is very lightweight.
  2. Download and install the corresponding version of RStudio License Server on your instance, and start the service. For detailed instructions, see the RStudio Server Pro documentation.
  3. Make sure that the license server port is open to Azure Databricks instances.
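
One way to test step 3 is to attempt a TCP connection from a notebook attached to the cluster. The sketch below uses only the Python standard library. The function name is hypothetical, and the host and port you pass in are whatever your own license server uses; neither is specified here.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A True result only shows that the port is reachable from the node where the code runs. It does not confirm that the license server itself is healthy.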

Install RStudio Server Pro

To set up RStudio Server Pro on an Azure Databricks cluster, you must create an init script that installs the RStudio Server Pro binary package and configures it to use your license server for license leases. See Diagnostic logs for more details.

Note

If you plan to install RStudio Server Pro on a Databricks Runtime version that already includes the RStudio Server Open Source package, you must first uninstall that package for the installation to succeed.

The following example notebook cell generates an init script on DBFS. The script also performs additional authentication configuration that streamlines integration with Azure Databricks.

script = """#!/bin/bash

set -euxo pipefail

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  sudo apt-get update
  sudo dpkg --purge rstudio-server # in case open source version is installed.
  sudo apt-get install -y gdebi-core alien

  ## Installing RStudio Server Pro
  cd /tmp

  # You can find new releases at https://rstudio.com/products/rstudio/download-commercial/debian-ubuntu/.
  wget https://download2.rstudio.org/server/trusty/amd64/rstudio-server-pro-1.2.5001-3-amd64.deb
  sudo gdebi -n rstudio-server-pro-1.2.5001-3-amd64.deb

  ## Configuring authentication
  # Append with tee so the write itself runs under sudo; with "sudo echo ... >> file"
  # the redirection is performed by the calling shell, not by sudo.
  echo 'auth-proxy=1' | sudo tee -a /etc/rstudio/rserver.conf > /dev/null
  echo 'auth-proxy-user-header-rewrite=^(.*)$ $1' | sudo tee -a /etc/rstudio/rserver.conf > /dev/null
  echo 'auth-proxy-sign-in-url=<domain>/login.html' | sudo tee -a /etc/rstudio/rserver.conf > /dev/null
  echo 'admin-enabled=1' | sudo tee -a /etc/rstudio/rserver.conf > /dev/null
  echo 'export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' | sudo tee -a /etc/rstudio/rsession-profile > /dev/null

  # Enabling floating license
  echo 'server-license-type=remote' | sudo tee -a /etc/rstudio/rserver.conf > /dev/null

  # Session configurations
  echo 'session-rprofile-on-resume-default=1' | sudo tee -a /etc/rstudio/rsession.conf > /dev/null
  echo 'allow-terminal-websockets=0' | sudo tee -a /etc/rstudio/rsession.conf > /dev/null

  sudo rstudio-server license-manager license-server <license-server-url>
  sudo rstudio-server restart || true
fi
"""

dbutils.fs.mkdirs("/databricks/rstudio")
dbutils.fs.put("/databricks/rstudio/rstudio-install.sh", script, True)
  1. Replace <domain> with your Azure Databricks URL and <license-server-url> with the URL of your floating license server.
  2. Run the code in a notebook to install the script at dbfs:/databricks/rstudio/rstudio-install.sh.
  3. Before launching a cluster, add dbfs:/databricks/rstudio/rstudio-install.sh as an init script. See Diagnostic logs for details.
  4. Launch the cluster.
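
Step 1 can be done by hand, or the two placeholders can be substituted in the notebook before the script is written to DBFS. The following sketch shows one possible approach. The function name is hypothetical, and the values in the usage note are placeholders, not real endpoints.

```python
def fill_placeholders(template: str, domain: str, license_server_url: str) -> str:
    """Substitute <domain> and <license-server-url> in the init script text."""
    values = {
        "<domain>": domain,
        "<license-server-url>": license_server_url,
    }
    for placeholder, value in values.items():
        # Fail early rather than write a script that still has gaps in it.
        if placeholder not in template:
            raise ValueError(f"template has no {placeholder} placeholder")
        if not value:
            raise ValueError(f"no value given for {placeholder}")
        template = template.replace(placeholder, value)
    return template
```

For example, fill_placeholders(script, "https://<your-workspace-url>", "<your-license-server>") would return the text to pass to dbutils.fs.put in place of script.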

Use RStudio Server Pro

  1. Display the details of the cluster on which you installed RStudio and click the Apps tab:

    Cluster Apps tab

  2. In the Apps tab, click the Set up RStudio button.

    RStudio one-time password

  3. You do not need the one-time password. Click the Open RStudio UI link to open an authenticated RStudio Pro session.

  4. From the RStudio UI, you can attach the SparkR package and set up a SparkR session to launch Spark jobs on your cluster.

    library(SparkR)
    sparkR.session()
    

    RStudio session

  5. You can also attach the sparklyr package and set up a Spark connection.

    SparkR::sparkR.session()
    library(sparklyr)
    sc <- spark_connect(method = "databricks")
    

    RStudio session with sparklyr

Frequently asked questions (FAQ)

What is the difference between RStudio Server Open Source and RStudio Server Pro?

RStudio Server Pro supports a wide range of enterprise features that are not available in the Open Source edition. You can see a feature comparison on the RStudio Inc. website.

In addition, RStudio Server Open Source is distributed under the GNU Affero General Public License (AGPL), while the Pro version comes with a commercial license for organizations that are not able to use AGPL software.

Finally, RStudio Server Pro comes with professional and enterprise support from RStudio Inc., while RStudio Server Open Source comes with no support.

Can I use my RStudio Server Pro license on Azure Databricks?

Yes. If you already have a Pro or Enterprise license for RStudio Server, you can use that license on Azure Databricks. See Get started with RStudio Server Pro to learn how to set up RStudio Server Pro on Azure Databricks.

Where does RStudio Server run? Do I need to manage any additional services/servers?

As the diagram in RStudio integration architecture shows, the RStudio Server daemon runs on the driver (master) node of your Azure Databricks cluster. With RStudio Server Open Source, you do not need to run any additional servers or services. However, for RStudio Server Pro, you must manage a separate instance that runs RStudio License Server.

Can I use RStudio Server on a standard cluster?

Yes, you can. Originally, you were required to use a high concurrency cluster, but that limitation is no longer in place.

How should I persist my work on RStudio?

We strongly recommend that you persist your work using a version control system from RStudio. RStudio supports various version control systems and lets you check in and manage your projects.

You can also save your files (code or data) on the Databricks File System (DBFS). For example, files that you save under /dbfs/ are not deleted when your cluster is terminated or restarted.

Important

If you do not persist your code through version control or DBFS, you risk losing your work if an admin restarts or terminates the cluster.

Another method is to export the R notebook as Rmarkdown to save it to your local file system, then import the file into the RStudio instance.

The blog post Sharing R Notebooks using RMarkdown describes the steps in more detail.
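
For the DBFS option described above, a project directory on the driver can be copied under /dbfs/ from a Python notebook cell attached to the same cluster. This is a rough sketch, not a recommended backup procedure: the function name and the default destination /dbfs/rstudio-backups are hypothetical, and it copies file contents only, without permissions or timestamps.

```python
import shutil
from pathlib import Path


def backup_project(project_dir: str, dbfs_root: str = "/dbfs/rstudio-backups") -> Path:
    """Copy every file under project_dir to dbfs_root/<project name>."""
    source = Path(project_dir)
    destination = Path(dbfs_root) / source.name
    destination.mkdir(parents=True, exist_ok=True)

    for path in source.rglob("*"):
        target = destination / path.relative_to(source)
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(path, target)  # contents only, no metadata

    return destination
```

Running it again overwrites files that already exist at the destination and does not remove files deleted from the source, so it is not a substitute for version control.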

How do I start a SparkR session?

SparkR is contained in Databricks Runtime, but you must load it into RStudio. Run the following code inside RStudio to initialize a SparkR session.

library(SparkR)
sparkR.session()

If there is an error importing the SparkR package, run .libPaths() and verify that /home/ubuntu/databricks/spark/R/lib is included in the result.

If it is not included, check the content of /usr/lib/R/etc/Rprofile.site. List /home/ubuntu/databricks/spark/R/lib/SparkR on the driver to verify that the SparkR package is installed.
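
The last check can also be run from a Python notebook cell on the same cluster. The sketch below relies on the convention that an installed R package directory contains a DESCRIPTION file at its top level. The function name is hypothetical, and the default path is the one quoted above.

```python
from pathlib import Path

SPARK_R_LIB = "/home/ubuntu/databricks/spark/R/lib"


def sparkr_installed(lib_dir: str = SPARK_R_LIB) -> bool:
    """Return True if lib_dir contains what looks like an installed SparkR."""
    # Installed R packages carry a DESCRIPTION file at their top level.
    return (Path(lib_dir) / "SparkR" / "DESCRIPTION").is_file()
```

A False result suggests that the package directory is missing or incomplete, in which case adjusting .libPaths() alone is unlikely to help.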

How do I start a sparklyr session?

The sparklyr package must be installed on the cluster. Use one of the following methods to install the sparklyr package:

  • As an Azure Databricks library
  • install.packages() command
  • RStudio package management UI

SparkR is contained in Databricks Runtime, but you must load it into RStudio. Run the following code inside RStudio to initialize a sparklyr session.

SparkR::sparkR.session()
library(sparklyr)
sc <- spark_connect(method = "databricks")

If the sparklyr commands fail, confirm that SparkR::sparkR.session() succeeded.

How does RStudio integrate with Azure Databricks R notebooks?

You can move your work between notebooks and RStudio through version control.

What is the working directory?

When you start a project in RStudio, you choose a working directory. By default this is the home directory on the driver (master) container where RStudio Server is running. You can change this directory if you want.

Can I launch Shiny Apps from RStudio running on Azure Databricks?

Unfortunately, Shiny apps and RStudio Connect integration are not yet supported on Azure Databricks.

I can't use terminal/git inside RStudio on Azure Databricks. How can I fix that?

Make sure that you have disabled websockets. In RStudio Server Open Source, you can do this from the UI.

RStudio Session

In RStudio Server Pro, you can add allow-terminal-websockets=0 to /etc/rstudio/rsession.conf to disable websockets for all users.
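
Appending the same option to rsession.conf more than once leaves duplicate lines in the file. The sketch below shows one way to set a key=value option exactly once in the file's text. The function name is hypothetical, and it operates on a string so that you decide how the file is read and written.

```python
def set_conf_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key=value` present exactly once, as the last line."""
    kept = [
        line
        for line in conf_text.splitlines()
        # Drop any earlier setting of the same key; keep everything else.
        if line.split("=", 1)[0].strip() != key
    ]
    kept.append(f"{key}={value}")
    return "\n".join(kept) + "\n"
```

For example, passing the current contents of /etc/rstudio/rsession.conf with key "allow-terminal-websockets" and value "0" returns text that can be written back to the file before RStudio Server is restarted.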

I don't see the Apps tab under cluster details.

This feature is not available to all customers. You must be on the Azure Databricks Premium Plan.