Shiny on Azure Databricks

To build interactive R applications and dashboards, you can use Shiny, an open-source R package available on CRAN, in RStudio Server hosted on Azure Databricks clusters.

For many interactive examples from the Shiny user guide, see the Shiny tutorials.

This article describes how to run Shiny applications on Azure Databricks and how to use Apache Spark inside Shiny applications.

Requirements

Get Started with Shiny

  1. Open RStudio on Azure Databricks.

  2. In RStudio, import the Shiny package and run the example app 01_hello as follows:

    > library(shiny)
    > runExample("01_hello")
    
    Listening on http://127.0.0.1:3203
    

    A new window will pop up, displaying the Shiny application.

    First Shiny app

Run a Shiny app from an R script

To run a Shiny app from an R script, open the R script in the RStudio editor and click the Run App button on the top right.

Shiny run App
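A Shiny app script is a self-contained file that defines the UI and server and ends with a call to shinyApp. The following minimal sketch shows the kind of script the Run App button can launch; the file name app.R and the UI contents are illustrative, not taken from this article's examples:

# Minimal illustrative app script (save as, for example, app.R),
# then open it in the RStudio editor and click Run App.
library(shiny)

ui <- fluidPage(
  numericInput("n", "Number of observations:", value = 100, min = 1),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # Draw a histogram of n random normal values
    hist(rnorm(input$n), main = "Random normal sample")
  })
}

shinyApp(ui = ui, server = server)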

Use Apache Spark inside Shiny apps

You can use Apache Spark when developing Shiny applications on Azure Databricks. You can interact with Spark using both SparkR and sparklyr. You need at least one worker node to launch Spark tasks.

The following example uses SparkR to launch Spark jobs. The example uses the ggplot2 diamonds dataset to plot the price of diamonds by carat. The carat range can be changed using the slider at the top of the application, and the range of the plot's x-axis changes accordingly.

library(SparkR)
library(sparklyr)
library(dplyr)
library(ggplot2)
# Start a SparkR session and connect sparklyr to the Databricks cluster
sparkR.session()
sc <- spark_connect(method = "databricks")

# Read the diamonds dataset from DBFS into a Spark DataFrame
diamonds_tbl <- spark_read_csv(sc, path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

# Define the UI
ui <- fluidPage(
  sliderInput("carat", "Select Carat Range:",
              min = 0, max = 5, value = c(0, 5), step = 0.01),
  plotOutput('plot')
)

# Define the server code
server <- function(input, output) {
  output$plot <- renderPlot({
    # Select diamonds in carat range
    df <- diamonds_tbl %>%
      dplyr::select("carat", "price") %>%
      dplyr::filter(carat >= !!input$carat[[1]], carat <= !!input$carat[[2]])

    # Scatter plot with smoothed means
    ggplot(df, aes(carat, price)) +
      geom_point(alpha = 1/2) +
      geom_smooth() +
      scale_size_area(max_size = 2) +
      ggtitle("Price vs. Carat")
  })
}

# Return a Shiny app object

shinyApp(ui = ui, server = server)

Spark Shiny app

Frequently asked questions (FAQ)

Can I use Shiny on Databricks Runtime 6.1 and below?

Yes. To use Shiny on Databricks Runtime 6.1 and below, install the Shiny package as an Azure Databricks library on the cluster. Installing it with install.packages('shiny') in the RStudio console or with the RStudio package manager may not work.

Why is my Shiny app "greyed out" after some time?

If there is no interaction with the Shiny app, the connection to the app will close after about 4 minutes.

To reconnect, refresh the Shiny app page. The state of the dashboard will be reset.

Why does my Shiny viewer window disappear after a while?

If the Shiny viewer window disappears after idling for several minutes, it is due to the same timeout as the "greyed out" scenario.

My app crashes immediately after launching, but the code appears to be correct. What's going on?

There is a 20 MB limit on the total amount of data that can be displayed in a Shiny app on Azure Databricks. If the application's total data size exceeds this limit, it crashes immediately after launching. To avoid this, Databricks recommends reducing the data size, for example by downsampling the displayed data or reducing the resolution of images.
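For illustration, one way to stay under this limit is to downsample a large Spark table before collecting it into the app. The sketch below uses sparklyr's sdf_sample; the sampling fraction and dataset are assumptions chosen only to show the pattern:

# Illustrative sketch: downsample a large Spark table before rendering it,
# so only a small fraction of the rows is collected into the Shiny app.
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
diamonds_tbl <- spark_read_csv(sc, path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

# Keep roughly 1% of the rows (fraction chosen for illustration only)
diamonds_sample <- diamonds_tbl %>%
  sdf_sample(fraction = 0.01, replacement = FALSE, seed = 42) %>%
  collect()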

Why do long Spark jobs never return?

This is also because of the idle timeout. Any Spark job that runs for longer than the previously mentioned timeout cannot render its result, because the connection closes before the job returns.

How can I avoid the timeout?

  • There is a workaround suggested in this issue thread. The workaround sends heartbeats to keep the WebSocket alive when the app is idle. However, if the app is blocked by a long-running computation, this workaround does not work.

  • Shiny does not support long-running tasks. A Shiny blog post recommends using promises and futures to run long tasks asynchronously and keep the app unblocked. Here is an example that uses heartbeats to keep the Shiny app alive and runs a long-running Spark job in a future construct.

    # Write an app that uses spark to access data on Databricks
    # First, install the following packages:
    install.packages('future')
    install.packages('promises')
    
    library(shiny)
    library(promises)
    library(future)
    plan(multisession)
    
    HEARTBEAT_INTERVAL_MILLIS = 1000  # 1 second
    
    # Define the long Spark job here
    run_spark <- function(x) {
      # Environment setting
      library("SparkR", lib.loc = "/databricks/spark/R/lib")
      sparkR.session()
    
      irisDF <- createDataFrame(iris)
      collect(irisDF)
      Sys.sleep(3)
      x + 1
    }
    
    run_spark_sparklyr <- function(x) {
      # Environment setting
      library(sparklyr)
      library(dplyr)
      library("SparkR", lib.loc = "/databricks/spark/R/lib")
      sparkR.session()
      sc <- spark_connect(method = "databricks")
    
      iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
      collect(iris_tbl)
      x + 1
    }
    
    ui <- fluidPage(
      sidebarLayout(
        # Display heartbeat
        sidebarPanel(textOutput("keep_alive")),
    
        # Display the Input and Output of the Spark job
        mainPanel(
          numericInput('num', label = 'Input', value = 1),
          actionButton('submit', 'Submit'),
          textOutput('value')
        )
      )
    )
    server <- function(input, output) {
      #### Heartbeat ####
      # Define reactive variable
      cnt <- reactiveVal(0)
      # Define time dependent trigger
      autoInvalidate <- reactiveTimer(HEARTBEAT_INTERVAL_MILLIS)
      # Time dependent change of variable
      observeEvent(autoInvalidate(), {  cnt(cnt() + 1)  })
      # Render print
      output$keep_alive <- renderPrint(cnt())
    
      #### Spark job ####
      result <- reactiveVal() # the result of the spark job
      busy <- reactiveVal(0)  # whether the spark job is running
      # Launch a spark job in a future when actionButton is clicked
      observeEvent(input$submit, {
        if (busy() != 0) {
          showNotification("Already running Spark job...")
          return(NULL)
        }
        showNotification("Launching a new Spark job...")
        # input$num must be read outside the future
        input_x <- input$num
        fut <- future({ run_spark(input_x) }) %...>% result()
        # Or: fut <- future({ run_spark_sparklyr(input_x) }) %...>% result()
        busy(1)
        # Catch exceptions and notify the user
        fut <- catch(fut, function(e) {
          result(NULL)
          cat(e$message)
          showNotification(e$message)
        })
        fut <- finally(fut, function() { busy(0) })
        # Return something other than the promise so shiny remains responsive
        NULL
      })
      # When the spark job returns, render the value
      output$value <- renderPrint(result())
    }
    shinyApp(ui = ui, server = server)
    

How many connections can be accepted for one Shiny app link during development?

Databricks recommends up to 20.

Can I use a different version of the Shiny package than the one installed in Databricks Runtime?

Yes. See Fix the Version of R Packages.
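As an illustrative sketch only, a specific Shiny version can be pinned from an R session with the remotes package; the version number below is arbitrary, and this may not match the approach described in Fix the Version of R Packages:

# Hypothetical example: pin a specific Shiny version with the remotes package.
# The version number is for illustration only.
install.packages("remotes")
remotes::install_version("shiny", version = "1.6.0", repos = "https://cran.r-project.org")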

How can I develop a Shiny application that can be published to a Shiny server and access data on Azure Databricks?

While you can access data naturally using SparkR or sparklyr during development and testing on Azure Databricks, after a Shiny application is published to a stand-alone hosting service, it cannot directly access the data and tables on Azure Databricks.

To enable your application to function outside Azure Databricks, you must rewrite how you access data. There are a few options.

Databricks recommends that you work with your Azure Databricks solutions team to find the best approach for your existing data and analytics architecture.

How can I save the Shiny applications that I develop on Azure Databricks?

You can either save your application code on DBFS through the FUSE mount or check your code into version control.
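For example, from an R session on the cluster you can copy the app source to DBFS through the FUSE mount; the source and destination paths below are placeholders, not prescribed locations:

# Illustrative sketch: copy the app source to DBFS through the /dbfs FUSE mount.
# The source and destination paths are placeholders.
dir.create("/dbfs/home/my_user/shiny_apps", recursive = TRUE, showWarnings = FALSE)
file.copy("app.R", "/dbfs/home/my_user/shiny_apps/app.R", overwrite = TRUE)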

Can I develop a Shiny application inside an Azure Databricks notebook?

You cannot develop a Shiny application inside an Azure Databricks notebook.