Notebook-scoped R libraries enable you to create and modify custom R environments that are specific to a notebook session. When you install an R notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
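Because notebook-scoped libraries must be reinstalled each session, a common pattern is to guard the install at the top of the notebook so reruns are cheap. A minimal sketch (caesar is just an example package):

```r
# Install only if the package is not already available in this session.
# "caesar" is an example package name, not a requirement.
if (!requireNamespace("caesar", quietly = TRUE)) {
  install.packages("caesar", repos = "http://cran.us.r-project.org")
}
library(caesar)
```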
Notebook-scoped libraries are automatically available on workers for SparkR UDFs.
To install libraries for all notebooks attached to a cluster, use cluster-installed libraries. See Cluster libraries.
You can use any familiar method of installing packages in R, such as install.packages(), the devtools APIs, or Bioconductor.
R packages are accessible to worker nodes as well as the driver node.
```r
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)
```
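Bioconductor packages can be installed notebook-scoped as well, through the usual BiocManager workflow. A sketch (the package name is only an example):

```r
# Bioconductor packages install through BiocManager rather than CRAN directly.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# "Biostrings" is an illustrative Bioconductor package; substitute your own.
BiocManager::install("Biostrings")
```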
Databricks recommends using a CRAN snapshot as the repository to guarantee reproducible results.
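For example, you could point the session at a dated snapshot so reruns resolve identical package versions. The snapshot service and date below are illustrative assumptions, not a Databricks requirement:

```r
# Pin the repository to a dated snapshot for reproducible installs.
# The URL and date are examples; substitute the snapshot you want to pin.
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2023-06-01"))
install.packages("caesar")
```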
```r
devtools::install_github("klutometis/roxygen")
```
To remove a notebook-scoped library from a notebook, use the remove.packages() command.
```r
remove.packages("caesar")
```
- Notebook-scoped R libraries and SparkR
- Notebook-scoped R libraries and sparklyr
- Library isolation and hosted RStudio
Notebook-scoped libraries are available on SparkR workers; just load the library inside the UDF to use it. For example, you can run the following to generate a caesar-encrypted message with a SparkR UDF:
```r
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)

library(SparkR)
sparkR.session()
hello <- function(x) {
  library(caesar)
  caesar("hello world")
}
spark.lapply(c(1, 2), hello)
```
By default, in sparklyr::spark_apply(), the packages argument is set to TRUE. This copies libraries in the current libPaths to the workers, allowing you to import and use them on workers. For example, you can run the following to generate a caesar-encrypted message with sparklyr::spark_apply():
```r
require(devtools)
install_version(
  package = "caesar",
  repos = "http://cran.us.r-project.org"
)

library(sparklyr)
sc <- spark_connect(method = "databricks")
apply_caes <- function(x) {
  library(caesar)
  caesar("hello world")
}
sdf_len(sc, 5) %>%
  spark_apply(apply_caes)
```
If you do not want libraries to be available on workers, set packages to FALSE.
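A driver-only call might then look like the following sketch, which mirrors the example above but relies only on base R inside the UDF:

```r
library(sparklyr)
sc <- spark_connect(method = "databricks")

# With packages = FALSE, local libraries are not copied to the workers,
# so the UDF must use only base R or cluster-installed libraries.
sdf_len(sc, 3) %>%
  spark_apply(function(x) toupper("hello world"), packages = FALSE)
```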
RStudio creates a separate library path for each user; therefore users are isolated from each other. However, the library path is not available on workers. If you want to use a package inside SparkR workers in a job launched from RStudio, you need to install it using cluster libraries.
Alternatively, if you use sparklyr UDFs, packages installed inside RStudio are available to workers when using spark_apply(..., packages = TRUE).
Explicitly set the installation directory to /databricks/spark/R/lib. For example, with install.packages(), run install.packages("pckg", lib="/databricks/spark/R/lib").
Packages installed in /databricks/spark/R/lib are shared across all notebooks on the cluster, but they are not accessible to SparkR workers. To share libraries across notebooks and also workers, use cluster libraries.
There is no caching implemented for notebook-scoped libraries on a cluster. If you install a package in a notebook, and another user installs the same package in another notebook on the same cluster, the package is downloaded, compiled, and installed again.