renv on Azure Databricks
renv is an R package that lets users manage R dependencies specific to the notebook.
Using renv, you can create and manage the R library environment for your project, save the state of these libraries to a lockfile, and later restore libraries as required. Together, these tools can help make projects more isolated, portable, and reproducible.
Basic renv workflow
In this section:
- Install renv
- Initialize renv session with pre-installed R libraries
- Use renv to install additional packages
- Use renv to save your R notebook environment to DBFS
- Reinstall a renv environment given a lockfile from DBFS
Install renv
You can install renv as a cluster-scoped library or as a notebook-scoped library. To install renv as a notebook-scoped library, use:
# Load devtools, which provides install_version().
require(devtools)
# Install renv from the given CRAN mirror.
install_version(
  package = "renv",
  repos = "http://cran.us.r-project.org"
)
Databricks recommends using a CRAN snapshot as the repository to fix the package version.
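As a minimal sketch of that recommendation, the helper below pins both the renv version and the repository. The function name `install_pinned_renv`, the version `0.17.3`, and the dated snapshot URL are illustrative assumptions, not values from this article; substitute the version and snapshot your project needs.

```r
# Hypothetical helper: installs a fixed renv version from a dated CRAN snapshot.
# The default version and snapshot URL are placeholders chosen for illustration.
install_pinned_renv <- function(
    version = "0.17.3",
    repo = "https://packagemanager.posit.co/cran/2023-06-01") {
  devtools::install_version(
    package = "renv",
    version = version,  # exact version to install
    repos = repo        # dated snapshot, so later runs resolve the same packages
  )
}
```

Calling `install_pinned_renv()` in a notebook cell performs the installation. Afterwards, `packageVersion("renv")` reports which version was installed.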
Initialize renv session with pre-installed R libraries
The first step when using renv is to initialize a session using renv::init(). Set libPaths to change the default download location to be your R notebook-scoped library path.
renv::init(settings = list(external.libraries=.libPaths()))
.libPaths(c(.libPaths()[2], .libPaths()))
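The reordering in the second command can be illustrated on a plain character vector. This is only a sketch: the three paths are made-up placeholders, `promote_second` is a hypothetical helper, and it assumes that the notebook-scoped library is the second entry after renv::init(), as the command above implies.

```r
# Hypothetical helper mirroring c(.libPaths()[2], .libPaths()):
# move the second entry to the front and drop the resulting duplicate.
promote_second <- function(paths) {
  unique(c(paths[2], paths))
}

# Placeholder paths for illustration only.
example_paths <- c("/renv/library", "/notebook/library", "/system/library")

reordered <- promote_second(example_paths)
print(reordered)
```

Because R installs into the first entry of the library path by default, moving the notebook-scoped library to the front makes it the default install location.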
Use renv to install additional packages
You can now use renv's API to install and remove R packages. For example, to install the latest version of digest, run the following inside of a notebook cell.
renv::install("digest")
To install an old version of digest, run the following inside of a notebook cell.
renv::install("digest@0.6.18")
To install digest from GitHub, run the following inside of a notebook cell.
renv::install("eddelbuettel/digest")
To install a package from Bioconductor, run the following inside of a notebook cell.
# (note: requires the BiocManager package)
renv::install("bioc::Biobase")
Note that the renv::install API uses the renv Cache.
Use renv to save your R notebook environment to DBFS
Run the following command once before saving the environment.
renv::settings$snapshot.type("all")
This sets renv to snapshot all packages that are installed into libPaths, not just the ones that are currently used in the notebook. See renv documentation for more information.
Now you can run the following inside of a notebook cell to save the current state of your environment.
renv::snapshot(lockfile="/dbfs/PATH/TO/WHERE/YOU/WANT/TO/SAVE/renv.lock", force=TRUE)
This updates the lockfile by capturing all packages installed on libPaths. It also moves your lockfile from the local filesystem to DBFS, where it persists even if your cluster terminates or restarts.
Reinstall a renv environment given a lockfile from DBFS
First, make sure that your new cluster is running the same Databricks Runtime version as the one on which you first created the renv environment. This ensures that the pre-installed R packages are identical. You can find a list of these in each runtime's release notes. After you install renv, run the following inside of a notebook cell.
renv::init(settings = list(external.libraries=.libPaths()))
.libPaths(c(.libPaths()[2], .libPaths()))
renv::restore(lockfile="/dbfs/PATH/TO/WHERE/YOU/SAVED/renv.lock", exclude=c("Rserve", "SparkR"))
This copies your lockfile from DBFS into the local file system and then restores any packages specified in the lockfile.
Note
To avoid missing repository errors, exclude the Rserve and SparkR packages from package restoration. Both of these packages are pre-installed in all runtimes.
renv Cache
A useful feature of renv is its global package cache, which is shared across all renv projects on the cluster. It speeds up installation times and saves disk space. The renv cache does not cache packages downloaded via the devtools API or via install.packages() when called with any arguments other than pkgs.
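To see where the cache lives on a given cluster, renv exposes the cache path through renv::paths$cache(). The sketch below is illustrative only: `renv_cache_summary` is a hypothetical helper, and the exact directory layout under the cache path may differ between renv versions.

```r
# Hypothetical helper: reports the renv cache location and how many
# top-level entries it currently holds. Returns NULL if renv is not installed.
renv_cache_summary <- function() {
  if (!requireNamespace("renv", quietly = TRUE)) {
    return(NULL)
  }
  cache_dir <- renv::paths$cache()
  list(
    path = cache_dir,
    entries = length(list.files(cache_dir))
  )
}

print(renv_cache_summary())
```

Running this before and after a renv::install() call is one way to observe whether an installation populated the cache.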