Databricks Utilities

Databricks Utilities (DBUtils) make it easy to perform powerful combinations of tasks. You can use the utilities to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. DBUtils is not supported outside of notebooks.

All dbutils utilities are available in Python, R, and Scala notebooks, with one exception: the file system utilities are not available in R notebooks. However, you can use a language magic command (such as %python) to invoke those dbutils methods in R and SQL notebooks. For example, to list the Azure Databricks datasets DBFS folder in an R or SQL notebook, run the command:

dbutils.fs.ls("/databricks-datasets")

Alternatively, you can use %fs:

%fs ls /databricks-datasets

File system utilities

The file system utilities access Databricks File System (DBFS), making it easier to use Azure Databricks as a file system. Learn more by running:

dbutils.fs.help()
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point

dbutils.fs.ls command

The sequence returned by the ls command contains the following attributes:

Attribute  Type        Description
path       string      The path of the file or directory.
name       string      The name of the file or directory.
isDir()    boolean     True if the path is a directory.
size       long/int64  The length of the file in bytes, or zero if the path is a directory.

Note

You can get detailed information about each command by using help, for example: dbutils.fs.help("ls")
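
As a minimal sketch of working with these attributes, the following Python snippet counts files and directories and totals the file sizes in a listing. The FileInfo class here is a local stand-in written for this example (an assumption, so the snippet runs outside a notebook); the summarize helper and the sample paths are likewise hypothetical.

```python
from collections import namedtuple


class FileInfo(namedtuple("FileInfo", ["path", "name", "size"])):
    """Local stand-in (an assumption) for the entries returned by dbutils.fs.ls."""

    def isDir(self):
        # Assumption for this sketch: directory names end with a slash.
        return self.name.endswith("/")


def summarize(entries):
    """Count files and directories and total the file sizes in bytes."""
    files = [e for e in entries if not e.isDir()]
    return {
        "files": len(files),
        "dirs": len(entries) - len(files),
        "bytes": sum(e.size for e in files),
    }


# In a notebook, you would pass dbutils.fs.ls("/databricks-datasets") instead.
sample = [
    FileInfo("dbfs:/data/README.md", "README.md", 976),
    FileInfo("dbfs:/data/airlines/", "airlines/", 0),
    FileInfo("dbfs:/data/part-0.csv", "part-0.csv", 2048),
]
print(summarize(sample))
```

Because directories report a size of zero, the sketch filters on isDir() rather than on size, so that empty files are still counted as files.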

Notebook workflow utilities

Notebook workflows allow you to chain together notebooks and act on their results. See Notebook workflows. Learn more by running:

dbutils.notebook.help()
exit(value: String): void -> This method lets you exit a notebook with a value
run(path: String, timeoutSeconds: int, arguments: Map): String -> This method runs a notebook and returns its exit value.

Note

The maximum length of the string value returned from run is 5 MB. See Runs get output.

Note

You can get detailed information about each command by using help, for example: dbutils.notebook.help("exit")
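
Because exit accepts a string and run returns one, structured results have to be serialized. One possible approach, sketched below, is to pass JSON. The helper names pack_result and unpack_result are hypothetical, and treating the 5 MB limit as 5 * 1024 * 1024 UTF-8 bytes is an assumption made for this example.

```python
import json

# Assumption: the 5 MB limit is treated here as 5 * 1024 * 1024 UTF-8 bytes.
MAX_RESULT_BYTES = 5 * 1024 * 1024


def pack_result(result):
    """Serialize a dict to a JSON string suitable for dbutils.notebook.exit."""
    payload = json.dumps(result)
    if len(payload.encode("utf-8")) > MAX_RESULT_BYTES:
        raise ValueError("result is too large to return from a notebook")
    return payload


def unpack_result(payload):
    """Parse the string returned by dbutils.notebook.run back into a dict."""
    return json.loads(payload)


# Child notebook (hypothetical usage):
#   dbutils.notebook.exit(pack_result({"status": "ok", "rows": 42}))
# Parent notebook (hypothetical usage):
#   result = unpack_result(dbutils.notebook.run("child", 60, {"date": "2020-01-01"}))
print(unpack_result(pack_result({"status": "ok", "rows": 42})))
```

For results that would exceed the limit, a common alternative is to write the data to storage and return only its path.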

Widget utilities

Widgets allow you to parameterize notebooks. See Widgets. Learn more by running:

dbutils.widgets.help()
combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input widget with a given name, default value and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input widget a with given name, default value and choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect input widget with a given name, default value and choices
remove(name: String): void -> Removes an input widget from the notebook
removeAll: void -> Removes all widgets in the notebook
text(name: String, defaultValue: String, label: String): void -> Creates a text input widget with a given name and default value

Note

You can get detailed information about each command by using help, for example: dbutils.widgets.help("combobox")
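
The following sketch shows one way a notebook might declare its parameters with text and dropdown and then read them back with get, using the argument order from the help output above. RecordingWidgets is a local stand-in written for this example (an assumption, so the snippet runs outside a notebook), and the widget names, defaults, and define_parameters helper are hypothetical.

```python
class RecordingWidgets:
    """Local stand-in (an assumption) that stores values instead of drawing widgets."""

    def __init__(self):
        self.values = {}

    def text(self, name, defaultValue, label):
        self.values[name] = defaultValue

    def dropdown(self, name, defaultValue, choices, label):
        self.values[name] = defaultValue

    def get(self, name):
        return self.values[name]


def define_parameters(widgets):
    """Create the notebook's input widgets and return their current values."""
    widgets.text("table", "events", "Table name")
    widgets.dropdown("env", "dev", ["dev", "staging", "prod"], "Environment")
    return {name: widgets.get(name) for name in ("table", "env")}


# In a notebook, you would call define_parameters(dbutils.widgets) instead.
print(define_parameters(RecordingWidgets()))
```

Passing the widgets object in as an argument keeps the parameter definitions testable without a cluster.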

Secrets utilities

Secrets allow you to store and access sensitive credential information without making it visible in notebooks. See Secret management and Use the secrets in a notebook. Learn more by running:

dbutils.secrets.help()
get(scope: String, key: String): String -> Gets the string representation of a secret value with scope and key
getBytes(scope: String, key: String): byte[] -> Gets the bytes representation of a secret value with scope and key
list(scope: String): Seq -> Lists secret metadata for secrets within a scope
listScopes: Seq -> Lists secret scopes

Note

You can get detailed information about each command by using help, for example: dbutils.secrets.help("get")
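
As a rough illustration of get(scope, key), the sketch below collects several secrets from one scope into a dictionary. FakeSecrets is a local stand-in written for this example (an assumption, so the snippet runs outside a notebook); the scope name, key names, and load_credentials helper are hypothetical.

```python
class FakeSecrets:
    """Local stand-in (an assumption) exposing the same get(scope, key) shape."""

    def __init__(self, store):
        self._store = store

    def get(self, scope, key):
        return self._store[(scope, key)]


def load_credentials(secrets, scope, keys):
    """Fetch each named secret from one scope and return them keyed by name."""
    return {key: secrets.get(scope, key) for key in keys}


# In a notebook, you would pass dbutils.secrets instead of the stand-in.
fake = FakeSecrets({("jdbc", "username"): "svc_user", ("jdbc", "password"): "s3cr3t"})
creds = load_credentials(fake, "jdbc", ["username", "password"])
print(sorted(creds))  # print only the key names, never the secret values
```

The example prints only the key names, in keeping with the goal of not exposing credential values in notebook output.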

Library utilities

Library utilities allow you to install Python libraries and create an environment scoped to a notebook session. The libraries are available both on the driver and on the executors, so you can reference them in UDFs. This enables:

  • Library dependencies of a notebook to be organized within the notebook itself.
  • Notebook users with different library dependencies to share a cluster without interference.

Detaching a notebook destroys this environment. However, you can recreate it by re-running the library install API commands in the notebook. See the restartPython API for how you can reset your notebook state without losing your environment.

Note

Library utilities are not available on Databricks Runtime ML or Databricks Runtime for Genomics. Instead, see Notebook-scoped Python libraries.

For Databricks Runtime 7.1 or above, you can also use %pip magic commands to install notebook-scoped libraries. See Notebook-scoped Python libraries.

Library utilities are enabled by default. Therefore, by default, each notebook's Python environment is isolated: a separate Python executable is created when the notebook is attached to a cluster, and it inherits the cluster's default Python environment. Libraries installed through an init script into the Azure Databricks Python environment are still available. You can disable this feature by setting spark.databricks.libraryIsolation.enabled to false.

This API is compatible with the existing cluster-wide library installation through the UI and REST API. Libraries installed through this API have higher priority than cluster-wide libraries.

dbutils.library.help()
install(path: String): boolean -> Install the library within the current notebook session
installPyPI(pypiPackage: String, version: String = "", repo: String = "", extras: String = ""): boolean -> Install the PyPI library within the current notebook session
list: List -> List the isolated libraries added for the current notebook session via dbutils
restartPython: void -> Restart python process for the current notebook session

Note

You can get detailed information about each command by using help, for example: dbutils.library.help("install")

Examples

  • Install a PyPI library in a notebook. version, repo, and extras are optional. Use the extras argument to specify the Extras feature (extra requirements).

    dbutils.library.installPyPI("pypipackage", version="version", repo="repo", extras="extras")
    dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this function
    

    Important

    The version and extras keys cannot be part of the PyPI package string. For example, dbutils.library.installPyPI("azureml-sdk[databricks]==1.0.8") is not valid. Use the version and extras arguments to specify the version and extras information as follows:

    dbutils.library.installPyPI("azureml-sdk", version="1.0.8", extras="databricks")
    dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this function
    
  • Specify your library requirements in one notebook and install them through %run in another.

    • Define the libraries to install in a notebook called InstallDependencies.

      dbutils.library.installPyPI("torch")
      dbutils.library.installPyPI("scikit-learn", version="0.19.1")
      dbutils.library.installPyPI("azureml-sdk", extras="databricks")
      dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this function
      
    • Install them in the notebook that needs those dependencies.

      %run /path/to/InstallDependencies    # Install the dependencies in first cell
      
      import torch
      from sklearn.linear_model import LinearRegression
      import azureml
      # do the actual work
      
  • List the libraries installed in a notebook.

    dbutils.library.list()
    
  • Reset the Python notebook state while maintaining the environment. This API is available only in Python notebooks. This can be used to:

    • Reload a library that Azure Databricks preinstalls, using a different version. For example:

      dbutils.library.installPyPI("numpy", version="1.15.4")
      dbutils.library.restartPython()
      
      # Make sure you start using the library in another cell.
      import numpy
      
    • Install libraries like tensorflow that need to be loaded on process start up. For example:

      dbutils.library.installPyPI("tensorflow")
      dbutils.library.restartPython()
      
      # Use the library in another cell.
      import tensorflow
      
  • Install a .egg or .whl library in a notebook.

    Important

    We recommend that you put all your library install commands in the first cell of your notebook and call restartPython at the end of that cell. The Python notebook state is reset after running restartPython; the notebook loses all state, including but not limited to local variables, imported libraries, and other ephemeral states. Installing libraries and resetting the notebook state in the first cell avoids losing work done in later cells.

    The accepted library sources are dbfs, abfss, adl, and wasbs.

    dbutils.library.install("abfss:/path/to/your/library.egg")
    dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this function
    
    dbutils.library.install("abfss:/path/to/your/library.whl")
    dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this function
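
The rule above that version and extras cannot be part of the PyPI package string means that a pip-style requirement has to be split before it is passed to installPyPI. The following is a minimal sketch of such a split; the split_requirement helper is hypothetical, and it deliberately handles only the simple name[extras]==version form, rejecting anything else.

```python
import re

# Matches "package", "package[extras]", "package==version",
# and "package[extras]==version". Other specifiers are rejected.
_SPEC = re.compile(
    r"^(?P<package>[A-Za-z0-9._-]+)"   # package name
    r"(?:\[(?P<extras>[^\]]+)\])?"     # optional [extras]
    r"(?:==(?P<version>[^=\s]+))?$"    # optional ==version
)


def split_requirement(spec):
    """Split a pip-style requirement into installPyPI-style arguments."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"unsupported requirement: {spec!r}")
    return {
        "pypiPackage": match.group("package"),
        "version": match.group("version") or "",
        "extras": match.group("extras") or "",
    }


args = split_requirement("azureml-sdk[databricks]==1.0.8")
print(args)
# In a notebook (hypothetical usage):
#   dbutils.library.installPyPI(args["pypiPackage"], version=args["version"], extras=args["extras"])
```

Empty strings are used for missing parts because they match the defaults shown for version and extras in the help output.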
    

Databricks Utilities API library

Note

You cannot use this library to run pre-deployment tests if you are deploying to a cluster running Databricks Runtime 7.0 or above, because there is no version of the library that supports Scala 2.12.

To accelerate application development, it can be helpful to compile, build, and test applications before you deploy them as production jobs. To enable you to compile against Databricks Utilities, Databricks provides the dbutils-api library. You can download the dbutils-api library or include the library by adding a dependency to your build file:

  • SBT

    libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.4"
    
  • Maven

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>dbutils-api_2.11</artifactId>
        <version>0.0.4</version>
    </dependency>
    
  • Gradle

    compile 'com.databricks:dbutils-api_2.11:0.0.4'
    

Once you build your application against this library, you can deploy the application.

Important

The dbutils-api library allows you to locally compile an application that uses dbutils, but not to run it. To run the application, you must deploy it in Azure Databricks.

Express the artifact's Scala version with %%

If you express the artifact version as groupID %% artifactID % revision instead of groupID % artifactID % revision (the difference is the double %% after the groupID), SBT adds your project's Scala version to the artifact name.

Example

Suppose the scalaVersion for your build is 2.9.1. You could write the artifact version with % as follows:

val appDependencies = Seq(
  "org.scala-tools" % "scala-stm_2.9.1" % "0.3"
)

The following declaration, which uses %%, is equivalent:

val appDependencies = Seq(
  "org.scala-tools" %% "scala-stm" % "0.3"
)

Example projects

The example archive contains minimal example projects that show you how to compile against the dbutils-api library with three common build tools:

  • sbt: sbt package
  • Maven: mvn install
  • Gradle: gradle build

These commands create output JARs in the following locations:

  • sbt: target/scala-2.11/dbutils-api-example_2.11-0.0.1-SNAPSHOT.jar
  • Maven: target/dbutils-api-example-0.0.1-SNAPSHOT.jar
  • Gradle: build/examples/dbutils-api-example-0.0.1-SNAPSHOT.jar

You can attach this JAR to your cluster as a library, restart the cluster, and then run:

example.Test()

This statement creates a text input widget with the label Hello: and the initial value World.

You can use all the other dbutils APIs the same way.

To test an application that uses the dbutils object outside Databricks, you can mock up the dbutils object by calling:

com.databricks.dbutils_v1.DBUtilsHolder.dbutils0.set(
  new com.databricks.dbutils_v1.DBUtilsV1{
    ...
  }
)

Substitute your own DBUtilsV1 instance, in which you implement the interface methods however you like, for example by providing a local file system mockup for dbutils.fs.