Databricks Utilities
Databricks Utilities (DBUtils) make it easy to perform powerful combinations of tasks. You can use the utilities to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. DBUtils are not supported outside of notebooks.
All dbutils utilities are available in Python, R, and Scala notebooks, with one exception: file system utilities are not available in R notebooks. However, you can use a language magic command (for example, %python) to invoke those dbutils methods in R and SQL notebooks. For example, to list the Azure Databricks datasets DBFS folder in an R or SQL notebook, run the command:
dbutils.fs.ls("/databricks-datasets")
Alternatively, you can use %fs:
%fs ls /databricks-datasets
File system utilities
The file system utilities access Databricks File System (DBFS), making it easier to use Azure Databricks as a file system. Learn more by running:
dbutils.fs.help()
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory
mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
dbutils.fs.ls command
The sequence returned by the ls command contains the following attributes:
Attribute | Type | Description |
---|---|---|
path | string | The path of the file or directory. |
name | string | The name of the file or directory. |
isDir() | boolean | True if the path is a directory. |
size | long/int64 | The length of the file in bytes, or zero if the path is a directory. |
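As a rough illustration of how these attributes can be combined, the sketch below counts files and directories and totals the file sizes in a listing. FakeFileInfo and the sample paths are hypothetical stand-ins so that the code runs outside a notebook; the sketch assumes only that each entry exposes path, name, size, and isDir() as described in the table.

```python
from dataclasses import dataclass


@dataclass
class FakeFileInfo:
    """Hypothetical stand-in for one entry returned by dbutils.fs.ls."""
    path: str
    name: str
    size: int
    directory: bool = False

    def isDir(self):
        return self.directory


def summarize(entries):
    """Return (file_count, dir_count, total_bytes) for a sequence of ls entries."""
    files = [e for e in entries if not e.isDir()]
    dirs = [e for e in entries if e.isDir()]
    return len(files), len(dirs), sum(e.size for e in files)


# Made-up sample data; the paths and sizes are illustrative only.
sample = [
    FakeFileInfo("dbfs:/tmp/demo/a.csv", "a.csv", 120),
    FakeFileInfo("dbfs:/tmp/demo/b.csv", "b.csv", 80),
    FakeFileInfo("dbfs:/tmp/demo/raw/", "raw/", 0, directory=True),
]
print(summarize(sample))  # (2, 1, 200)
```

In a Python notebook, the same function could presumably be called as summarize(dbutils.fs.ls("/databricks-datasets")).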
Note
You can get detailed information about each command by using help, for example: dbutils.fs.help("ls")
Notebook workflow utilities
Notebook workflows allow you to chain together notebooks and act on their results. See Notebook workflows. Learn more by running:
dbutils.notebook.help()
exit(value: String): void -> This method lets you exit a notebook with a value
run(path: String, timeoutSeconds: int, arguments: Map): String -> This method runs a notebook and returns its exit value.
Note
The maximum length of the string value returned from run is 5 MB. See Runs get output.
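Because exit accepts only a string and run returns only a string, structured results are commonly passed as JSON. The sketch below shows one possible way to serialize a result and check it against the 5 MB limit before exiting. The helper names are hypothetical, and reading "5 MB" as 5 * 1024 * 1024 bytes of UTF-8 is an assumption, not something the text above specifies.

```python
import json

# Assumption: the 5 MB limit is interpreted as 5 MiB of UTF-8 encoded text.
MAX_EXIT_BYTES = 5 * 1024 * 1024


def to_exit_value(result):
    """Serialize a result to a JSON string, or raise if it exceeds the limit."""
    payload = json.dumps(result)
    size = len(payload.encode("utf-8"))
    if size > MAX_EXIT_BYTES:
        raise ValueError(f"exit value is {size} bytes; limit is {MAX_EXIT_BYTES}")
    return payload


def from_exit_value(value):
    """Decode the string returned by a notebook run back into Python data."""
    return json.loads(value)


encoded = to_exit_value({"status": "ok", "rows": 3})
print(from_exit_value(encoded)["rows"])  # 3
```

In a notebook, the called notebook would end with dbutils.notebook.exit(to_exit_value(...)), and the caller would pass the string returned by dbutils.notebook.run to from_exit_value.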
Note
You can get detailed information about each command by using help, for example: dbutils.notebook.help("exit")
Widget utilities
Widgets allow you to parameterize notebooks. See Widgets. Learn more by running:
dbutils.widgets.help()
combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input widget with a given name, default value and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input widget a with given name, default value and choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect input widget with a given name, default value and choices
remove(name: String): void -> Removes an input widget from the notebook
removeAll: void -> Removes all widgets in the notebook
text(name: String, defaultValue: String, label: String): void -> Creates a text input widget with a given name and default value
Note
You can get detailed information about each command by using help, for example: dbutils.widgets.help("combobox")
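A common pattern is to declare one text widget per parameter and then read all values back into a dictionary. The sketch below illustrates that pattern with the text and get commands listed above. FakeWidgets is a hypothetical in-memory stand-in so the code runs outside a notebook; its overrides argument imitates values supplied by a caller and is not part of the real API.

```python
class FakeWidgets:
    """Hypothetical in-memory stand-in for dbutils.widgets (text and get only)."""

    def __init__(self, overrides=None):
        self._overrides = dict(overrides or {})
        self._values = {}

    def text(self, name, defaultValue, label):
        # A value supplied by the caller wins over the widget default.
        self._values[name] = self._overrides.get(name, defaultValue)

    def get(self, name):
        return self._values[name]


def read_params(widgets, defaults):
    """Create a text widget for each default, then return the current values."""
    for name, default in defaults.items():
        widgets.text(name, default, name)
    return {name: widgets.get(name) for name in defaults}


params = read_params(FakeWidgets({"env": "prod"}), {"env": "dev", "date": "2020-01-01"})
print(params)  # {'env': 'prod', 'date': '2020-01-01'}
```

In a notebook, read_params(dbutils.widgets, {...}) would be the corresponding call.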
Secrets utilities
Secrets allow you to store and access sensitive credential information without making it visible in notebooks. See Secret management and Use the secrets in a notebook. Learn more by running:
dbutils.secrets.help()
get(scope: String, key: String): String -> Gets the string representation of a secret value with scope and key
getBytes(scope: String, key: String): byte[] -> Gets the bytes representation of a secret value with scope and key
list(scope: String): Seq -> Lists secret metadata for secrets within a scope
listScopes: Seq -> Lists secret scopes
Note
You can get detailed information about each command by using help, for example: dbutils.secrets.help("get")
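The list and get commands can be combined to fail early when a required secret is missing from a scope. The following is a minimal sketch of that idea. FakeSecrets, SecretMetadata, the scope name, and the key names are hypothetical, and the sketch assumes that each metadata entry returned by list exposes a key attribute.

```python
from collections import namedtuple

# Hypothetical stand-in for the metadata entries returned by list.
SecretMetadata = namedtuple("SecretMetadata", ["key"])


class FakeSecrets:
    """Hypothetical in-memory stand-in for dbutils.secrets (list and get only)."""

    def __init__(self, store):
        self._store = store  # {scope: {key: value}}

    def list(self, scope):
        return [SecretMetadata(key) for key in self._store.get(scope, {})]

    def get(self, scope, key):
        return self._store[scope][key]


def load_secrets(secrets, scope, keys):
    """Return {key: value} for every requested key, or raise if any is missing."""
    available = {meta.key for meta in secrets.list(scope)}
    missing = sorted(set(keys) - available)
    if missing:
        raise KeyError(f"scope {scope!r} is missing secrets: {missing}")
    return {key: secrets.get(scope, key) for key in keys}


# Made-up scope and keys for illustration only.
fake = FakeSecrets({"demo-scope": {"user": "alice", "password": "s3cret"}})
creds = load_secrets(fake, "demo-scope", ["user", "password"])
print(sorted(creds))  # ['password', 'user']
```

In a notebook, load_secrets(dbutils.secrets, scope, keys) would be the corresponding call; avoid printing the returned values.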
Library utilities
Library utilities allow you to install Python libraries and create an environment scoped to a notebook session. The libraries are available both on the driver and on the executors, so you can reference them in UDFs. This enables:
- Library dependencies of a notebook to be organized within the notebook itself.
- Notebook users with different library dependencies to share a cluster without interference.
Detaching a notebook destroys this environment. However, you can recreate it by re-running the library install API commands in the notebook. See the restartPython API for how you can reset your notebook state without losing your environment.
Note
Library utilities are not available on Databricks Runtime ML or Databricks Runtime for Genomics. Instead, see Notebook-scoped Python libraries.
For Databricks Runtime 7.1 or above, you can also use %pip magic commands to install notebook-scoped libraries. See Notebook-scoped Python libraries.
Library utilities are enabled by default. Therefore, by default the Python environment for each notebook is isolated by a separate Python executable, which is created when the notebook is attached to a cluster and which inherits the default Python environment on the cluster. Libraries installed through an init script into the Azure Databricks Python environment are still available. You can disable this feature by setting spark.databricks.libraryIsolation.enabled to false.
This API is compatible with the existing cluster-wide library installation through the UI and REST API. Libraries installed through this API have higher priority than cluster-wide libraries.
dbutils.library.help()
install(path: String): boolean -> Install the library within the current notebook session
installPyPI(pypiPackage: String, version: String = "", repo: String = "", extras: String = ""): boolean -> Install the PyPI library within the current notebook session
list: List -> List the isolated libraries added for the current notebook session via dbutils
restartPython: void -> Restart python process for the current notebook session
Note
You can get detailed information about each command by using help, for example: dbutils.library.help("install")
Examples
Install a PyPI library in a notebook. version, repo, and extras are optional. Use the extras argument to specify the Extras feature (extra requirements).
dbutils.library.installPyPI("pypipackage", version="version", repo="repo", extras="extras")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
Important
The version and extras keys cannot be part of the PyPI package string. For example, dbutils.library.installPyPI("azureml-sdk[databricks]==1.0.8") is not valid. Use the version and extras arguments to specify the version and extras information as follows:
dbutils.library.installPyPI("azureml-sdk", version="1.0.8", extras="databricks")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
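If your requirements are already written in the combined name[extras]==version form, one option is to split them before calling installPyPI. The helper below is a hypothetical sketch, not part of dbutils. It handles only that simple form with an exact == pin and rejects anything else.

```python
import re

# Matches "name", "name[extras]", "name==version", or "name[extras]==version".
_REQUIREMENT = re.compile(
    r"^(?P<name>[A-Za-z0-9._-]+)(\[(?P<extras>[^\]]+)\])?(==(?P<version>.+))?$"
)


def split_requirement(requirement):
    """Split a requirement string into (name, options) for installPyPI."""
    match = _REQUIREMENT.match(requirement.strip())
    if match is None:
        raise ValueError(f"unsupported requirement: {requirement!r}")
    options = {}
    if match.group("version"):
        options["version"] = match.group("version")
    if match.group("extras"):
        options["extras"] = match.group("extras")
    return match.group("name"), options


name, options = split_requirement("azureml-sdk[databricks]==1.0.8")
print(name, options)  # azureml-sdk {'version': '1.0.8', 'extras': 'databricks'}
```

In a notebook, the result could then be passed on as dbutils.library.installPyPI(name, **options).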
Specify your library requirements in one notebook and install them through %run in the other.
Define the libraries to install in a notebook called InstallDependencies.
dbutils.library.installPyPI("torch")
dbutils.library.installPyPI("scikit-learn", version="0.19.1")
dbutils.library.installPyPI("azureml-sdk", extras="databricks")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
Install them in the notebook that needs those dependencies.
%run /path/to/InstallDependencies # Install the dependencies in the first cell
import torch
from sklearn.linear_model import LinearRegression
import azureml
# do the actual work
List the libraries installed in a notebook.
dbutils.library.list()
Reset the Python notebook state while maintaining the environment. This API is available only in Python notebooks. This can be used to:
Reload libraries that Azure Databricks preinstalled, using a different version. For example:
dbutils.library.installPyPI("numpy", version="1.15.4")
dbutils.library.restartPython()
# Make sure you start using the library in another cell.
import numpy
Install libraries like tensorflow that need to be loaded on process start up. For example:
dbutils.library.installPyPI("tensorflow")
dbutils.library.restartPython()
# Use the library in another cell.
import tensorflow
Install a .egg or .whl library in a notebook.
Important
We recommend that you put all your library install commands in the first cell of your notebook and call restartPython at the end of that cell, because running restartPython resets the Python notebook state: the notebook loses all state, including but not limited to local variables, imported libraries, and other ephemeral states.
The accepted library sources are dbfs, abfss, adl, and wasbs.
dbutils.library.install("abfss:/path/to/your/library.egg")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
dbutils.library.install("abfss:/path/to/your/library.whl")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
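The recommendation above (run every install command in the first cell, then call restartPython once at the end) can be wrapped in a small helper. The sketch below is illustrative only. RecordingLibrary is a hypothetical stand-in that records calls so the code runs outside a notebook; in a notebook you would pass dbutils.library instead.

```python
class RecordingLibrary:
    """Hypothetical stand-in for dbutils.library that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def installPyPI(self, pypiPackage, version="", repo="", extras=""):
        self.calls.append(("installPyPI", pypiPackage, version, extras))
        return True

    def restartPython(self):
        self.calls.append(("restartPython",))


def install_then_restart(library, packages):
    """Install each (name, options) pair, then restart Python once at the end."""
    for name, options in packages:
        library.installPyPI(name, **options)
    library.restartPython()


recorder = RecordingLibrary()
install_then_restart(
    recorder,
    [("torch", {}), ("azureml-sdk", {"extras": "databricks"})],
)
print(recorder.calls[-1])  # ('restartPython',)
```

Restarting once after all installs, rather than after each one, keeps the state reset confined to the end of the first cell.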
Databricks Utilities API library
Note
You cannot use this library to run pre-deployment tests if you are deploying to a cluster running Databricks Runtime 7.0 or above, because there is no version of the library that supports Scala 2.12.
To accelerate application development, it can be helpful to compile, build, and test applications before you deploy them as production jobs. To enable you to compile against Databricks Utilities, Databricks provides the dbutils-api library. You can download the dbutils-api library or include the library by adding a dependency to your build file:
SBT
libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.4"
Maven
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>dbutils-api_2.11</artifactId>
  <version>0.0.4</version>
</dependency>
Gradle
compile 'com.databricks:dbutils-api_2.11:0.0.4'
Once you build your application against this library, you can deploy the application.
Important
The dbutils-api library allows you to locally compile an application that uses dbutils, but not to run it. To run the application, you must deploy it in Azure Databricks.
Express the artifact's Scala version with %%
If you declare the dependency as groupID %% artifactID % revision instead of groupID % artifactID % revision (the difference is the double %% after the groupID), SBT adds your project's Scala version to the artifact name.
Example
Suppose the scalaVersion for your build is 2.9.1. You could write the dependency with % as follows:
val appDependencies = Seq(
"org.scala-tools" % "scala-stm_2.9.1" % "0.3"
)
The following declaration, which uses %%, is equivalent:
val appDependencies = Seq(
"org.scala-tools" %% "scala-stm" % "0.3"
)
Example projects
The example archive contains minimal example projects that show you how to compile against the dbutils-api library with three common build tools:
- sbt: sbt package
- Maven: mvn install
- Gradle: gradle build
These commands create output JARs in the following locations:
- sbt: target/scala-2.11/dbutils-api-example_2.11-0.0.1-SNAPSHOT.jar
- Maven: target/dbutils-api-example-0.0.1-SNAPSHOT.jar
- Gradle: build/examples/dbutils-api-example-0.0.1-SNAPSHOT.jar
You can attach this JAR to your cluster as a library, restart the cluster, and then run:
example.Test()
This statement creates a text input widget with the label Hello: and the initial value World.
You can use all the other dbutils APIs the same way.
To test an application that uses the dbutils object outside Databricks, you can mock up the dbutils object by calling:
com.databricks.dbutils_v1.DBUtilsHolder.dbutils0.set(
new com.databricks.dbutils_v1.DBUtilsV1{
...
}
)
Substitute your own DBUtilsV1 instance, in which you implement the interface methods however you like, for example by providing a local filesystem mockup for dbutils.fs.