Notebook-scoped Python libraries

Important

This feature is in Public Preview.

Notebook-scoped libraries let you create, save, reuse, and share custom Python environments that are specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected.

Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.

There are two methods for installing notebook-scoped libraries:

  • Using the %pip or %conda magic command in the notebook. The %pip command is supported on Databricks Runtime 7.1 and above. Both %pip and %conda are supported on Databricks Runtime 6.4 ML and above and on Databricks Runtime for Genomics 6.4 and above. This article describes how to use these magic commands.
  • Using Azure Databricks library utilities. This is supported only on Databricks Runtime, not on Databricks Runtime ML or Databricks Runtime for Genomics. See Library utilities.

To install libraries for all notebooks attached to a cluster, use workspace and cluster-installed libraries.

Requirements

This feature is enabled by default in Databricks Runtime 7.1 and above, Databricks Runtime 7.1 ML and above, and Databricks Runtime for Genomics 7.1 and above.

It is also available via a configuration setting in Databricks Runtime 6.4 ML through 7.0 ML and in Databricks Runtime for Genomics 6.4 through 7.0. Set the Spark configuration spark.databricks.conda.condaMagic.enabled to true for your cluster.
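For example, add the following line (the key and value separated by a space) to the Spark config field when you create or edit the cluster:

spark.databricks.conda.condaMagic.enabled true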

This feature is not compatible with table access control or credential passthrough on a high-concurrency cluster. You cannot use notebook-scoped libraries on a Databricks Runtime ML or Databricks Runtime for Genomics cluster that has either of those features enabled. An alternative is to use Library utilities on a Databricks Runtime cluster.

Driver node

Using notebook-scoped libraries might result in more traffic to the driver node as it works to keep the environment consistent across executor nodes. When you use a cluster with 10 or more nodes, Azure Databricks recommends these specs for the driver node:

  • For a 100-node CPU cluster, use Standard_DS5_v2.
  • For a 10-node GPU cluster, use Standard_NC12.

For larger clusters, use a larger driver node.

Using notebook-scoped libraries

Databricks Runtime uses %pip magic commands to create and manage notebook-scoped libraries. On Databricks Runtime ML and Databricks Runtime for Genomics, you can also use %conda magic commands. Azure Databricks recommends using pip to install libraries, unless the library you want to install recommends conda. For more information, see Understanding conda and pip.

Important

  • Place all %pip and %conda commands at the beginning of the notebook. The notebook state is reset after any %pip or %conda command that modifies the environment: if you create Python methods or variables in a notebook and then run a %pip or %conda command in a later cell, those methods and variables are lost (see the sketch after this list).
  • If you must use both %pip and %conda commands in a notebook, see Interactions between pip and conda commands.
  • In Databricks Runtime for Machine Learning, uninstalling or modifying core Python packages (for example, IPython or conda) with %pip or %conda may cause some features to stop working as expected. If you see issues, reset the environment by detaching and reattaching the notebook; if the issues persist, restart the cluster.
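A minimal sketch of the recommended cell ordering (the package, import, and variable names are illustrative):

# Cell 1: run all %pip/%conda commands first; each one resets the notebook state
%pip install matplotlib

# Cell 2: define imports, variables, and functions only after all installs,
# so they are not lost to a later environment reset
import matplotlib.pyplot as plt
threshold = 0.5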

Manage libraries with %pip commands

The following sections contain examples of how you can use %pip commands to manage the environment.

Use a requirements file to install libraries

A requirements file contains a list of packages to be installed using pip. The name of the file must end with requirements.txt. An example of using a requirements file is:

%pip install -r /dbfs/requirements.txt

See Requirements File Format for more information on requirements.txt files.
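For example, a requirements.txt might look like this (the packages and version pins are illustrative):

matplotlib==3.1.3
requests>=2.23,<3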

Use pip to install a library

%pip install matplotlib

Use pip to install a wheel package

%pip install /dbfs/my_package.whl

Use pip to uninstall a library

Note

In Databricks Runtime, you cannot uninstall a library that is included in Databricks Runtime or a library that has been installed as a cluster library. If you have installed a different version than the one included in Databricks Runtime or the one installed on the cluster, you can use %pip uninstall to revert the library to the default version in Databricks Runtime or to the version installed on the cluster. However, you cannot use %pip to uninstall the version of a library that is included in Databricks Runtime or installed on the cluster.

%pip uninstall -y matplotlib

Note

The -y option is required.

Use %pip to install PyPI libraries from a version control system project URL

%pip install git+https://github.com/databricks/databricks-cli

Note

You can add parameters to the URL to specify things like the version or git subdirectory. Refer to the pip install documentation for more information and for examples for other version control systems.
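For example, to install from a specific branch or tag, or from a subdirectory of the repository (the placeholders here are hypothetical):

%pip install git+https://github.com/<user>/<repository>.git@<branch-or-tag>#subdirectory=<path-to-package>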

Use %pip to install a package from a private package repository

%pip install --index-url http://<personal-access-token>@your-package-repository.com/your/file/path <package>==<version>

Use %pip to install a private package with credentials managed by Databricks secrets

The Databricks Secrets API allows you to store authentication tokens and passwords. Use the DBUtils API to access secrets from your notebook.

%pip install git+https://<token>@gitprovider.com/<user>/<repository>.git@<version>#egg=<package>

You can also use $variables in magic commands.

token = dbutils.secrets.get(scope="scope", key="key")
%pip install git+https://$token@gitprovider.com/<user>/<repository>.git

Use %pip to install a package from DBFS

You can use %pip to install a private package that has been saved on DBFS.

Note

When you upload a file to DBFS, it automatically renames the file, replacing spaces, periods, and hyphens with underscores. pip requires the wheel file name to use periods in the version (for example, 0.1.0) and hyphens instead of spaces or underscores. To install the package with %pip, you must first rename the file to meet these requirements, as shown below.
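For example, a wheel uploaded as mypackage_0_0_1_py3_none_any.whl could be renamed with dbutils before installing (the paths here are hypothetical):

dbutils.fs.mv("dbfs:/FileStore/mypackage_0_0_1_py3_none_any.whl", "dbfs:/mypackage-0.0.1-py3-none-any.whl")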

%pip install /dbfs/mypackage-0.0.1-py3-none-any.whl

Save libraries in a requirements file

%pip freeze > /dbfs/requirements.txt

Note

Any subdirectories in the file path must already exist. If you call %pip freeze > /dbfs/<new-directory>/requirements.txt, the command fails if the directory /dbfs/<new-directory> does not already exist.
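To write the file into a new directory, create the directory first, for example with dbutils:

# Cell 1: create the target directory on DBFS
dbutils.fs.mkdirs("dbfs:/<new-directory>")

# Cell 2: the freeze can now write into it
%pip freeze > /dbfs/<new-directory>/requirements.txt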

Manage libraries with %conda commands

Note

%conda magic commands are not available on Databricks Runtime. They are available on Databricks Runtime for Machine Learning and Databricks Runtime for Genomics.

The following sections contain examples of how you can use %conda commands to manage the environment.

Use conda to install a library

%conda install matplotlib

Use conda to uninstall a library

%conda uninstall matplotlib

Copy, reuse, and share an environment

When you detach a notebook from a cluster, the environment is not saved. To save an environment so you can reuse it later or share it with someone else, follow these steps.

Note

Azure Databricks recommends sharing environments only between clusters running the same version of Databricks Runtime ML or the same version of Databricks Runtime for Genomics.

  1. Save the environment as a conda YAML specification.

    %conda env export -f /dbfs/myenv.yml
    
  2. Import the file to another notebook using conda env update.

    %conda env update -f /dbfs/myenv.yml
    

List the Python environment of a notebook

To show the Python environment associated with a notebook, use %conda list:

%conda list

Interactions between pip and conda commands

To avoid conflicts, follow these guidelines when using pip or conda to install Python packages and libraries.

  • Libraries installed via the API or via the cluster UI are installed using pip. If any libraries have been installed from the API or the cluster UI, use only %pip commands when installing notebook-scoped libraries.
  • If you use notebook-scoped libraries on a cluster, init scripts run on that cluster can use either conda or pip commands to install libraries. However, if the init script includes pip commands, use only %pip commands in notebooks (not %conda).
  • It's best to use either pip commands exclusively or conda commands exclusively. If you must install some packages via conda and some via pip, run the conda commands first, and then run the pip commands (see the sketch after this list). For more information, see Using Pip in a Conda Environment.
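A minimal sketch of that ordering (the package names are placeholders):

# Cell 1: run all conda installs first
%conda install <conda-package>

# Cell 2: run pip installs only after every conda command has finished
%pip install <pip-package>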

Frequently asked questions (FAQ)

How do libraries installed from the cluster UI/API interact with notebook-scoped libraries?

Libraries installed from the cluster UI or API are available to all notebooks on the cluster. These libraries are installed using pip; therefore, if libraries are installed via the cluster UI, use only %pip commands in notebooks.

How do libraries installed via an init script interact with notebook-scoped libraries?

Libraries installed via an init script are available to all notebooks on the cluster.

If you use notebook-scoped libraries on a cluster running Databricks Runtime ML or Databricks Runtime for Genomics, init scripts run on the cluster can use either conda or pip commands to install libraries. However, if the init script includes pip commands, use only %pip commands in notebooks.

For example, this notebook code snippet generates a script that installs fast.ai packages on all the cluster nodes.

dbutils.fs.put("dbfs:/home/myScripts/fast.ai", "conda install -c pytorch -c fastai fastai -y", True)

Can I use %pip and %conda commands in job notebooks?

Yes.

Can I use %sh pip or %sh conda?

We do not recommend using %sh pip because it is not compatible with %pip usage.

Can I update R packages using %conda commands?

No.

Limitations

  • %pip and %conda commands cannot be used if either table access control or credential passthrough has been enabled on a high-concurrency cluster. An alternative is to use Library utilities on Databricks Runtime.

  • The following conda commands are not supported:

    • activate
    • create
    • init
    • run
    • env create
    • env remove

Known issues

  • On Databricks Runtime 7.0 ML and below, and on Databricks Runtime for Genomics 7.0 and below, a registered UDF that depends on Python packages installed via %pip or %conda does not work in %sql cells. Use spark.sql in a Python cell instead (see the sketch after this list).
  • On Databricks Runtime 7.2 ML and below, when you update the notebook environment using %conda, the new environment is not activated on worker Python processes. This can cause issues if a PySpark UDF calls a third-party function that uses resources installed inside the conda environment.
  • When you use %conda env update to update a notebook environment, the installation order of packages is not guaranteed. This can break the horovod package, which requires that tensorflow and torch be installed before horovod in order to use horovod.tensorflow or horovod.torch respectively. If this happens, uninstall the horovod package and reinstall it after ensuring that the dependencies are installed.
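For the first issue, a minimal sketch of the spark.sql workaround (the UDF, column, and table names are hypothetical):

# Instead of calling the UDF from a %sql cell, call it through spark.sql in a Python cell
df = spark.sql("SELECT my_udf(value) AS result FROM my_table")
display(df)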