Manage notebooks

You can manage notebooks using the UI, the CLI, and by invoking the Workspace API. This article focuses on performing notebook tasks using the UI. For the other methods, see Databricks CLI and Workspace API.
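As a minimal sketch of the API route, the snippet below builds an authenticated request for the Workspace API's list endpoint (`GET /api/2.0/workspace/list`). The host, token, and the helper name `list_request` are illustrative; sending the request with `urllib.request.urlopen(req)` would return a JSON listing of workspace objects.

```python
import urllib.parse
import urllib.request

def list_request(host, token, path):
    """Build an authenticated GET /api/2.0/workspace/list request
    for the given workspace path (hypothetical helper)."""
    url = f"{host}/api/2.0/workspace/list?path={urllib.parse.quote(path)}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

# Hypothetical workspace URL and personal access token:
req = list_request("https://adb-1234.5.azuredatabricks.net", "<token>", "/Users")
print(req.full_url)
```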

Create a notebook

  1. Click the Workspace button Workspace Icon or the Home button Home Icon in the sidebar. Do one of the following:
    • Next to any folder, click the Menu Dropdown on the right side of the text and select Create > Notebook.

      Create notebook

    • In the Workspace or a user folder, click Down Caret and select Create > Notebook.

  2. In the Create Notebook dialog, enter a name and select the notebook’s default language.
  3. If there are running clusters, the Cluster drop-down displays. Select the cluster you want to attach the notebook to.
  4. Click Create.

Open a notebook

In your workspace, click a notebook. The notebook path displays when you hover over the notebook title.

Delete a notebook

See Folders and Workspace object operations for information about how to access the workspace menu and delete notebooks or other items in the Workspace.

Copy notebook path

To copy a notebook file path without opening the notebook, right-click the notebook name or click the Menu Dropdown to the right of the notebook name and select Copy File Path.

Copy notebook path

Rename a notebook

To change the title of an open notebook, click the title and edit inline or click File > Rename.

Control access to a notebook

If your Azure Databricks account has the Azure Databricks Premium Plan, you can use Workspace access control to control who has access to a notebook.

Notebook external formats

Azure Databricks supports several notebook external formats:

  • Source file: A file containing only source code statements with the extension .scala, .py, .sql, or .r.
  • HTML: An Azure Databricks notebook with the extension .html.
  • DBC archive: A Databricks archive.
  • IPython notebook: A Jupyter notebook with the extension .ipynb.
  • RMarkdown: An R Markdown document with the extension .Rmd.
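When importing through the Workspace API rather than the UI, each of these formats has a format name (SOURCE, HTML, JUPYTER, DBC, R_MARKDOWN). The helper below, a hypothetical sketch, maps a file extension to that name:

```python
import os

# Map file extensions of the supported external formats to the
# Workspace API's format names (helper and mapping are illustrative).
EXTENSION_FORMATS = {
    ".scala": "SOURCE", ".py": "SOURCE", ".sql": "SOURCE", ".r": "SOURCE",
    ".html": "HTML",
    ".dbc": "DBC",
    ".ipynb": "JUPYTER",
    ".rmd": "R_MARKDOWN",
}

def guess_format(filename):
    """Infer the import format from a file name's extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in EXTENSION_FORMATS:
        raise ValueError(f"unsupported notebook file: {filename}")
    return EXTENSION_FORMATS[ext]
```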

In this section:

Import a notebook

You can import an external notebook from a URL or a file.

  1. Click the Workspace button Workspace Icon or the Home button Home Icon in the sidebar. Do one of the following:

    • Next to any folder, click the Menu Dropdown on the right side of the text and select Import.

    • In the Workspace or a user folder, click Down Caret and select Import.

      Import notebook

  2. Specify the URL or browse to a file containing a supported external format.

  3. Click Import.
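The same import can be done through the Workspace API (`POST /api/2.0/workspace/import`), which expects the file content base64-encoded. A sketch of building the request body; the target path and notebook source are placeholders:

```python
import base64
import json

def import_body(workspace_path, source_text, language="PYTHON", fmt="SOURCE"):
    """Build the JSON body for POST /api/2.0/workspace/import.
    The API expects the file content base64-encoded in "content"."""
    return json.dumps({
        "path": workspace_path,
        "format": fmt,
        "language": language,
        "content": base64.b64encode(source_text.encode("utf-8")).decode("ascii"),
        "overwrite": False,
    })

# Hypothetical target path and one-line notebook source:
body = import_body("/Users/someone@example.com/demo", "print('hello')")
```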

Export a notebook

In the notebook toolbar, select File > Export and a format.

Note

When you export a notebook as HTML, IPython notebook, or archive (DBC), and you have not cleared the results, the results of running the notebook are included.
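Exporting is also possible through the Workspace API (`GET /api/2.0/workspace/export`), which returns the file base64-encoded. A small sketch of decoding that response; the simulated response content is illustrative:

```python
import base64

def decode_export(response_json):
    """Decode the base64 "content" field returned by
    GET /api/2.0/workspace/export back to bytes."""
    return base64.b64decode(response_json["content"])

# Simulated response, standing in for a real SOURCE-format export:
fake = {"content": base64.b64encode(b"# Databricks notebook source\nprint(1)\n").decode()}
print(decode_export(fake).decode())
```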

Notebooks and clusters

Before you can do any work in a notebook, you must first attach the notebook to a cluster. This section describes how to attach and detach notebooks to and from clusters and what happens behind the scenes when you perform these actions.

In this section:

Execution contexts

When you attach a notebook to a cluster, Azure Databricks creates an execution context. An execution context contains the state for a REPL environment for each supported programming language: Python, R, Scala, and SQL. When you run a cell in a notebook, the command is dispatched to the appropriate language REPL environment and run.

You can also use the REST 1.2 API to create an execution context and send a command to run in that context. Similarly, the command is dispatched to the language REPL environment and run.

A cluster has a maximum number of execution contexts (145). Once the number of execution contexts has reached this threshold, you cannot attach a notebook to the cluster or create a new execution context.

Idle execution contexts

An execution context is considered idle when its last completed execution occurred longer ago than the idle threshold. The last completed execution is the last time the notebook finished executing commands. The idle threshold is the amount of time that must pass after the last completed execution before Azure Databricks attempts to automatically detach the notebook. The default idle threshold is 24 hours.

When a cluster has reached the maximum context limit, Azure Databricks removes (evicts) idle execution contexts as needed, starting with the least recently used. Even when a context is removed, the notebook using the context is still attached to the cluster and appears in the cluster’s notebook list. Streaming notebooks are considered actively running, and their context is not evicted until their execution is stopped. If an idle context is evicted, the UI displays a message indicating that the notebook using the context was detached due to being idle.

Notebook context evicted

If you attempt to attach a notebook to a cluster that has the maximum number of execution contexts and there are no idle contexts (or if auto-eviction is disabled), the UI displays a message saying that the current maximum execution contexts threshold has been reached and the notebook will remain in the detached state.

Notebook detached

If you fork a process, an idle execution context is still considered idle once execution of the request that forked the process returns. Forking separate processes is not recommended with Spark.

Configure context auto-eviction

You can configure context auto-eviction by setting the Spark property spark.databricks.chauffeur.enableIdleContextTracking.

  • In Databricks 5.0 and above, auto-eviction is enabled by default. You can disable auto-eviction for a cluster by setting spark.databricks.chauffeur.enableIdleContextTracking false.
  • In Databricks 4.3, auto-eviction is disabled by default. You can enable auto-eviction for a cluster by setting spark.databricks.chauffeur.enableIdleContextTracking true.

Attach a notebook to a cluster

To attach a notebook to a cluster:

  1. In the notebook toolbar, click Clusters Icon Detached Cluster Dropdown.
  2. From the drop-down, select a cluster.

Important

An attached notebook has the following Apache Spark variables defined.

Class Variable Name
SparkContext sc
SQLContext/HiveContext sqlContext
SparkSession (Spark 2.x) spark

Do not create a SparkSession, SparkContext, or SQLContext. Doing so will lead to inconsistent behavior.

Determine Spark and Databricks Runtime version

To determine the Spark version of the cluster your notebook is attached to, run:

spark.version

To determine the Databricks Runtime version of the cluster your notebook is attached to, run:

Scala
dbutils.notebook.getContext.tags("sparkVersion")
Python
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")

Note

Both this sparkVersion tag and the spark_version property required by the endpoints in the Clusters API and Jobs API refer to the Databricks Runtime version, not the Spark version.
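As a hedged illustration, runtime version strings of this era look like "5.0.x-scala2.11", so the Databricks Runtime number can be pulled off the front of the tag value; the helper name and example values are assumptions for illustration:

```python
import re

def runtime_version(spark_version_tag):
    """Extract the leading Databricks Runtime version number from a
    sparkVersion tag value such as "5.0.x-scala2.11" (illustrative)."""
    m = re.match(r"(\d+\.\d+)", spark_version_tag)
    if m is None:
        raise ValueError(f"unrecognized sparkVersion tag: {spark_version_tag}")
    return m.group(1)

print(runtime_version("5.0.x-scala2.11"))  # 5.0
```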

Detach a notebook from a cluster

  1. In the notebook toolbar, click Clusters Icon Attached Cluster Dropdown.

  2. Select Detach.

    Detach notebook

You can also detach notebooks from a cluster using the Notebooks tab on the cluster details page.

When you detach a notebook from a cluster, the execution context is removed and all computed variable values are cleared from the notebook.

Tip

Azure Databricks recommends that you detach unused notebooks from a cluster. This frees up memory space on the driver.

View all notebooks attached to a cluster

The Notebooks tab on the cluster details page displays all of the notebooks that are attached to a cluster. The tab also displays the status of each attached notebook, along with the last time a command was run from the notebook.

Cluster details attached notebooks

Schedule a notebook

To schedule a notebook job to run periodically:

  1. In the notebook toolbar, click the Schedule button at the top right.
  2. Click + New.
  3. Choose the schedule.
  4. Click OK.
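Scheduling can also be done programmatically: the Jobs API (`POST /api/2.0/jobs/create`) takes a notebook_task and a quartz cron schedule. The sketch below only builds the request body; the job name, notebook path, and cron expression are illustrative, and a cluster spec (new_cluster or existing_cluster_id) would also be required in practice:

```python
import json

def scheduled_notebook_job(name, notebook_path, cron, timezone_id="UTC"):
    """Build a request body for POST /api/2.0/jobs/create that runs a
    notebook on a quartz cron schedule (cluster spec omitted here)."""
    return {
        "name": name,
        "notebook_task": {"notebook_path": notebook_path},
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone_id,
        },
    }

# Hypothetical notebook path; run every day at 02:00:
job = scheduled_notebook_job("nightly-etl", "/Users/someone@example.com/nightly-etl", "0 0 2 * * ?")
print(json.dumps(job, indent=2))
```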

Distribute notebooks

To allow you to easily distribute Azure Databricks notebooks, Azure Databricks supports the Databricks archive, which is a package that can contain a folder of notebooks or a single notebook. A Databricks archive is a JAR file with extra metadata and has the extension .dbc. The notebooks contained in the archive are in an Azure Databricks internal format.
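Because a JAR is a ZIP container, the entries of a .dbc archive can be listed with standard zip tooling. The sketch below builds a tiny stand-in archive in memory rather than reading a real export, since the archive's internal layout is an Azure Databricks internal format:

```python
import io
import zipfile

def archive_entries(dbc_bytes):
    """List entry names in a .dbc archive (a JAR, i.e. ZIP, container)."""
    with zipfile.ZipFile(io.BytesIO(dbc_bytes)) as zf:
        return zf.namelist()

# Build a stand-in archive in memory; the entry name is illustrative.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo/notebook.python", "{}")
print(archive_entries(buf.getvalue()))  # ['demo/notebook.python']
```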

Import an archive

  1. Click Down Caret or Menu Dropdown to the right of a folder or notebook and select Import.
  2. Choose File or URL.
  3. Go to or drop a Databricks archive in the dropzone.
  4. Click Import. The archive is imported into Azure Databricks. If the archive contains a folder, Azure Databricks recreates that folder.

Export an archive

Click Down Caret or Menu Dropdown to the right of a folder or notebook and select Export > DBC Archive. Azure Databricks downloads a file named <[folder|notebook]-name>.dbc.