Configure clusters

This article explains the configuration options available when you create and edit Azure Databricks clusters. It focuses on creating and editing clusters using the UI. For other methods, see the Clusters CLI and the Clusters API.

Create cluster

Cluster policy

A cluster policy limits the ability to configure clusters based on a set of rules. The policy rules limit the attributes or attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users and groups, and thus limit which policies you can select when you create a cluster.
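A policy definition is a JSON document that maps cluster attributes to constraints. The following is only a hedged sketch of the general shape such a definition can take; the attribute names, rule types, and values shown here are illustrative assumptions, not taken from this article:

{
  "spark_version": { "type": "fixed", "value": "6.6.x-scala2.11", "hidden": true },
  "autotermination_minutes": { "type": "range", "maxValue": 120, "defaultValue": 60 }
}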

To configure a cluster policy, select it in the Policy drop-down.

Select cluster policy

Note

If no policies have been created in the workspace, the Policy drop-down does not display.

If you have:

  • Cluster create permission, you can select the Free form policy and create fully configurable clusters. The Free form policy does not limit any cluster attributes or attribute values.
  • Both cluster create permission and access to cluster policies, you can select the Free form policy and the policies you have access to.
  • Access to cluster policies only, you can select the policies you have access to.

Cluster mode

Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node. The default cluster mode is Standard.

Note

The cluster configuration includes an auto-termination setting whose default value depends on the cluster mode:

  • Standard and Single Node clusters are configured to terminate automatically after 120 minutes.
  • High Concurrency clusters are configured to not terminate automatically.

Important

You cannot change the cluster mode after a cluster is created. If you want a different cluster mode, you must create a new cluster.

标准群集 Standard clusters

建议在单用户模式下使用标准群集。Standard clusters are recommended for a single user. 标准群集可以运行采用以下任何语言开发的工作负荷:Python、R、Scala 和 SQL。Standard clusters can run workloads developed in any language: Python, R, Scala, and SQL.

High Concurrency clusters

A High Concurrency cluster is a managed cloud resource. The key benefit of High Concurrency clusters is that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies.

High Concurrency clusters support only SQL, Python, and R. The performance and security of High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.

In addition, only High Concurrency clusters support table access control.

To create a High Concurrency cluster, select High Concurrency in the Cluster Mode drop-down.

High Concurrency cluster mode

For an example of how to create a High Concurrency cluster using the Clusters API, see the High Concurrency cluster example.
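As a rough, illustrative sketch of such a request body (the exact fields are covered in the example linked above; the names and sizes here are assumptions), a High Concurrency cluster is typically expressed through the serverless cluster profile and the allowed REPL languages in spark_conf:

{
  "cluster_name": "high-concurrency-cluster",
  "spark_version": "6.6.x-scala2.11",
  "node_type_id": "Standard_D3_v2",
  "spark_conf": {
    "spark.databricks.cluster.profile": "serverless",
    "spark.databricks.repl.allowedLanguages": "sql,python,r"
  },
  "custom_tags": {
    "ResourceClass": "Serverless"
  },
  "autoscale": {
    "min_workers": 1,
    "max_workers": 2
  }
}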

Single Node clusters

A Single Node cluster has no workers and runs Spark jobs on the driver node. In contrast, a Standard mode cluster requires at least one Spark worker node in addition to the driver node to execute Spark jobs.

To create a Single Node cluster, select Single Node in the Cluster Mode drop-down.

Single Node cluster mode

To learn more about working with Single Node clusters, see Single Node clusters.
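For reference, a Single Node cluster created through the Clusters API is typically expressed with zero workers and the singleNode cluster profile. The following is an illustrative sketch rather than a verbatim example from this article:

{
  "cluster_name": "single-node-cluster",
  "spark_version": "6.6.x-scala2.11",
  "node_type_id": "Standard_D3_v2",
  "num_workers": 0,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": {
    "ResourceClass": "SingleNode"
  }
}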

Pool

Important

This feature is in Public Preview.

To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster's request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
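When you create the cluster through the Clusters API, attaching to a pool amounts to referencing the pool's ID in the request. The snippet below is a hedged sketch with a placeholder pool ID; the other values are illustrative:

{
  "cluster_name": "pooled-cluster",
  "spark_version": "6.6.x-scala2.11",
  "instance_pool_id": "<pool-id>",
  "num_workers": 2
}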

See Use a pool to learn more about working with pools in Azure Databricks.

Databricks Runtime

Databricks runtimes are the set of core components that run on your clusters. All Databricks runtimes include Apache Spark and add components and updates that improve usability, performance, and security.

Azure Databricks offers several types of runtimes, and several versions of those runtime types, in the Databricks Runtime Version drop-down when you create or edit a cluster.

For details, see Databricks runtimes.

Python version

Important

Python 2 reached its end of life on January 1, 2020. Python 2 is not supported in Databricks Runtime 6.0 and above. Databricks Runtime 5.5 and below continue to support Python 2.

Python clusters running Databricks Runtime 6.0 and above

Databricks Runtime 6.0 (Unsupported) and above supports only Python 3. For major changes related to the Python environment introduced by Databricks Runtime 6.0, see Python environment in the release notes.

Python clusters running Databricks Runtime 5.5 LTS

For Databricks Runtime 5.5 LTS, Spark jobs, Python notebook cells, and library installation all support both Python 2 and 3.

The default Python version for clusters created using the UI is Python 3. In Databricks Runtime 5.5 LTS, the default version for clusters created using the REST API is Python 2.

Specify Python version

To specify the Python version when you create a cluster using the UI, select it from the Python Version drop-down.

Cluster Python version

To specify the Python version when you create a cluster using the API, set the environment variable PYSPARK_PYTHON to /databricks/python/bin/python or /databricks/python3/bin/python3. For an example, see the REST API example Create a Python 3 cluster (Databricks Runtime 5.5 LTS).
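For instance, the relevant fragment of a Create cluster request on Databricks Runtime 5.5 LTS would set the variable through the spark_env_vars field. This is a minimal sketch; the surrounding cluster fields are omitted:

{
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  }
}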

To validate that the PYSPARK_PYTHON configuration took effect, run the following in a Python notebook (or in a %python cell):

import sys
print(sys.version)

If you specified /databricks/python3/bin/python3, it should print something like:

3.5.2 (default, Sep 10 2016, 08:21:44)
[GCC 5.4.0 20160609]

Important

For Databricks Runtime 5.5 LTS, when you run %sh python --version in a notebook, python refers to the Ubuntu system Python version, which is Python 2. Use /databricks/python/bin/python to refer to the version of Python used by Databricks notebooks and Spark; this path is automatically configured to point to the correct Python executable.

Frequently asked questions (FAQ)

Can I use both Python 2 and Python 3 notebooks on the same cluster?

No. The Python version is a cluster-wide setting and is not configurable on a per-notebook basis.

What libraries are installed on Python clusters?

For details on the specific libraries that are installed, see the Databricks runtime release notes.

Will my existing PyPI libraries work with Python 3?

It depends on whether the version of the library supports the Python 3 version of a given Databricks Runtime version. Databricks Runtime 5.5 LTS uses Python 3.5. Databricks Runtime 6.0 and above, and Databricks Runtime with Conda, use Python 3.7. It is possible that a specific old version of a Python library is not forward compatible with Python 3.7; in that case, you need to use a newer version of the library.

Will my existing .egg libraries work with Python 3?

It depends on whether your existing egg library is cross-compatible with both Python 2 and 3. If the library does not support Python 3, then either library attachment will fail or runtime errors will occur.

For a comprehensive guide on porting code to Python 3 and writing code compatible with both Python 2 and 3, see Supporting Python 3.

Can I still install Python libraries using init scripts?

A common use case for cluster node initialization scripts is to install packages. For Databricks Runtime 5.5 LTS, use /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For Databricks Runtime 6.0 and above, and Databricks Runtime with Conda, the pip command refers to the pip in the correct Python virtual environment. However, if you use an init script to create the Python virtual environment, always use the absolute path to access python and pip.
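As a minimal sketch of such an init script for Databricks Runtime 5.5 LTS (the package name is only an example):

#!/bin/bash
# Install a package into the Databricks Python virtual environment,
# not the Ubuntu system Python.
/databricks/python/bin/pip install requests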

Cluster node type

A cluster consists of one driver node and worker nodes. You can pick separate cloud provider instance types for the driver and worker nodes, although by default the driver node uses the same instance type as the worker node. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads.

Note

If your security requirements include compute isolation, select a Standard_F72s_V2 instance as your worker type. These instance types represent isolated virtual machines that consume the entire physical host and provide the level of isolation required to support, for example, US Department of Defense Impact Level 5 (IL5) workloads.

Driver node

The driver node maintains state information of all notebooks attached to the cluster. The driver node is also responsible for maintaining the SparkContext and interpreting all the commands you run from a notebook or a library on the cluster. The driver node also runs the Apache Spark master that coordinates with the Spark executors.

The default value of the driver node type is the same as the worker node type. You can choose a larger driver node type with more memory if you plan to collect() a lot of data from Spark workers and analyze it in the notebook.

Tip

Since the driver node maintains all of the state information of the attached notebooks, make sure to detach unused notebooks from the driver.

Worker node

Azure Databricks workers run the Spark executors and other services required for the proper functioning of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on workers. Azure Databricks runs one executor per worker node; therefore, the terms executor and worker are used interchangeably in the context of the Azure Databricks architecture.

Tip

To run a Spark job, you need at least one worker. If a cluster has zero workers, you can run non-Spark commands on the driver, but Spark commands will fail.

GPU instance types

For computationally challenging tasks that demand high performance, such as those associated with deep learning, Azure Databricks supports clusters accelerated with graphics processing units (GPUs). This support is in Beta. For more information, see GPU-enabled clusters.

Cluster size and autoscaling

When you create an Azure Databricks cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.

When you provide a fixed-size cluster, Azure Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.

With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during those phases of your job (and removes them when they are no longer needed).

Autoscaling makes it easier to achieve high cluster utilization, because you don't need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time, shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized, under-provisioned cluster.
  • Autoscaling clusters can reduce overall costs compared to a statically sized cluster.

Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these benefits. The cluster size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Note

Autoscaling is not available for spark-submit jobs.

Autoscaling types

Azure Databricks offers two types of cluster node autoscaling: standard and optimized. For a discussion of the benefits of optimized autoscaling, see the blog post on Optimized Autoscaling.

Automated (job) clusters always use optimized autoscaling. The type of autoscaling performed on all-purpose clusters depends on the workspace configuration.

Standard autoscaling is used by all-purpose clusters in workspaces in the Standard pricing tier. Optimized autoscaling is used by all-purpose clusters in the Azure Databricks Premium Plan.

How autoscaling behaves

Autoscaling behaves differently depending on whether it is optimized or standard, and whether it is applied to an all-purpose or a job cluster.

Optimized autoscaling

  • Scales up from min to max in 2 steps.
  • Can scale down even if the cluster is not idle, by looking at shuffle file state.
  • Scales down based on a percentage of current nodes.
  • On job clusters, scales down if the cluster is underutilized over the last 40 seconds.
  • On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds.

Standard autoscaling

  • Starts by adding 8 nodes. Thereafter, it scales up exponentially, but can take many steps to reach the max. You can customize the first step by setting the spark.databricks.autoscaling.standardFirstStepUp Spark configuration property.
  • Scales down only when the cluster is completely idle and has been underutilized for the last 10 minutes.
  • Scales down exponentially, starting with 1 node.

Enable and configure autoscaling

To allow Azure Databricks to resize your cluster automatically, enable autoscaling for the cluster and provide the minimum and maximum range of workers. (For the equivalent Clusters API fields, see the sketch after these steps.)

  1. Enable autoscaling.

    • All-Purpose cluster - On the Create Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box:

      Enable autoscaling

    • Job cluster - On the Configure Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box:

      Enable autoscaling

  2. Configure the min and max workers.

    Configure min and max workers

    Important

    If you are using an instance pool:

    • Make sure the requested cluster size is less than or equal to the minimum number of idle instances in the pool. If it is larger, cluster startup time will be equivalent to that of a cluster that doesn't use a pool.
    • Make sure the maximum cluster size is less than or equal to the maximum capacity of the pool. If it is larger, cluster creation will fail.
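If you create the cluster through the Clusters API instead of the UI, autoscaling is expressed with the autoscale field in place of num_workers. The fragment below is an illustrative sketch; the worker counts are only examples:

{
  "autoscale": {
    "min_workers": 5,
    "max_workers": 10
  }
}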

Autoscaling example

If you reconfigure a static cluster to be an autoscaling cluster, Azure Databricks immediately resizes the cluster within the minimum and maximum bounds and then starts autoscaling. As an example, the following table demonstrates what happens to clusters with a certain initial size if you reconfigure a cluster to autoscale between 5 and 10 nodes.

Initial size    Size after reconfiguration
6               6
12              10
3               5

Autoscaling local storage

It can often be difficult to estimate how much disk space a particular job will take. To save you from having to estimate how many gigabytes of managed disk to attach to your cluster at creation time, Azure Databricks automatically enables autoscaling local storage on all Azure Databricks clusters.

With autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your cluster's Spark workers. If a worker begins to run too low on disk space, Databricks automatically attaches a new managed disk to the worker before it runs out. Disks are attached up to a limit of 5 TB of total disk space per virtual machine (including the virtual machine's initial local storage).

The managed disks attached to a virtual machine are detached only when the virtual machine is returned to Azure. That is, managed disks are never detached from a virtual machine as long as it is part of a running cluster. To scale down managed disk usage, Azure Databricks recommends using this feature in a cluster configured with autoscaling (see Cluster size and autoscaling) or automatic termination.

Spark configuration

To fine-tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

    Spark configuration

When you configure a cluster using the Clusters API, set Spark properties in the spark_conf field in the Create cluster request or Edit cluster request.
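The spark_conf field is a simple map of property names to values. The fragment below is an illustrative sketch; the partition count is an assumed example value, while spark.speculation also appears in the full create call later in this article:

{
  "spark_conf": {
    "spark.sql.shuffle.partitions": "200",
    "spark.speculation": true
  }
}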

To set Spark properties for all clusters, create a global init script:

dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
  |#!/bin/bash
  |
  |cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
  |[driver] {
  |  "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
  |}
  |EOF
  """.stripMargin, true)

Enable local disk encryption

Note

This feature is not available for all Azure Databricks subscriptions. Contact your Microsoft or Databricks account representative to request access.

Some instance types you use to run clusters may have locally attached disks. Azure Databricks may store shuffle data or ephemeral data on these locally attached disks. To ensure that all data at rest is encrypted for all storage types, including shuffle data that is stored temporarily on your cluster's local disks, you can enable local disk encryption.

Important

Your workloads may run more slowly because of the performance impact of reading and writing encrypted data to and from local volumes.

When local disk encryption is enabled, Azure Databricks generates an encryption key locally that is unique to each cluster node and is used to encrypt all data stored on local disks. The scope of the key is local to each cluster node, and the key is destroyed along with the cluster node itself. During its lifetime, the key resides in memory for encryption and decryption and is stored encrypted on the disk.

To enable local disk encryption, you must use the Clusters API. During cluster creation or edit, set:

{
  "enable_local_disk_encryption": true
}

See Create and Edit in the Clusters API reference for examples of how to invoke these APIs.

Here is an example of a cluster create call that enables local disk encryption:

{
  "cluster_name": "my-cluster",
  "spark_version": "6.6.x-scala2.11",
  "node_type_id": "Standard_D3_v2",
  "enable_local_disk_encryption": true,
  "spark_conf": {
    "spark.speculation": true
  },
  "num_workers": 25
}

Environment variables

You can set environment variables that you can access from scripts running on a cluster.

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

  3. Set the environment variables in the Environment Variables field.

    Environment Variables field

You can also set environment variables using the spark_env_vars field in the Create cluster request or Edit cluster request Clusters API endpoints.
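Like spark_conf, the spark_env_vars field is a map of variable names to values. The variable in the fragment below is purely illustrative:

{
  "spark_env_vars": {
    "MY_ENVIRONMENT": "staging"
  }
}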

Note

The environment variables you set in this field are not available in cluster node initialization scripts. Init scripts support only a limited set of predefined environment variables.

Cluster tags

Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes.

Cluster tags propagate to these cloud resources along with pool tags and workspace (resource group) tags. For more information about how these tag types work together, see Monitor usage using cluster, pool, and workspace tags.

For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId.

In addition, on job clusters, Azure Databricks applies two default tags: RunName and JobId.

You can add custom tags when you create a cluster. To configure cluster tags:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. At the bottom of the page, click the Tags tab.

    Tags tab

  3. Add a key-value pair for each custom tag. You can add up to 43 custom tags.

Custom tags are displayed on Azure bills and updated whenever you add, edit, or delete a custom tag.
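Custom tags can also be supplied through the custom_tags field of a Create cluster or Edit cluster request. The tag keys and values below are illustrative assumptions, shown only as a sketch:

{
  "custom_tags": {
    "team": "data-engineering",
    "cost-center": "1234"
  }
}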

SSH access to clusters

SSH allows you to log into Apache Spark clusters remotely for advanced troubleshooting and installing custom software.

For security reasons, the SSH port is closed by default in Azure Databricks. If you want to enable SSH access to your Spark clusters, contact Azure Databricks support.

Note

SSH can be enabled only if your workspace is deployed in your own Azure virtual network.

Cluster log delivery

When you create a cluster, you can specify a location to deliver Spark driver, worker, and event logs. Logs are delivered every five minutes to your chosen destination. When a cluster is terminated, Azure Databricks guarantees to deliver all logs generated up until the cluster was terminated.

The destination of the logs depends on the cluster ID. If the specified destination is dbfs:/cluster-log-delivery, cluster logs for 0630-191345-leap375 are delivered to dbfs:/cluster-log-delivery/0630-191345-leap375.
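In a Clusters API request, the same destination can be expressed with the cluster_log_conf field. The fragment below is a hedged sketch that reuses the DBFS path from the example above:

{
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-log-delivery"
    }
  }
}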

To configure the log delivery location:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. At the bottom of the page, click the Logging tab.

    Cluster log delivery

  3. Select a destination type.

  4. Enter the cluster log path.

Note

This feature is also available in the REST API. See the Clusters API and Cluster log delivery examples.

Init scripts

A cluster node initialization (init) script is a shell script that runs during startup for each cluster node, before the Spark driver or worker JVM starts. You can use init scripts to install packages and libraries not included in the Databricks runtime, modify the JVM system classpath, set system properties and environment variables used by the JVM, or modify Spark configuration parameters, among other configuration tasks.

You can attach init scripts to a cluster by expanding the Advanced Options section and clicking the Init Scripts tab.

For detailed instructions, see Cluster node initialization scripts.
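When configuring a cluster through the Clusters API, init scripts are referenced with the init_scripts field. The DBFS path below is hypothetical and shown only as a sketch:

{
  "init_scripts": [
    {
      "dbfs": {
        "destination": "dbfs:/databricks/scripts/my-init.sh"
      }
    }
  ]
}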