Cluster node initialization scripts

An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or worker JVM starts.

Some examples of tasks performed by init scripts include:

  • Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Azure Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the Azure Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install <package-name>. (See the sketch after this list.)
  • Modify the JVM system classpath in special cases.
  • Set system properties and environment variables used by the JVM.
  • Modify Spark configuration parameters.
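
For example, here is a minimal sketch of an init script for the first of these tasks. It is an illustration rather than a prescribed script; the package name simplejson is an arbitrary stand-in:

#!/bin/bash
# Install a Python package that is not bundled with Databricks Runtime.
# Using the Databricks pip binary targets the Azure Databricks Python
# virtual environment rather than the system Python environment.
/databricks/python/bin/pip install simplejson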

Init script types

Azure Databricks supports two kinds of init scripts: cluster-scoped and global.

  • Cluster-scoped: run on every cluster configured with the script. This is the recommended way to run an init script.

  • Global (Public Preview): run on every cluster in the workspace. Global init scripts can help you enforce consistent cluster configurations across your workspace. Use them carefully, because they can cause unanticipated impacts, such as library conflicts. Only admin users can create global init scripts.

    Note

    Azure Databricks recently improved the behavior of global init scripts to make them more secure and their failures more visible. You should migrate existing legacy global init scripts to the new global init script framework. See Migrate from legacy to new global init scripts.

In addition, there are two kinds of init scripts that are deprecated. Databricks recommends that you migrate init scripts of these types to those listed above:

  • Cluster-named: run on a cluster with the same name as the script. Cluster-named init scripts are best-effort: they silently ignore failures and attempt to continue the cluster launch process. Use cluster-scoped init scripts instead; they are a complete replacement.
  • Legacy global: run on every cluster. They are less secure than the new global init script framework, silently ignore failures, and cannot reference environment variables. Use the new global init script framework instead.

Whenever you change any type of init script, you must restart all clusters affected by the script.
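
For example, after editing a script you could restart an affected cluster through the Clusters API; the cluster ID below is a placeholder:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1"
}' https://<databricks-instance>/api/2.0/clusters/restart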

Init script execution order

Init scripts are executed in the following order:

  1. Legacy global
  2. Cluster-named
  3. Global (new)
  4. Cluster-scoped

Environment variables

Cluster-scoped and (new) global init scripts support the following environment variables:

  • DB_CLUSTER_ID: the ID of the cluster on which the script is running. See Clusters API.
  • DB_CONTAINER_IP: the private IP address of the container in which Spark runs. The init script runs inside this container. See SparkNode.
  • DB_IS_DRIVER: whether the script is running on a driver node.
  • DB_DRIVER_IP: the IP address of the driver node.
  • DB_INSTANCE_TYPE: the instance type of the host VM.
  • DB_CLUSTER_NAME: the name of the cluster the script is executing on.
  • DB_PYTHON_VERSION: the version of Python used on the cluster. See Python version.
  • DB_IS_JOB_CLUSTER: whether the cluster was created to run a job. See Create a job.
  • SPARKPASSWORD: a path to a secret.

For example, if you want to run part of a script only on the driver node, you could write a script like the following:

echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  echo "Running on the driver node"    # put driver-only logic here
else
  echo "Running on a worker node"      # put worker-only logic here
fi
echo "Running on every node"           # logic here runs on both the driver and workers

Logging

Init script start and finish events are captured in cluster event logs. Details are captured in cluster logs. Global init script create, edit, and delete events are also captured in account-level diagnostic logs.

Init script events

Cluster event logs capture two init script events: INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED, indicating which scripts are scheduled for execution and which have completed successfully. INIT_SCRIPTS_FINISHED also captures execution duration.

Global init scripts are indicated in the log event details by the key "global", and cluster-scoped init scripts by the key "cluster".
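
For example, you could retrieve these events for a cluster through the Clusters API events endpoint. This is a sketch, assuming the event_types filter accepts these event names; the cluster ID is a placeholder:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "event_types": ["INIT_SCRIPTS_STARTED", "INIT_SCRIPTS_FINISHED"]
}' https://<databricks-instance>/api/2.0/clusters/events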

Note

Cluster event logs do not log init script events for each cluster node; only one node is selected to represent them all.

Init script logs

If cluster log delivery is configured for a cluster, init script logs are written to an init_scripts folder under the log delivery location. For each container, the logs appear in a subdirectory named <cluster_id>_<container_ip>. For example, if cluster logs are delivered to dbfs:/cluster-logs, the logs for a given cluster appear under dbfs:/cluster-logs/<cluster_id>/init_scripts:

dbfs ls dbfs:/cluster-logs/1001-234039-abcde739/init_scripts
1001-234039-abcde739_10_97_225_166
1001-234039-abcde739_10_97_231_88
1001-234039-abcde739_10_97_244_199

If the logs are delivered to DBFS, you can view them using File system utilities. Otherwise, you can use the following code in a notebook to view the logs:

%sh
ls /databricks/init_scripts/

Every time a cluster launches, it writes a log to the init script log folder.

Important

Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.

Diagnostic logs

Azure Databricks diagnostic logging captures global init script create, edit, and delete events under the event type globalInitScripts. See Diagnostic logging in Azure Databricks.

Cluster-scoped init scripts

Cluster-scoped init scripts are init scripts defined in a cluster configuration. They apply both to clusters you create and to clusters created to run jobs. Because the scripts are part of the cluster configuration, cluster access control lets you control who can change them.

You can configure cluster-scoped init scripts using the UI, the CLI, or the Clusters API. This section focuses on performing these tasks using the UI. For the other methods, see Databricks CLI and Clusters API.

You can add any number of scripts; they are executed sequentially in the order provided.

If a cluster-scoped init script returns a non-zero exit code, the cluster launch fails. You can troubleshoot cluster-scoped init scripts by configuring cluster log delivery and examining the init script log.
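
Because any non-zero exit code fails the launch, a script can explicitly tolerate steps that are safe to skip. A minimal sketch, where both package names are illustrative and my-optional-pkg is hypothetical:

#!/bin/bash
set -e                                          # a failed mandatory step aborts the script and fails the launch
/databricks/python/bin/pip install simplejson   # mandatory: let a failure here fail the cluster launch
/databricks/python/bin/pip install my-optional-pkg || true   # optional: ignore a failure here
exit 0                                          # report success so the cluster continues launching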

Cluster-scoped init script locations

You can put init scripts in a DBFS directory accessible by a cluster. Cluster-node init scripts in DBFS must be stored in the DBFS root. Azure Databricks does not support storing init scripts in a DBFS directory created by mounting object storage.

Example cluster-scoped init scripts

This section shows two examples of init scripts.

Example: Install PostgreSQL JDBC driver

The following snippets, run in a Python notebook, create an init script that installs a PostgreSQL JDBC driver.

  1. Create a DBFS directory where you want to store the init script. This example uses dbfs:/databricks/scripts.

    dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
    
  2. Create a script named postgresql-install.sh in that directory:

    dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
    
  3. Check that the script exists.

    display(dbutils.fs.ls("dbfs:/databricks/scripts/postgresql-install.sh"))
    

Alternatively, you can create the init script postgresql-install.sh locally:

#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar

and copy it to dbfs:/databricks/scripts using the DBFS CLI:

dbfs cp postgresql-install.sh dbfs:/databricks/scripts/postgresql-install.sh

Example: Use conda to install Python libraries

In Databricks Runtime ML, you use the Conda package manager to install Python packages. To install a Python library at cluster initialization, you can use a script like the following:

#!/bin/bash
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -y astropy

Note

For other ways to install Python packages on a cluster, see Libraries.

Configure a cluster-scoped init script

You can configure a cluster to run an init script using the UI or the API.

Important

  • The script must exist at the configured location. If the script doesn't exist, the cluster will fail to start or to scale up automatically.
  • The init script cannot be larger than 64KB. If a script exceeds that size, the cluster will fail to launch and a failure message will appear in the cluster log.

Configure a cluster-scoped init script using the UI

To use the cluster configuration page to configure a cluster to run an init script:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. At the bottom of the page, click the Init Scripts tab.

    Init Scripts tab

  3. In the Destination drop-down, select a destination type. In the example in the preceding section, the destination is DBFS.

  4. Specify a path to the init script. In the example in the preceding section, the path is dbfs:/databricks/scripts/postgresql-install.sh. The path must begin with dbfs:/.

  5. Click Add.

To remove a script from the cluster configuration, click the delete icon at the right of the script. When you confirm the deletion, you will be prompted to restart the cluster. Optionally, you can delete the script file from the location you uploaded it to.

Configure a cluster-scoped init script using the Clusters API

To use the Clusters API to configure the cluster with ID 1202-211320-brick1 to run the init script from the preceding section, run the following command:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "6.4.x-scala2.11",
  "node_type_id": "Standard_D3_v2",
  "cluster_log_conf": {
    "dbfs" : {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/databricks/scripts/postgresql-install.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit

Global init scripts (new)

Important

The new global init script framework is in Public Preview.

A global init script runs on every cluster created in your workspace. Global init scripts are useful when you want to enforce organization-wide library configurations or security screens. Only admins can create global init scripts. You can create them using either the UI or the REST API.

Important

Use global init scripts carefully. It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use cluster-scoped init scripts instead.

You can troubleshoot global init scripts by configuring cluster log delivery and examining the init script log.

Important

Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.

Add a global init script using the UI

To configure global init scripts using the Admin Console:

  1. Go to the Admin Console and click the Global Init Scripts tab.

    Global Init Scripts tab

  2. Click the + Add button.

  3. Name the script, then enter it by typing, pasting, or dragging a text file into the Script field.

    Add global init script

    Note

    The init script cannot be larger than 64KB. If a script exceeds that size, an error message appears when you try to save.

  4. If you have more than one global init script configured for your workspace, set the order in which the new script will run.

  5. If you want the script to be enabled for all new and restarted clusters after you save, toggle on the Enabled switch.

    Important

    You must restart running clusters for changes to global init scripts to take effect, including changes to run order, name, and enablement state.

  6. Click Add.

Edit a global init script using the UI

  1. Go to the Admin Console and click the Global Init Scripts tab.
  2. Click a script.
  3. Edit the script.
  4. Click Confirm.

Configure a global init script using the API

Admins can add, delete, re-order, and get information about the global init scripts in your workspace using the Global Init Scripts API.
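
For example, here is a sketch of creating an enabled global init script with this API. The script field must be base64-encoded; the value below decodes to a two-line script (#!/bin/bash followed by echo hello), and the name is illustrative:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "name": "Example script",
  "script": "IyEvYmluL2Jhc2gKZWNobyBoZWxsbw==",
  "enabled": true,
  "position": 0
}' https://<databricks-instance>/api/2.0/global-init-scripts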

Migrate from legacy to new global init scripts

If your Azure Databricks workspace was launched before August 2020, you might still have legacy global init scripts. You should migrate them to the new global init script framework to take advantage of the security, consistency, and visibility features included in the new framework.

  1. Copy your existing legacy global init scripts and add them to the new global init script framework, using either the UI or the REST API.

    Keep them disabled until you have completed the next step.

  2. Disable all legacy global init scripts.

    In the Admin Console, go to the Global Init Scripts tab and toggle off the Legacy Global Init Scripts switch.

    Disable legacy global init scripts

  3. Enable your new global init scripts.

    On the Global Init Scripts tab, toggle on the Enabled switch for each init script you want to enable.

    Enable global init scripts

  4. Restart all clusters.

    • Legacy scripts will not run on new nodes added during automated scale-up of running clusters, and neither will the new global init scripts. You must restart all clusters to ensure that the new scripts run on them and that no existing cluster adds new nodes with no global scripts running on them at all.
    • Non-idempotent scripts may need to be modified when you migrate to the new global init script framework and disable legacy scripts; see the sketch after this list.
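
For example, a legacy script that blindly appends a line to a configuration file would add the line twice if both the legacy copy and the new copy ran during the transition. A minimal sketch of an idempotent guard, where the target file and setting are illustrative assumptions:

#!/bin/bash
CONF=/tmp/example.conf    # illustrative target file
LINE="my.setting=true"    # illustrative setting to append
# Append the line only if it is not already present, so running the
# script more than once leaves the file unchanged after the first run.
grep -qxF "$LINE" "$CONF" 2>/dev/null || echo "$LINE" >> "$CONF"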

Legacy global init scripts (deprecated)

A global init script runs on every cluster created in your workspace.

Important

Legacy global init scripts are deprecated in favor of the new global init script framework, which is more secure, provides visibility into failures, and can reference cluster-related environment variables. You should migrate existing legacy global init scripts to the new framework to take advantage of these improvements.

A legacy global init script must be stored in dbfs:/databricks/init/.

Important

  • Use global init scripts carefully. It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use cluster-scoped init scripts instead.
  • If there is more than one legacy global init script, the order of execution is undetermined and depends on the order in which the DBFS client returns the scripts.

To delete a legacy global init script, delete the init script file. You can do this in a notebook, with the DBFS API, or with the DBFS CLI. For example:

dbutils.fs.rm("/databricks/init/my-echo.sh")

If you have created a legacy global init script that prevents new clusters from starting, use the API or CLI to move or delete the script.

Example legacy global init script

The following snippets, run in a Python notebook, create a legacy global init script named my-echo.sh in the DBFS location /databricks/init/:

  1. Create dbfs:/databricks/init/ if it doesn't exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Display the list of existing global init scripts.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    
  3. Create a script that simply appends to a file.

    dbutils.fs.put("dbfs:/databricks/init/my-echo.sh" ,"""
    #!/bin/bash
    
    echo "hello" >> /hello.txt
    """, True)
    
  4. Check that the script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    

Cluster-named init scripts (deprecated)

Cluster-named init scripts are scoped to a single cluster, specified by the cluster's name. They must be stored in the directory dbfs:/databricks/init/<cluster-name>. For example, to specify init scripts for the cluster named PostgreSQL, create the directory dbfs:/databricks/init/PostgreSQL and put all scripts that should run on cluster PostgreSQL in that directory.

Important

  • Cluster-named init scripts are deprecated. You should use cluster-scoped init scripts instead.
  • You cannot use cluster-named init scripts for clusters that run jobs, because job cluster names are generated on the fly. However, you can use cluster-scoped init scripts for job clusters.
  • Avoid spaces in cluster names, because they are used in the script and output paths.
  • If there is more than one cluster-named init script, the order of execution is undetermined and depends on the order in which the DBFS client returns the scripts.

To delete a cluster-named init script, delete the init script file. You can do this in a notebook, with the DBFS API, or with the DBFS CLI. For example:

dbutils.fs.rm("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh")

Example cluster-named init script

The following snippets, run in a Python notebook, create an init script for the cluster named PostgreSQL, stored in the DBFS directory /databricks/init/PostgreSQL, that installs the PostgreSQL JDBC driver on that cluster. The commands are easy to adapt if you create a variable clusterName that holds the cluster name.

  1. Create dbfs:/databricks/init/ if it doesn't exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Configure a cluster name variable.

    clusterName = "PostgreSQL"
    
  3. Create a subdirectory named PostgreSQL.

    dbutils.fs.mkdirs("dbfs:/databricks/init/%s/"%clusterName)
    
  4. Create the script in the PostgreSQL directory.

    dbutils.fs.put("/databricks/init/PostgreSQL/postgresql-install.sh","""
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
    wget --quiet -O /mnt/jars/driver-daemon/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
    
  5. Check that the cluster-specific init script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/%s/postgresql-install.sh"%clusterName))
    

Legacy global and cluster-named init script logs (deprecated)

Databricks saves all init script output for legacy global and cluster-named init scripts to a file in DBFS named as follows: dbfs:/databricks/init/output/<cluster-name>/<date-timestamp>/<script-name>_<node-ip>.log. For example, if a cluster named PostgreSQL has two Spark nodes with IPs 10.0.0.1 and 10.0.0.2, and its init script directory contains a script called installpostgres.sh, there will be two output files at the following paths:

  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.1.log
  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.2.log