Tutorial: Create your first custom Databricks Asset Bundle template

2024-11-11

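In this tutorial, you create a custom Databricks Asset Bundle template. Bundles generated from this template run a job with a specific Python task on a cluster that uses a specific Docker container image.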

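Before you begin

Install the Databricks CLI version 0.218.0 or above. If it is already installed, run databricks -v from the command line and confirm that the reported version is 0.218.0 or above.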
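
If you want to automate the version check, the comparison can be sketched in Python as below. This is a minimal sketch: the helper names are hypothetical, and the sample string assumes the CLI prints something like "Databricks CLI v0.218.0", so adjust it to whatever your installation actually reports.

```python
import re

# Minimum CLI version required by this tutorial.
MIN_VERSION = (0, 218, 0)


def parse_cli_version(output: str) -> tuple:
    """Extract the first MAJOR.MINOR.PATCH triple found in the version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(output: str, minimum: tuple = MIN_VERSION) -> bool:
    """Return True if the reported version is at least the minimum."""
    return parse_cli_version(output) >= minimum


# Assumed sample output; replace it with the text printed by `databricks -v`.
sample_output = "Databricks CLI v0.218.0"
print(meets_minimum(sample_output))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where "0.99.0" would sort after "0.218.0".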

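Define user prompt variables

The first step in building a bundle template is to define the user prompt variables for databricks bundle init. From the command line:

Create an empty directory named dab-container-template: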
```
mkdir dab-container-template
```
In the root of that directory, create a file named databricks_template_schema.json:
```
cd dab-container-template
touch databricks_template_schema.json
```

Add the following contents to databricks_template_schema.json and save the file. Each variable becomes a user prompt during bundle creation.

```
{
  "properties": {
    "project_name": {
      "type": "string",
      "default": "project_name",
      "description": "Project name",
      "order": 1
    }
  }
}
```
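
To see how the schema maps to prompts, the following Python sketch loads the same JSON and lists the variables in the order given by their order field. It only illustrates the idea; the function name is hypothetical and the printed format is not the CLI's actual prompt text.

```python
import json

# The same schema as in databricks_template_schema.json.
SCHEMA_TEXT = """
{
  "properties": {
    "project_name": {
      "type": "string",
      "default": "project_name",
      "description": "Project name",
      "order": 1
    }
  }
}
"""


def prompts_in_order(schema: dict) -> list:
    """Return (name, description, default) tuples sorted by the 'order' field."""
    properties = schema.get("properties", {})
    ordered = sorted(properties.items(), key=lambda item: item[1].get("order", 0))
    return [
        (name, spec.get("description", ""), str(spec.get("default", "")))
        for name, spec in ordered
    ]


schema = json.loads(SCHEMA_TEXT)
for name, description, default in prompts_in_order(schema):
    # Illustrative format only: description, default value, variable name.
    print(f"{description} [{default}] -> {name}")
```

If you add more properties to the schema, giving each a distinct order value keeps the prompt sequence predictable.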

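Create the bundle folder structure

Next, in the dab-container-template directory, create a template folder with subdirectories named resources and src. The template folder contains the directory structure of the generated bundles. Names of subdirectories and files that are derived from user values follow Go template syntax.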

  mkdir -p "template/resources"
  mkdir -p "template/src"
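
The effect of Go template syntax in file and folder names can be approximated with the Python sketch below. It is a deliberately simplified stand-in, not the CLI's template engine: it handles only plain {{.variable}} placeholders and drops a trailing .tmpl suffix. The function name and the value my_project are hypothetical.

```python
import re

# Matches simple placeholders such as {{.project_name}}.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render_path(template_path: str, values: dict) -> str:
    """Substitute {{.name}} placeholders and drop a trailing .tmpl suffix."""
    rendered = PLACEHOLDER.sub(lambda match: values[match.group(1)], template_path)
    if rendered.endswith(".tmpl"):
        rendered = rendered[: -len(".tmpl")]
    return rendered


# Hypothetical answer to the "Project name" prompt.
values = {"project_name": "my_project"}
print(render_path("resources/{{.project_name}}_job.yml.tmpl", values))
print(render_path("src/{{.project_name}}/task.py", values))
```

With these inputs, the two paths become resources/my_project_job.yml and src/my_project/task.py.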

Add YAML configuration templates

In the template directory, create a file named databricks.yml.tmpl and add the following YAML:

  touch template/databricks.yml.tmpl

  # This is a Databricks asset bundle definition for {{.project_name}}.
  # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
  bundle:
    name: {{.project_name}}

  include:
    - resources/*.yml

  targets:
    # The 'dev' target, used for development purposes.
    # Whenever a developer deploys using 'dev', they get their own copy.
    dev:
      # We use 'mode: development' to make sure everything deployed to this target gets a prefix
      # like '[dev my_user_name]'. Setting this mode also disables any schedules and
      # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
      mode: development
      default: true
      workspace:
        host: {{workspace_host}}

    # The 'prod' target, used for production deployment.
    prod:
      # For production deployments, we only have a single copy, so we override the
      # workspace.root_path default of
      # /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
      # to a path that is not specific to the current user.
      #
      # By making use of 'mode: production' we enable strict checks
      # to make sure we have correctly configured this target.
      mode: production
      workspace:
        host: {{workspace_host}}
        root_path: /Shared/.bundle/prod/${bundle.name}
      {{- if not is_service_principal}}
      run_as:
        # This runs as {{user_name}} in production. Alternatively,
        # a service principal could be used here using service_principal_name
        # (see Databricks documentation).
        user_name: {{user_name}}
      {{end -}}

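Create another YAML file named {{.project_name}}_job.yml.tmpl and place it in the template/resources directory. This file keeps the project's job definitions separate from the rest of the bundle definition. Add the following YAML to the file to describe the template job, which contains a specific Python task that runs on a job cluster using a specific Docker container image: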

  touch template/resources/{{.project_name}}_job.yml.tmpl

  # The main job for {{.project_name}}
  resources:
    jobs:
      {{.project_name}}_job:
        name: {{.project_name}}_job
        tasks:
          - task_key: python_task
            job_cluster_key: job_cluster
            spark_python_task:
              python_file: ../src/{{.project_name}}/task.py
        job_clusters:
          - job_cluster_key: job_cluster
            new_cluster:
              docker_image:
                url: databricksruntime/python:10.4-LTS
              node_type_id: i3.xlarge
              spark_version: 13.3.x-scala2.12

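This example uses a default Databricks base Docker container image, but you can specify your own custom image instead.

Add files referenced in your configuration

Next, create a template/src/{{.project_name}} directory, and in it create the Python task file that the job in the template references: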

  mkdir -p template/src/{{.project_name}}
  touch template/src/{{.project_name}}/task.py

Now, add the following to task.py:

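  import pyspark
  from pyspark.sql import SparkSession

  # Get the active Spark session, or create one if none exists.
  spark = SparkSession.builder.master('local[*]').appName('example').getOrCreate()

  # Print the Spark version to confirm the task ran.
  print(f'Spark version: {spark.version}')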

Verify the bundle template structure

Review the folder structure of your bundle template project. It should look like this:

  .
  ├── databricks_template_schema.json
  └── template
      ├── databricks.yml.tmpl
      ├── resources
      │   └── {{.project_name}}_job.yml.tmpl
      └── src
          └── {{.project_name}}
              └── task.py
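
The same check can be scripted. The sketch below is one possible way to do it in Python: it reports which of the files in the tree above are missing under a given template root. The function name is hypothetical, and the example assumes you run it from inside dab-container-template.

```python
from pathlib import Path

# Files expected in the template project, relative to its root.
EXPECTED_PATHS = [
    "databricks_template_schema.json",
    "template/databricks.yml.tmpl",
    "template/resources/{{.project_name}}_job.yml.tmpl",
    "template/src/{{.project_name}}/task.py",
]


def missing_paths(template_root: Path) -> list:
    """Return the expected files that do not exist under template_root."""
    return [
        relative
        for relative in EXPECTED_PATHS
        if not (template_root / relative).is_file()
    ]


# Assumes the current working directory is the template project root.
missing = missing_paths(Path("."))
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("Template structure looks complete.")
```

The curly braces are literal characters in these file names; they are replaced only when a bundle is generated from the template.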

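Test your template

Finally, test your bundle template. To generate a bundle based on your new custom template, use the databricks bundle init command and specify the location of the new template. From your bundle projects root folder: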

  mkdir my-new-container-bundle
  cd my-new-container-bundle
  # Assumes dab-container-template sits next to my-new-container-bundle.
  databricks bundle init ../dab-container-template
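
For repeatable tests, you may prefer to answer the prompts from a file rather than interactively. The Python sketch below assembles such an invocation using the --output-dir and --config-file options of databricks bundle init; check databricks bundle init --help to confirm that your CLI version supports them. The helper name, the project name my_project, and the file name init_config.json are hypothetical, and the template path must be adjusted to where your template lives relative to the working directory.

```python
import json


def build_init_command(template_dir: str, output_dir: str, config_file: str) -> list:
    """Assemble the argument list for a non-interactive bundle init run."""
    return [
        "databricks", "bundle", "init", template_dir,
        "--output-dir", output_dir,
        "--config-file", config_file,
    ]


# Prompt answers keyed by the variable names in databricks_template_schema.json.
config_text = json.dumps({"project_name": "my_project"}, indent=2)

# Hypothetical paths; adjust them to your own folder layout.
command = build_init_command("../dab-container-template", ".", "init_config.json")

print(config_text)
print(" ".join(command))
```

To run it, save config_text to init_config.json and pass command to subprocess.run(command, check=True). Building the command as a list avoids shell quoting problems in paths.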

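Next steps

Create a bundle that deploys a notebook to an Azure Databricks workspace and then runs the deployed notebook as an Azure Databricks job. See Develop a job on Azure Databricks using Databricks Asset Bundles.
Create a bundle that deploys a notebook to an Azure Databricks workspace and then runs the deployed notebook as a Delta Live Tables pipeline. See Develop Delta Live Tables pipelines with Databricks Asset Bundles.
Create a bundle that deploys and runs an MLOps Stack. See Databricks Asset Bundles for MLOps Stacks.
Add a bundle to a CI/CD (continuous integration/continuous deployment) workflow in GitHub. See Run a CI/CD workflow with a Databricks Asset Bundle and GitHub Actions.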

Resources

Bundle examples repository in GitHub
