捆绑包配置示例

本文提供 Databricks 资产捆绑包功能和常见捆绑包用例的示例配置。

捆绑包示例 GitHub 存储库中提供了下表中概述的完整捆绑示例：

捆绑名称	Description
app_with_database	包含由 OLTP Postgres 数据库支持的 Databricks 应用的捆绑包
dashboard_nyc_taxi	包含 AI/BI 仪表板和作业的捆绑包，用于捕获仪表板快照并将其通过电子邮件发送到订阅者
database_with_catalog	定义 OLTP 数据库实例和数据库目录的捆绑包
databricks_app	定义 Databricks 应用的捆绑包
development_cluster	定义和使用开发（用途全用途）群集的捆绑包
job_read_secret	定义机密范围和作业的捆绑包，其中包含从中读取的任务
job_with_multiple_wheels	定义和使用具有多个滚轮依赖项的作业的捆绑包
job_with_run_job_tasks	包含具有运行作业任务的多个作业的捆绑包
job_with_sql_notebook	包含使用 SQL 笔记本任务的作业的捆绑包
pipeline_with_schema	定义 Unity 目录架构和使用它的管道的捆绑包
private_wheel_packages	使用作业中专用滚轮包的捆绑包
python_wheel_poetry	使用诗歌构建的 `whl` 捆绑包
serverless_job	使用无服务器计算运行作业的捆绑包
share_files_across_bundles	一个捆绑包，其中包含位于捆绑包根目录外部的文件。
spark_jar_task	定义和使用 Spark JAR 任务的捆绑包
write_from_job_to_volume	将文件写入 Unity 目录卷的捆绑包

捆绑方案

本部分包含演示如何使用顶级捆绑包映射的配置示例。请参阅配置参考。

将 JAR 文件上传到 Unity 目录的捆绑包

可以将 Unity 目录卷指定为项目路径，以便所有项目（如 JAR 文件和 wheel 文件）都上传到 Unity 目录卷。以下示例捆绑包生成 JAR 文件并将其上传到 Unity 目录。有关映射的信息 artifact_path ，请参阅 artifact_path。有关信息 artifacts，请参阅项目。

bundle:
  name: jar-bundle

workspace:
  host: https://myworkspace.cloud.databricks.com
  artifact_path: /Volumes/main/default/my_volume

artifacts:
  my_java_code:
    path: ./sample-java
    build: 'javac PrintArgs.java && jar cvfm PrintArgs.jar META-INF/MANIFEST.MF PrintArgs.class'
    files:
      - source: ./sample-java/PrintArgs.jar

resources:
  jobs:
    jar_job:
      name: 'Spark Jar Job'
      tasks:
        - task_key: SparkJarTask
          new_cluster:
            num_workers: 1
            spark_version: '14.3.x-scala2.12'
            node_type_id: 'i3.xlarge'
          spark_jar_task:
            main_class_name: PrintArgs
          libraries:
            - jar: ./sample-java/PrintArgs.jar

作业配置

本部分包含作业配置示例。有关作业配置详细信息，请参阅作业。

使用无服务器计算的作业

Databricks 资产捆绑包支持在无服务器计算上运行的作业。请参阅使用适用于工作流的无服务器计算运行 Lakeflow 作业。若要配置此功能，可以省略 clusters 具有笔记本任务的作业的设置，也可以指定环境，如以下示例所示。对于 Python 脚本、Python 滚轮和 dbt 任务， environment_key 无服务器计算是必需的。请参阅 environment_key。

# A serverless job (no cluster definition)
resources:
  jobs:
    serverless_job_no_cluster:
      name: serverless_job_no_cluster

      email_notifications:
        on_failure:
          - someone@example.com

      tasks:
        - task_key: notebook_task
          notebook_task:
            notebook_path: ../src/notebook.ipynb

# A serverless job (environment spec)
resources:
  jobs:
    serverless_job_environment:
      name: serverless_job_environment

      tasks:
        - task_key: task
          spark_python_task:
            python_file: ../src/main.py

          # The key that references an environment spec in a job.
          # https://docs.databricks.com/api/workspace/jobs/create#tasks-environment_key
          environment_key: default

      # A list of task execution environment specifications that can be referenced by tasks of this job.
      environments:
        - environment_key: default

          # Full documentation of this spec can be found at:
          # https://docs.databricks.com/api/workspace/jobs/create#environments-spec
          spec:
            environment_version: '2'
            dependencies:
              - my-library

具有多个 wheel 文件的作业

以下示例配置定义包含具有多个 *.whl 文件的作业的捆绑包。

# job.yml
resources:
  jobs:
    example_job:
      name: 'Example with multiple wheels'
      tasks:
        - task_key: task

          spark_python_task:
            python_file: ../src/call_wheel.py

          libraries:
            - whl: ../my_custom_wheel1/dist/*.whl
            - whl: ../my_custom_wheel2/dist/*.whl

          new_cluster:
            node_type_id: i3.xlarge
            num_workers: 0
            spark_version: 14.3.x-scala2.12
            spark_conf:
              'spark.databricks.cluster.profile': 'singleNode'
              'spark.master': 'local[*, 4]'
            custom_tags:
              'ResourceClass': 'SingleNode'

# databricks.yml
bundle:
  name: job_with_multiple_wheels

include:
  - ./resources/job.yml

workspace:
  host: https://myworkspace.cloud.databricks.com

artifacts:
  my_custom_wheel1:
    type: whl
    build: poetry build
    path: ./my_custom_wheel1

  my_custom_wheel2:
    type: whl
    build: poetry build
    path: ./my_custom_wheel2

targets:
  dev:
    default: true
    mode: development

包含参数的作业

以下示例配置使用参数定义作业。有关参数化作业的详细信息，请参阅参数化作业。

resources:
  jobs:
    job_with_parameters:
      name: job_with_parameters

      tasks:
        - task_key: task_a
          spark_python_task:
            python_file: ../src/file.py
            parameters:
              - '--param1={{ job.parameters.param1 }}'
              - '--param2={{ job.parameters.param2 }}'

          new_cluster:
            node_type_id: i3.xlarge
            num_workers: 1
            spark_version: 14.3.x-scala2.12

      parameters:
        - name: param1
          default: value1
        - name: param2
          default: value1

可以在运行时通过将作业参数传递给bundle run 来设置这些参数，例如：

databricks bundle run -- --param1=value2 --param2=value2

使用 requirements.txt 文件的作业

以下示例配置定义使用 requirements.txt 文件的作业。

resources:
  jobs:
    job_with_requirements_txt:
      name: 'Example job that uses a requirements.txt file'
      tasks:
        - task_key: task
          job_cluster_key: default
          spark_python_task:
            python_file: ../src/main.py
          libraries:
            - requirements: /Workspace/${workspace.file_path}/requirements.txt

按计划作业

以下示例显示了按计划运行的作业的配置。有关作业计划和触发器的信息，请参阅使用计划和触发器自动执行作业。

此配置定义在指定时间每天运行的作业：

resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      schedule:
        quartz_cron_expression: '0 0 8 * * ?' # daily at 8am
        timezone_id: UTC
        pause_status: UNPAUSED

在此配置中，作业在上次运行作业后的一周运行：

resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      trigger:
        pause_status: UNPAUSED
        periodic:
          interval: 1
          unit: WEEKS

管道配置

本部分包含管道配置示例。有关管道配置信息，请参阅管道。

使用无服务器计算的管道

Databricks 资产捆绑包支持无服务器计算上运行的管道。若要对此进行配置，请将管道 serverless 设置设置为 true。以下示例配置定义一个管道，该管道在安装了依赖项的无服务器计算上运行，以及每小时触发一次管道刷新的作业。

# A pipeline that runs on serverless compute
resources:
  pipelines:
    my_pipeline:
      name: my_pipeline
      target: ${bundle.environment}
      serverless: true
      environment:
        dependencies:
          - 'dist/*.whl'
      catalog: users
      libraries:
        - notebook:
            path: ../src/my_pipeline.ipynb

      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src

# This defines a job to refresh a pipeline that is triggered every hour
resources:
  jobs:
    my_job:
      name: my_job

      # Run this job once an hour.
      trigger:
        periodic:
          interval: 1
          unit: HOURS

      email_notifications:
        on_failure:
          - someone@example.com

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}

Last updated on 2026-01-29

通过