CLI (v2) Spark component YAML schema

Article
08/29/2024

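Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed to work only with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax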

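Key	Type	Description	Allowed values
`$schema`	string	The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including `$schema` at the top of the file enables schema and resource completions.
`type`	const	Required. The type of component.	`spark`
`name`	string	Required. The name of the component. Must start with a lowercase letter. Allowed characters are lowercase letters, numbers, and underscores (_). Maximum length is 255 characters.
`version`	string	The version of the component. If omitted, Azure Machine Learning autogenerates a version.
`display_name`	string	The display name of the component in the studio UI. It doesn't need to be unique within the workspace.
`description`	string	The description of the component.
`tags`	object	Dictionary of tags for the component.
`code`	string	Required. The location of the folder that contains the source code and scripts for this component.
`entry`	object	Required. The entry point for the component. It can define a `file`.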
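`entry.file`	string	The file that serves as the component's entry point, such as the Python script `titanic.py` in the Spark component example later in this article.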
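`py_files`	object	A list of `.zip`, `.egg`, or `.py` files to place on the `PYTHONPATH` so that jobs using this component run successfully.
`jars`	object	A list of `.jar` files to include on the Spark driver and executor `CLASSPATH` so that jobs using this component run successfully.
`files`	object	A list of files to copy to the working directory of each executor so that jobs using this component run successfully.
`archives`	object	A list of archives to extract into the working directory of each executor so that jobs using this component run successfully.
`conf`	object	The Spark driver and executor properties. See Attributes of the `conf` key.
`environment`	string or object	The environment to use for the component. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. To reference an existing environment, use the `azureml:<environment_name>:<environment_version>` syntax or `azureml:<environment_name>@latest` (to reference the latest version of an environment). To define an environment inline, follow the environment schema. Exclude the `name` and `version` properties, because inline environments don't support them.
`args`	string	The command-line arguments to pass to the component's entry-point Python script. These arguments can contain the paths of input data and the location to write the output, for example `"--input_data ${{inputs.<input_name>}} --output_path ${{outputs.<output_name>}}"`
`inputs`	object	Dictionary of component inputs. The key is a name for the input within the context of the component, and the value is the input value. Inputs can be referenced in `args` by using the `${{ inputs.<input_name> }}` expression.
`inputs.<input_name>`	number, integer, boolean, string, or object	Either a literal value (of type number, integer, boolean, or string) or an object that contains a component input data specification.
`outputs`	object	Dictionary of output configurations for the component. The key is a name for the output within the context of the component, and the value is the output configuration. Outputs can be referenced in `args` by using the `${{ outputs.<output_name> }}` expression.
`outputs.<output_name>`	object	The Spark component output. Output for a Spark component can be written to either a file or a folder location by providing an object that contains the component output specification.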
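
The examples later in this article don't use the optional dependency keys, so the following fragment is a minimal sketch of how `py_files`, `jars`, `files`, `archives`, and `environment` might appear in a component definition. Every file name is a hypothetical placeholder, the entries are written as YAML lists to match the "list of" wording in the table, and the paths are assumed (not confirmed by this article) to be relative to the `code` folder.

```yaml
# Fragment of a Spark component definition; all file names are hypothetical.
code: ./src
entry:
  file: titanic.py

# Extra Python modules to place on PYTHONPATH (assumed to live under ./src).
py_files:
  - helpers.zip
# JAR files for the Spark driver and executor CLASSPATH.
jars:
  - custom-udfs.jar
# Plain files copied to each executor's working directory.
files:
  - lookup.csv
# Archives extracted into each executor's working directory.
archives:
  - reference-data.zip

# Reference to the latest version of an existing workspace environment.
# Replace the placeholder with a real environment name.
environment: azureml:<environment_name>@latest
```

To pin a specific environment version instead, the table above gives the `azureml:<environment_name>:<environment_version>` form.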

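Attributes of the `conf` key

Key	Type	Description	Default value
`spark.driver.cores`	integer	The number of cores for the Spark driver.
`spark.driver.memory`	string	The memory allocated for the Spark driver, in gigabytes (GB), for example `2g`.
`spark.executor.cores`	integer	The number of cores for each Spark executor.
`spark.executor.memory`	string	The memory allocated for each Spark executor, in gigabytes (GB), for example `2g`.
`spark.dynamicAllocation.enabled`	boolean	Whether executors should be dynamically allocated, as a `True` or `False` value. If this property is set to `True`, define `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`. If this property is set to `False`, define `spark.executor.instances`.	`False`
`spark.dynamicAllocation.minExecutors`	integer	The minimum number of Spark executor instances for dynamic allocation.
`spark.dynamicAllocation.maxExecutors`	integer	The maximum number of Spark executor instances for dynamic allocation.
`spark.executor.instances`	integer	The number of Spark executor instances.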
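
The component example later in this article turns dynamic allocation on. As a complementary sketch, the fragment below shows the other branch described in the table: dynamic allocation turned off with a fixed executor count. The core, memory, and instance values are illustrative assumptions, not recommendations.

```yaml
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  # Dynamic allocation is off, so a fixed executor count is given
  # instead of spark.dynamicAllocation.minExecutors / maxExecutors.
  spark.dynamicAllocation.enabled: False
  spark.executor.instances: 2
```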

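Component inputs

Key	Type	Description	Allowed values	Default value
`type`	string	The type of component input. Specify `uri_file` for input data that points to a single file source, or `uri_folder` for input data that points to a folder source. For details, see the Azure Machine Learning data access documentation.	`uri_file`, `uri_folder`
`mode`	string	The mode in which the data is delivered to the compute target. The `direct` mode passes in the storage location URL as the component input. You are fully responsible for handling storage access credentials.	`direct`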

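Component outputs

Key	Type	Description	Allowed values	Default value
`type`	string	The type of component output.	`uri_file`, `uri_folder`
`mode`	string	The mode in which output files are delivered to the destination storage resource.	`direct`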
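
To tie the input and output tables back to `args`, here is a small sketch that declares one folder input and one folder output and passes both to the entry script. The names `raw_data` and `cleaned_data`, and the matching command-line flags, are hypothetical; the entry script is assumed to parse flags with those names.

```yaml
inputs:
  raw_data:              # hypothetical input name
    type: uri_folder     # the source is a folder, not a single file
    mode: direct         # the storage URL is passed to the script as-is

outputs:
  cleaned_data:          # hypothetical output name
    type: uri_folder
    mode: direct

# Each name declared above can be referenced in args.
args: >-
  --raw_data ${{inputs.raw_data}}
  --cleaned_data ${{outputs.cleaned_data}}
```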

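Remarks

The `az ml component` commands can be used to manage Azure Machine Learning Spark components.

Examples

Examples are available in the examples GitHub repository. Two are shown below.

YAML: A sample Spark component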

# spark-job-component.yaml
$schema: https://azuremlschemas.azureedge.net/latest/sparkComponent.schema.json
name: titanic_spark_component
type: spark
version: 1
display_name: Titanic-Spark-Component
description: Spark component for Titanic data

code: ./src
entry:
  file: titanic.py

inputs:
  titanic_data:
    type: uri_file
    mode: direct

outputs:
  wrangled_data:
    type: uri_folder
    mode: direct

args: >-
  --titanic_data ${{inputs.titanic_data}}
  --wrangled_data ${{outputs.wrangled_data}}

conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.dynamicAllocation.enabled: True
  spark.dynamicAllocation.minExecutors: 1
  spark.dynamicAllocation.maxExecutors: 4

YAML: A sample pipeline job with a Spark component

# attached-spark-pipeline-user-identity.yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: Titanic-Spark-CLI-Pipeline-2
description: Spark component for Titanic data in Pipeline

jobs:
  spark_job:
    type: spark
    component: ./spark-job-component.yaml
    inputs:
      titanic_data:
        type: uri_file
        path: azureml://datastores/workspaceblobstore/paths/data/titanic.csv
        mode: direct

    outputs:
      wrangled_data:
        type: uri_folder
        path: azureml://datastores/workspaceblobstore/paths/data/wrangled/
        mode: direct

    identity:
      type: user_identity

    compute: <ATTACHED_SPARK_POOL_NAME>
