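CLI (v2) command job YAML schema

Article
09/14/2024

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/commandJob.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed to work only with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax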

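Key	Type	Description	Allowed values	Default value
`$schema`	string	The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including `$schema` at the top of the file enables schema and resource completions.
`type`	const	The type of job.	`command`	`command`
`name`	string	Name of the job. Must be unique across all jobs in the workspace. If omitted, Azure Machine Learning autogenerates a GUID for the name.
`display_name`	string	Display name of the job in the studio UI. It doesn't have to be unique within the workspace. If omitted, Azure Machine Learning autogenerates a human-readable adjective-noun identifier for the display name.
`experiment_name`	string	Name of the experiment to organize the job under. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, Azure Machine Learning defaults it to the name of the working directory where the job was created.
`description`	string	Description of the job.
`tags`	object	Dictionary of tags for the job.
`command`	string	Required (if not using the `component` field). The command to execute.
`code`	string	Local path to the source code directory to be uploaded and used for the job.
`environment`	string or object	Required (if not using the `component` field). The environment to use for the job. Can be either a reference to an existing versioned environment in the workspace or an inline environment specification. To reference an existing environment, use the `azureml:<environment_name>:<environment_version>` syntax or `azureml:<environment_name>@latest` (to reference the latest version of an environment). To define an environment inline, follow the environment schema. Exclude the `name` and `version` properties, because inline environments don't support them.
`environment_variables`	object	Dictionary of environment variable key-value pairs to set on the process where the command is executed.
`distribution`	object	The distribution configuration for distributed training scenarios. One of MpiConfiguration, PyTorchConfiguration, or TensorFlowConfiguration.
`compute`	string	Name of the compute target to execute the job on. Can be either a reference to an existing compute in the workspace (using the `azureml:<compute_name>` syntax) or `local` to designate local execution. Note: jobs in a pipeline don't support `local` as `compute`.		`local`
`resources.instance_count`	integer	The number of nodes to use for the job.		`1`
`resources.instance_type`	string	The instance type to use for the job. Applicable for jobs running on Azure Arc-enabled Kubernetes compute (where the compute target specified in the `compute` field is of `type: kubernetes`). If omitted, defaults to the default instance type for the Kubernetes cluster. For more information, see Create and select Kubernetes instance types.
`resources.shm_size`	string	The size of the Docker container's shared memory block. Should be in the format `<number><unit>`, where the number must be greater than 0 and the unit is one of `b` (bytes), `k` (kilobytes), `m` (megabytes), or `g` (gigabytes).		`2g`
`limits.timeout`	integer	The maximum time, in seconds, that the job is allowed to run. When this limit is reached, the system cancels the job.
`inputs`	object	Dictionary of inputs to the job. The key is a name for the input within the context of the job and the value is the input value. Inputs can be referenced in the `command` by using the `${{ inputs.<input_name> }}` expression.
`inputs.<input_name>`	number, integer, boolean, string, or object	One of a literal value (of type number, integer, boolean, or string) or an object containing a job input data specification.
`outputs`	object	Dictionary of output configurations of the job. The key is a name for the output within the context of the job and the value is the output configuration. Outputs can be referenced in the `command` by using the `${{ outputs.<output_name> }}` expression.
`outputs.<output_name>`	object	You can leave the object empty, in which case the output defaults to type `uri_folder` and Azure Machine Learning generates an output location for it. Files in the output directory are written through a read-write mount. To specify a different mode for the output, provide an object containing the job output specification.
`identity`	object	The identity used for data access. It can be UserIdentityConfiguration, ManagedIdentityConfiguration, or None. With UserIdentityConfiguration, the identity of the job submitter is used to access the input data and write results to the output folder. Otherwise, the managed identity of the compute target is used.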
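
The `resources.*` and `limits.*` keys in the table above are written as nested mappings in the YAML file. The following is a minimal sketch of how they might be combined. The script name `train.py`, the `src` folder, and the `cpu-cluster` compute name are assumptions for illustration, and the values chosen for `shm_size` and `timeout` are arbitrary.

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
# Hypothetical script inside the uploaded "src" directory.
command: python train.py
code: src
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest
# Assumes a compute cluster named "cpu-cluster" exists in the workspace.
compute: azureml:cpu-cluster
resources:
  instance_count: 1   # number of nodes (this is also the default)
  shm_size: 4g        # <number><unit>; overrides the 2g default
limits:
  timeout: 3600       # seconds; the job is canceled after one hour
```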

Distribution configurations

MpiConfiguration

Key	Type	Description	Allowed values
`type`	const	Required. Distribution type.	`mpi`
`process_count_per_instance`	integer	Required. The number of processes per node to launch for the job.

PyTorchConfiguration

Key	Type	Description	Allowed values	Default value
`type`	const	Required. Distribution type.	`pytorch`
`process_count_per_instance`	integer	The number of processes per node to launch for the job.		`1`
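
Because `process_count_per_instance` is counted per node, the total number of launched processes is the product of that value and `resources.instance_count`. The fragment below is an illustrative sketch only. The numbers are arbitrary, and the fragment would sit alongside the other top-level keys of a complete job file.

```yaml
# Fragment of a command job: 2 nodes x 4 processes per node = 8 processes in total.
resources:
  instance_count: 2
distribution:
  type: pytorch
  process_count_per_instance: 4
```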

TensorFlowConfiguration

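Key	Type	Description	Allowed values	Default value
`type`	const	Required. Distribution type.	`tensorflow`
`worker_count`	integer	The number of workers to launch for the job.		Defaults to `resources.instance_count`.
`parameter_server_count`	integer	The number of parameter servers to launch for the job.		`0`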

Job inputs

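Key	Type	Description	Allowed values	Default value
`type`	string	The type of job input. Specify `uri_file` for input data that points to a single file source, or `uri_folder` for input data that points to a folder source.	`uri_file`, `uri_folder`, `mlflow_model`, `custom_model`	`uri_folder`
`path`	string	The path to the data to use as input. It can be specified in a few ways: - A local path to the data source file or folder, for example `path: ./iris.csv`. The data is uploaded during job submission. - A URI of a cloud path to the file or folder to use as the input. Supported URI types are `azureml`, `https`, `wasbs`, `abfss`, and `adl`. For more information on how to use the `azureml://` URI format, see Core YAML syntax. - An existing registered Azure Machine Learning data asset to use as the input. To reference a registered data asset, use the `azureml:<data_name>:<data_version>` syntax or `azureml:<data_name>@latest` (to reference the latest version of that data asset), for example `path: azureml:cifar10-data:1` or `path: azureml:cifar10-data@latest`.
`mode`	string	Mode of how the data is delivered to the compute target. For read-only mount (`ro_mount`), the data is consumed as a mount path. A folder is mounted as a folder and a file is mounted as a file. Azure Machine Learning resolves the input to the mount path. For `download` mode, the data is downloaded to the compute target. Azure Machine Learning resolves the input to the downloaded path. If you want only the URL of the storage location of the data artifacts, rather than mounting or downloading the data itself, use the `direct` mode. This mode passes in the URL of the storage location as the job input. In this case, you're fully responsible for handling credentials to access the storage. The `eval_mount` and `eval_download` modes are unique to MLTable, and either mount the data as a path or download the data to the compute target. For more information, see Access data in a job.	`ro_mount`, `download`, `direct`, `eval_download`, `eval_mount`	`ro_mount`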
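
To make the difference between input modes concrete, here is a hedged sketch that declares one mounted folder input and one downloaded file input. The datastore paths are reused from the examples later in this article. The input names and the `ls`/`head` commands are illustrative assumptions.

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  ls ${{inputs.data_dir}}
  head ${{inputs.iris_csv}}
inputs:
  data_dir:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/example-data/
    mode: ro_mount    # default mode: resolves to a read-only mount path
  iris_csv:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
    mode: download    # file is copied to the compute target before the command runs
environment:
  image: library/python:latest
```

In both cases the `${{inputs.<input_name>}}` expression resolves to a local path, so the command itself doesn't need to change when the mode changes.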

Job outputs

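Key	Type	Description	Allowed values	Default value
`type`	string	The type of job output. For the default `uri_folder` type, the output corresponds to a folder.	`uri_folder`, `mlflow_model`, `custom_model`	`uri_folder`
`mode`	string	Mode of how output files are delivered to the destination storage. For read-write mount mode (`rw_mount`), the output directory is a mounted directory. For upload mode, the files written are uploaded at the end of the job.	`rw_mount`, `upload`	`rw_mount`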
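
As a sketch of a non-default output specification, the following job declares a named output explicitly and switches it to upload mode. The output name `hello_output` matches the named-output example later in this article. Choosing `upload` here is only for illustration.

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ${{outputs.hello_output}}/helloworld.txt
outputs:
  hello_output:
    type: uri_folder   # the default type, stated explicitly
    mode: upload       # files are uploaded when the job ends instead of written through a mount
environment:
  image: library/python:latest
```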

Identity configurations

UserIdentityConfiguration

Key	Type	Description	Allowed values
`type`	const	Required. Identity type.	`user_identity`

ManagedIdentityConfiguration

Key	Type	Description	Allowed values
`type`	const	Required. Identity type.	`managed` or `managed_identity`
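
An identity configuration is attached to a job through the top-level `identity` key. The sketch below assumes the submitting user has access to the datastore path shown, and that a compute target named `cpu-cluster` exists. Both are illustrative assumptions.

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: head ${{inputs.iris_csv}}
inputs:
  iris_csv:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
identity:
  type: user_identity   # data is accessed with the job submitter's identity
environment:
  image: library/python:latest
compute: azureml:cpu-cluster
```

Replacing `user_identity` with `managed` or `managed_identity` would select ManagedIdentityConfiguration instead.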

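Remarks

The `az ml job` command can be used to manage Azure Machine Learning jobs.

Examples

Examples are available in the examples GitHub repository. The following sections show several of them.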

YAML: hello world

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest

YAML: display name, experiment name, description, and tags

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
tags:
  hello: world
display_name: hello-world-example
experiment_name: hello-world-example
description: |
  # Azure Machine Learning "hello world" job

  This is a "hello world" job running in the cloud via Azure Machine Learning!

  ## Description

  Markdown is supported in the studio for job descriptions! You can edit the description there or via CLI.

YAML: environment variables

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo $hello_env_var
environment:
  image: library/python:latest
environment_variables:
  hello_env_var: "hello world"

YAML: source code

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: ls
code: src
environment:
  image: library/python:latest

YAML: literal inputs

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  echo ${{inputs.hello_string}}
  echo ${{inputs.hello_number}}
environment:
  image: library/python:latest
inputs:
  hello_string: "hello world"
  hello_number: 42

YAML: write to default outputs

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ./outputs/helloworld.txt
environment:
  image: library/python:latest

YAML: write to named data output

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ${{outputs.hello_output}}/helloworld.txt
outputs:
  hello_output:
environment:
  image: python

YAML: datastore URI file input

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  echo "--iris-csv: ${{inputs.iris_csv}}"
  python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
  iris_csv:
    type: uri_file 
    path: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest

YAML: datastore URI folder input

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  ls ${{inputs.data_dir}}
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
  data_dir:
    type: uri_folder 
    path: azureml://datastores/workspaceblobstore/paths/example-data/
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest

YAML: URI file input

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  echo "--iris-csv: ${{inputs.iris_csv}}"
  python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
  iris_csv:
    type: uri_file 
    path: https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest

YAML: URI folder input

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  ls ${{inputs.data_dir}}
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
  data_dir:
    type: uri_folder 
    path: wasbs://datasets@azuremlexamples.blob.core.chinacloudapi.cn/
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest

YAML: notebook via papermill

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  pip install ipykernel papermill
  papermill hello-notebook.ipynb outputs/out.ipynb -k python
code: src
environment:
  image: library/python:3.11.6

YAML: basic Python model training

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python main.py 
  --iris-csv ${{inputs.iris_csv}}
  --C ${{inputs.C}}
  --kernel ${{inputs.kernel}}
  --coef0 ${{inputs.coef0}}
inputs:
  iris_csv: 
    type: uri_file
    path: wasbs://datasets@azuremlexamples.blob.core.chinacloudapi.cn/iris.csv
  C: 0.8
  kernel: "rbf"
  coef0: 0.1
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest
compute: azureml:cpu-cluster
display_name: sklearn-iris-example
experiment_name: sklearn-iris-example
description: Train a scikit-learn SVM on the Iris dataset.

YAML: basic R model training with local Docker build context

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >
  Rscript train.R 
  --data_folder ${{inputs.iris}}
code: src
inputs:
  iris: 
    type: uri_file
    path: https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv
environment:
  build:
    path: docker-context
compute: azureml:cpu-cluster
display_name: r-iris-example
experiment_name: r-iris-example
description: Train an R model on the Iris dataset.

YAML: distributed PyTorch

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
    type: uri_folder
    path: azureml:cifar-10-example@latest
environment: azureml:AzureML-acpt-pytorch-1.13-cuda11.7@latest
compute: azureml:gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 1
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.

YAML: distributed TensorFlow

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --model-dir ${{inputs.model_dir}}
inputs:
  epochs: 1
  model_dir: outputs/keras-model
environment: azureml:AzureML-tensorflow-2.12-cuda11@latest
compute: azureml:gpu-cluster
resources:
  instance_count: 2
distribution:
  type: tensorflow
  worker_count: 2
display_name: tensorflow-mnist-distributed-example
experiment_name: tensorflow-mnist-distributed-example
description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via TensorFlow.

YAML: distributed MPI

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
inputs:
  epochs: 1
environment: azureml:AzureML-tensorflow-2.12-cuda11@latest
compute: azureml:gpu-cluster
resources:
  instance_count: 2
distribution:
  type: mpi
  process_count_per_instance: 1
display_name: tensorflow-mnist-distributed-horovod-example
experiment_name: tensorflow-mnist-distributed-horovod-example
description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via Horovod.

Next steps

Install and use the CLI (v2)

通过