CLI (v2) feature set specification YAML schema

Article
08/14/2024

Note

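The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed to work only with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.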

YAML syntax

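Key	Type	Description	Allowed values
$schema	string	The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of the file lets you invoke schema and resource completions.
source	object	Required. The data source for the feature set.
source.type	string	Required. The type of the data source.	mltable, csv, parquet, deltaTable
source.path	string	Required. The path of the data source. It can be a path to a single file, a path to a folder, or a path with a wildcard. Only Azure storage and the ABFS scheme are supported.
source.timestamp_column	object	Required. The timestamp column in the source data.
source.timestamp_column.name	string	Required. The name of the timestamp column in the source data.
source.timestamp_column.format	string	The format of the timestamp column. If not provided, Spark is used to infer the timestamp value.
source.source_delay	object	The source data delay.
source.source_delay.days	integer	The number of days of the source data delay.
source.source_delay.hours	integer	The number of hours of the source data delay.
source.source_delay.minutes	integer	The number of minutes of the source data delay.
feature_transformation_code	object	The folder where the transformation code definition is located.
feature_transformation_code.path	string	The relative path, within the feature set spec folder, to the transformer code folder.
feature_transformation_code.transformer_class	string	A Spark machine learning transformer class in the format `{module_name}.{transformer_class_name}`. The system expects to find a `{module_name}.py` file under `feature_transformation_code.path`. `{transformer_class_name}` is defined in that Python file.
features	list of objects	Required. The features of this feature set.
features.name	string	Required. The feature name.
features.type	string	Required. The feature data type.	string, integer, long, float, double, binary, datetime, boolean
index_columns	list of objects	Required. The index columns for the features. The source data should contain these columns.
index_columns.name	string	Required. The index column name.
index_columns.type	string	Required. The index column data type.	string, integer, long, float, double, binary, datetime, boolean
source_lookback	object	The lookback time window for the source data.
source_lookback.days	integer	The number of days of the source lookback.
source_lookback.hours	integer	The number of hours of the source lookback.
source_lookback.minutes	integer	The number of minutes of the source lookback.
temporal_join_lookback	object	The lookback time window used when performing a point-in-time join.
temporal_join_lookback.days	integer	The number of days of the temporal join lookback.
temporal_join_lookback.hours	integer	The number of hours of the temporal join lookback.
temporal_join_lookback.minutes	integer	The number of minutes of the temporal join lookback.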
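
The required keys in the table can be checked mechanically before a spec is submitted. The following is a minimal sketch, not part of the CLI or the SDK: `missing_required_keys` is a hypothetical helper, and it assumes the YAML file has already been parsed into a plain Python dict.

```python
# Hypothetical helper: checks a parsed feature set spec (a plain dict)
# for the keys that the syntax table marks as "Required".
REQUIRED_PATHS = [
    "source.type",
    "source.path",
    "source.timestamp_column.name",
]
# Each item of these list-valued keys must carry the listed sub-keys.
REQUIRED_LIST_FIELDS = {
    "features": ["name", "type"],
    "index_columns": ["name", "type"],
}


def missing_required_keys(spec: dict) -> list[str]:
    """Return the dotted names of required keys that are absent from `spec`."""
    missing = []
    for dotted in REQUIRED_PATHS:
        node = spec
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    for list_key, item_keys in REQUIRED_LIST_FIELDS.items():
        items = spec.get(list_key)
        if not items:
            missing.append(list_key)
            continue
        for i, item in enumerate(items):
            for key in item_keys:
                if key not in item:
                    missing.append(f"{list_key}[{i}].{key}")
    return missing


spec = {
    "source": {
        "type": "parquet",
        "path": "abfs://file_system@account_name.dfs.core.chinacloudapi.cn/datasources/transactions-source/*.parquet",
        "timestamp_column": {"name": "timestamp"},
    },
    "features": [{"name": "transaction_7d_count", "type": "long"}],
    "index_columns": [{"name": "accountID"}],  # "type" deliberately omitted
}
print(missing_required_keys(spec))  # ['index_columns[0].type']
```

The check covers only the presence of keys; it does not validate allowed values such as the `source.type` options.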

Examples

Examples are available in the examples GitHub repository. Several common examples are shown below.

YAML example without transformation code

$schema: http://azureml/sdk-2-0/FeatureSetSpec.json
source:
  type: deltatable
  path: abfs://{container}@{storage}.dfs.core.chinacloudapi.cn/top_folder/transactions
  timestamp_column: # name of the column representing the timestamp.
    name: timestamp
features: # schema and properties of features generated by the feature_transformation_code
  - name: accountCountry
    type: string
    description: country of the account
  - name: numPaymentRejects1dPerUser
    type: double
    description: upper limit of number of payment rejects per day on the account
  - name: accountAge
    type: double
    description: age of the account
index_columns:
  - name: accountID
    type: string

YAML example with transformation code

$schema: http://azureml/sdk-2-0/FeatureSetSpec.json

source:
  type: parquet
  path: abfs://file_system@account_name.dfs.core.chinacloudapi.cn/datasources/transactions-source/*.parquet
  timestamp_column: # name of the column representing the timestamp.
    name: timestamp
  source_delay:
    days: 0
    hours: 3
    minutes: 0
feature_transformation_code:
  path: ./code
  transformer_class: transaction_transform.TransactionFeatureTransformer
features:
  - name: transaction_7d_count
    type: long
  - name: transaction_amount_7d_sum
    type: double
  - name: transaction_amount_7d_avg
    type: double
  - name: transaction_3d_count
    type: long
  - name: transaction_amount_3d_sum
    type: double
  - name: transaction_amount_3d_avg
    type: double
index_columns:
  - name: accountID
    type: string
source_lookback:
  days: 7
  hours: 0
  minutes: 0
temporal_join_lookback:
  days: 1
  hours: 0
  minutes: 0
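
The `source_delay`, `source_lookback`, and `temporal_join_lookback` objects all share the same `days` / `hours` / `minutes` shape. As a rough illustration, the sketch below converts them to `datetime.timedelta` values. The helper `to_timedelta` is hypothetical, and the way the offsets are applied to a feature window is an assumed interpretation for illustration, not something this article specifies.

```python
from datetime import datetime, timedelta


def to_timedelta(offset: dict) -> timedelta:
    """Convert a {days, hours, minutes} mapping from the spec to a timedelta."""
    return timedelta(
        days=offset.get("days", 0),
        hours=offset.get("hours", 0),
        minutes=offset.get("minutes", 0),
    )


# Values copied from the YAML example above.
source_delay = to_timedelta({"days": 0, "hours": 3, "minutes": 0})
source_lookback = to_timedelta({"days": 7, "hours": 0, "minutes": 0})

# Example feature window (arbitrary dates chosen for illustration).
window_start = datetime(2024, 1, 10)
window_end = datetime(2024, 1, 11)

# Assumption: to compute features for [window_start, window_end), source rows
# are read starting `source_lookback` before the window start.
read_start = window_start - source_lookback

# Assumption: source data for the window is expected to be complete only
# `source_delay` after the window ends.
earliest_run = window_end + source_delay

print(read_start)    # 2024-01-03 00:00:00
print(earliest_run)  # 2024-01-11 03:00:00
```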

The feature set spec above can also be created with the azureml-featurestore SDK.

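# Import paths follow the azureml-featurestore tutorials and may differ between SDK versions.
from azureml.featurestore import create_feature_set_spec
from azureml.featurestore.contracts import (
    Column,
    ColumnType,
    DateTimeOffset,
    FeatureSource,
    SourceType,
    TimestampColumn,
    TransformationCode,
)

transactions_featureset_code_path = "<local path to the code folder>"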

transactions_featureset_spec = create_feature_set_spec(
    source = FeatureSource(
        type = SourceType.parquet,
        path = "wasbs://data@azuremlexampledata.blob.core.chinacloudapi.cn/feature-store-prp/datasources/transactions-source/*.parquet",
        timestamp_column = TimestampColumn(name = "timestamp"),
        source_delay = DateTimeOffset(days = 0, hours = 3, minutes = 0)
    ),
    transformation_code = TransformationCode(
        path = transactions_featureset_code_path,
        transformer_class = "transaction_transform.TransactionFeatureTransformer"
    ),
    index_columns = [
        Column(name = "accountID", type = ColumnType.string)
    ],
    source_lookback = DateTimeOffset(days = 7, hours = 0, minutes = 0),
    temporal_join_lookback = DateTimeOffset(days = 1, hours = 0, minutes = 0),
    infer_schema = True,
)
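
The `transformer_class` string passed to `TransformationCode` follows the `{module_name}.{transformer_class_name}` format described in the syntax table. The sketch below shows how such a string maps to the module file expected under the code path. `resolve_transformer` is a hypothetical helper written for illustration; it is not how the SDK itself loads the class.

```python
from pathlib import PurePosixPath


def resolve_transformer(code_path: str, transformer_class: str) -> tuple[PurePosixPath, str]:
    """Split '{module_name}.{transformer_class_name}' and return
    (expected module file under code_path, class name)."""
    module_name, sep, class_name = transformer_class.rpartition(".")
    if not sep or not module_name or not class_name:
        raise ValueError(
            "expected '{module_name}.{transformer_class_name}', got "
            + repr(transformer_class)
        )
    return PurePosixPath(code_path) / f"{module_name}.py", class_name


module_file, class_name = resolve_transformer(
    "./code", "transaction_transform.TransactionFeatureTransformer"
)
print(module_file)  # code/transaction_transform.py
print(class_name)   # TransactionFeatureTransformer
```

A quick check like this can catch a misspelled module or class name before the spec is registered.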
