CLI (v2) feature set specification YAML schema

APPLIES TO: Azure CLI ml extension v2 (current)

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| $schema | string | The YAML schema. If you use the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of your file enables you to invoke schema and resource completions. | | |
| source | object | Required. Data source for the feature set. | | |
| source.type | string | Required. Type of data source. | mltable, csv, parquet, deltaTable | |
| source.path | string | Required. Path of the data source. It can be a path to a single file, a folder, or a path with a wildcard. Only Azure storage and the ABFS scheme are supported. | | |
| source.timestamp_column | object | Required. Timestamp column in the source data. | | |
| source.timestamp_column.name | string | Required. The timestamp column name in the source data. | | |
| source.timestamp_column.format | string | The timestamp column format. If not provided, Spark infers the timestamp value. | | |
| source.source_delay | object | The source data delay. | | |
| source.source_delay.days | integer | The number of days of the source data delay. | | |
| source.source_delay.hours | integer | The number of hours of the source data delay. | | |
| source.source_delay.minutes | integer | The number of minutes of the source data delay. | | |
| feature_transformation_code | object | The folder where the transformation code definition is located. | | |
| feature_transformation_code.path | string | The relative path, within the feature set spec folder, to the transformer code folder. | | |
| feature_transformation_code.transformer_class | string | A Spark machine learning transformer class, in the format {module_name}.{transformer_class_name}. The system expects to find a {module_name}.py file under feature_transformation_code.path. The {transformer_class_name} class is defined in this Python file. | | |
| features | list of object | Required. The features for this feature set. | | |
| features.name | string | Required. The feature name. | | |
| features.type | string | Required. The feature data type. | string, integer, long, float, double, binary, datetime, boolean | |
| index_columns | list of object | Required. The index columns for the features. The source data should contain these columns. | | |
| index_columns.name | string | Required. The index column name. | | |
| index_columns.type | string | Required. The index column data type. | string, integer, long, float, double, binary, datetime, boolean | |
| source_lookback | object | The lookback time window for the source data. | | |
| source_lookback.days | integer | The number of days of the source lookback. | | |
| source_lookback.hours | integer | The number of hours of the source lookback. | | |
| source_lookback.minutes | integer | The number of minutes of the source lookback. | | |
| temporal_join_lookback | object | The lookback time window used for a point-in-time join. | | |
| temporal_join_lookback.days | integer | The number of days of the temporal join lookback. | | |
| temporal_join_lookback.hours | integer | The number of hours of the temporal join lookback. | | |
| temporal_join_lookback.minutes | integer | The number of minutes of the temporal join lookback. | | |
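
The source_delay, source_lookback, and temporal_join_lookback objects all share the same shape: integer days, hours, and minutes fields. As a rough illustration of how such an object adds up to a single duration, the following Python sketch converts one into a datetime.timedelta. The helper name offset_to_timedelta is hypothetical, and treating a missing field as 0 is an assumption of this sketch, not something the schema states.

```python
from datetime import timedelta


def offset_to_timedelta(offset: dict) -> timedelta:
    """Convert a {days, hours, minutes} mapping into one timedelta.

    Assumption of this sketch: a missing field counts as 0.
    """
    allowed = {"days", "hours", "minutes"}
    unknown = set(offset) - allowed
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    return timedelta(
        days=offset.get("days", 0),
        hours=offset.get("hours", 0),
        minutes=offset.get("minutes", 0),
    )


# Values taken from the transformation-code example in this article.
source_delay = offset_to_timedelta({"days": 0, "hours": 3, "minutes": 0})
source_lookback = offset_to_timedelta({"days": 7, "hours": 0, "minutes": 0})
print(source_delay)     # 3:00:00
print(source_lookback)  # 7 days, 0:00:00
```

Rejecting unknown keys keeps a typo such as "hour" from being silently ignored.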

Examples

Examples are available in the examples GitHub repository. Some common examples are shown below.

YAML Example without Transformation Code

$schema: http://azureml/sdk-2-0/FeatureSetSpec.json
source:
  type: deltatable
  path: abfs://{container}@{storage}.dfs.core.chinacloudapi.cn/top_folder/transactions
  timestamp_column: # name of the column representing the timestamp.
    name: timestamp
features: # schema and properties of the features in this feature set
  - name: accountCountry
    type: string
    description: country of the account
  - name: numPaymentRejects1dPerUser
    type: double
    description: upper limit of number of payment rejects per day on the account
  - name: accountAge
    type: double
    description: age of the account
index_columns:
  - name: accountID
    type: string
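
Before submitting a spec like the one above, it can help to confirm that every key marked Required in the syntax table is present. The following Python sketch does that against a plain dictionary that mirrors the YAML (for example, the result of loading the file with a YAML parser). The function name missing_required_keys is hypothetical, and this check is only an illustration; it is not the validation that the CLI or service performs.

```python
from typing import List

# Keys marked "Required" in the syntax table of this article.
REQUIRED_KEYS = ["source.type", "source.path", "source.timestamp_column.name"]
REQUIRED_LIST_ITEM_KEYS = {
    "features": ["name", "type"],
    "index_columns": ["name", "type"],
}


def missing_required_keys(spec: dict) -> List[str]:
    """Return the dotted names of required keys that the spec lacks."""
    missing = []
    for dotted in REQUIRED_KEYS:
        node = spec
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    for list_key, item_keys in REQUIRED_LIST_ITEM_KEYS.items():
        items = spec.get(list_key)
        if not items:
            missing.append(list_key)
            continue
        for i, item in enumerate(items):
            for key in item_keys:
                if key not in item:
                    missing.append(f"{list_key}[{i}].{key}")
    return missing


# Dictionary form of (part of) the example above.
spec = {
    "source": {
        "type": "deltatable",
        "path": "abfs://{container}@{storage}.dfs.core.chinacloudapi.cn/top_folder/transactions",
        "timestamp_column": {"name": "timestamp"},
    },
    "features": [{"name": "accountCountry", "type": "string"}],
    "index_columns": [{"name": "accountID", "type": "string"}],
}
print(missing_required_keys(spec))  # []

# Dropping the timestamp column is reported by its dotted name.
incomplete = {**spec, "source": {"type": "deltatable", "path": spec["source"]["path"]}}
print(missing_required_keys(incomplete))  # ['source.timestamp_column.name']
```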

YAML Example with Transformation Code

$schema: http://azureml/sdk-2-0/FeatureSetSpec.json

source:
  type: parquet
  path: abfs://file_system@account_name.dfs.core.chinacloudapi.cn/datasources/transactions-source/*.parquet
  timestamp_column: # name of the column representing the timestamp.
    name: timestamp
  source_delay:
    days: 0
    hours: 3
    minutes: 0
feature_transformation_code:
  path: ./code
  transformer_class: transaction_transform.TransactionFeatureTransformer
features:
  - name: transaction_7d_count
    type: long
  - name: transaction_amount_7d_sum
    type: double
  - name: transaction_amount_7d_avg
    type: double
  - name: transaction_3d_count
    type: long
  - name: transaction_amount_3d_sum
    type: double
  - name: transaction_amount_3d_avg
    type: double
index_columns:
  - name: accountID
    type: string
source_lookback:
  days: 7
  hours: 0
  minutes: 0
temporal_join_lookback:
  days: 1
  hours: 0
  minutes: 0

The feature set spec above can also be created by using the azureml-featurestore SDK.

# Imports as used in the azureml-featurestore samples; module paths may
# differ between package versions.
from azureml.featurestore import create_feature_set_spec
from azureml.featurestore.contracts import (
    Column,
    ColumnType,
    DateTimeOffset,
    FeatureSource,
    SourceType,
    TimestampColumn,
    TransformationCode,
)

transactions_featureset_code_path = "<local path to the code folder>"

transactions_featureset_spec = create_feature_set_spec(
    source=FeatureSource(
        type=SourceType.parquet,
        path="wasbs://data@azuremlexampledata.blob.core.chinacloudapi.cn/feature-store-prp/datasources/transactions-source/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=3, minutes=0),
    ),
    transformation_code=TransformationCode(
        path=transactions_featureset_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    # Infer the feature schema from the data instead of listing features.
    infer_schema=True,
)
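
The transformer_class value follows the {module_name}.{transformer_class_name} format, and the module file is expected under feature_transformation_code.path. The following Python sketch shows how that naming convention maps to a file on disk. The helper locate_transformer and the "spec" folder name are hypothetical and only illustrate the convention; this is not how the service itself loads the class.

```python
from pathlib import Path
from typing import Tuple


def locate_transformer(
    spec_folder: str, code_path: str, transformer_class: str
) -> Tuple[Path, str]:
    """Split '{module_name}.{transformer_class_name}' and return the
    expected module file together with the class name."""
    parts = transformer_class.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            "transformer_class must look like "
            "'{module_name}.{transformer_class_name}'"
        )
    module_name, class_name = parts
    # code_path is relative to the feature set spec folder.
    module_file = Path(spec_folder) / code_path / f"{module_name}.py"
    return module_file, class_name


module_file, class_name = locate_transformer(
    "spec",  # hypothetical spec folder
    "./code",
    "transaction_transform.TransactionFeatureTransformer",
)
print(module_file.as_posix())  # spec/code/transaction_transform.py
print(class_name)              # TransactionFeatureTransformer
```

Calling module_file.exists() before registering the feature set is a cheap way to catch a misspelled module or class name early.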

Next steps