CLI (v2) MLtable YAML schema

APPLIES TO: Azure CLI ml extension v2 (current)

You can find the source JSON schema at https://azuremlschemas.azureedge.net/latest/MLTable.schema.json.

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

How to author `MLTable` files

This article presents information about the MLTable YAML schema only. For more information about MLTable, including

MLTable file authoring
MLTable artifacts creation
consumption in Pandas and Spark
end-to-end examples

visit Working with tables in Azure Machine Learning.

YAML syntax

Key	Type	Description	Allowed values	Default value
`$schema`	string	The YAML schema. If you use the Azure Machine Learning Visual Studio Code extension to author the YAML file, you can invoke schema and resource completions if you include `$schema` at the top of your file
`type`	const	`mltable` abstracts the schema definition for tabular data. Data consumers can more easily materialize the table into a Pandas/Dask/Spark dataframe	`mltable`	`mltable`
`paths`	array	Paths can be a `file` path, `folder` path, or `pattern` for paths. `pattern` supports globbing patterns that specify sets of filenames with wildcard characters (`*`, `?`, `[abc]`, `[a-z]`). Supported URI types: `azureml`, `https`, `wasbs`, `abfss`, and `adl`. Visit Core yaml syntax for more information about use of the `azureml://` URI format	`file` `folder` `pattern`
`transformations`	array	A defined transformation sequence, applied to data loaded from defined paths. Visit Transformations for more information	`read_delimited` `read_parquet` `read_json_lines` `read_delta_lake` `take` `take_random_sample` `drop_columns` `keep_columns` `convert_column_types` `skip` `filter` `extract_columns_from_partition_format`

Transformations

Read transformations

Read Transformation	Description	Parameters
`read_delimited`	Adds a transformation step to read the delimited text file(s) provided in `paths`	`infer_column_types`: Boolean to infer column data types. Defaults to True. Type inference requires that the current compute can access the data source. Currently, type inference only pulls the first 200 rows. `encoding`: Specify the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Default encoding: `utf8`. `header`: the user can choose one of these options: `no_header`, `from_first_file`, `all_files_different_headers`, `all_files_same_headers`. Defaults to `all_files_same_headers`. `delimiter`: The separator that splits the columns. `empty_as_string`: Specifies if empty field values should load as empty strings. The default value (False) reads empty field values as nulls. Passing this setting as True reads empty field values as empty strings. For values converted to numeric or datetime data types, this setting has no effect, because empty values are converted to nulls. `include_path_column`: Boolean to keep path information as column in the table. Defaults to False. This setting helps when reading multiple files, and you want to know the originating file for a specific record. Additionally, you can keep useful information in the file path. `support_multi_line`: By default (`support_multi_line=False`), all line breaks, including line breaks in quoted field values, are interpreted as a record break. This approach to data reading increases speed, and it offers optimization for parallel execution on multiple CPU cores. However, it might result in silent production of more records with misaligned field values. Set this value to `True` when the delimited files are known to contain quoted line breaks
`read_parquet`	Adds a transformation step to read the Parquet formatted file(s) provided in `paths`	`include_path_column`: Boolean to keep the path information as a table column. Defaults to False. This setting helps when you read multiple files, and you want to know the originating file for a specific record. Additionally, you can keep useful information in the file path. NOTE: MLTable only supports reads of parquet files that have columns consisting of primitive types. Columns containing arrays are not supported
`read_delta_lake`	Adds a transformation step to read a Delta Lake folder provided in `paths`. You can read the data at a particular timestamp or version	`timestamp_as_of`: String. Timestamp to be specified for time-travel on the specific Delta Lake data. To read data at a specific point in time, the datetime string should have an RFC-3339/ISO-8601 format (for example: "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00"). `version_as_of`: Integer. Version to be specified for time-travel on the specific Delta Lake data. You must provide one value of `timestamp_as_of` or `version_as_of`
`read_json_lines`	Adds a transformation step to read the json file(s) provided in `paths`	`include_path_column`: Boolean to keep path information as an MLTable column. Defaults to False. This setting helps when you read multiple files, and you want to know the originating file for a specific record. Additionally, you can keep useful information in the file path `invalid_lines`: Determines how to handle lines that have invalid JSON. Supported values: `error` and `drop`. Defaults to `error` `encoding`: Specify the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Defaults to `utf8`

Other transformations

Transformation	Description	Parameters	Example(s)
`convert_column_types`	Adds a transformation step to convert the specified columns into their respective specified new types	`columns` An array of column names to convert `column_type` The type into which you want to convert (`int`, `float`, `string`, `boolean`, `datetime`)	`- convert_column_types: - columns: [Age] column_type: int` Convert the Age column to integer. `- convert_column_types: - columns: date column_type: datetime: formats: - "%d/%m/%Y"` Convert the date column to the format `dd/mm/yyyy`. Read `to_datetime` for more information about datetime conversion. `- convert_column_types: - columns: [is_weekday] column_type: boolean: true_values:['yes', 'true', '1'] false_values:['no', 'false', '0']` Convert the is_weekday column to a boolean; yes/true/1 values in the column map to `True`, and no/false/0 values in the column map to `False`. Read `to_bool` for more information about boolean conversion
`drop_columns`	Adds a transformation step to remove specific columns from the dataset	An array of column names to drop	`- drop_columns: ["col1", "col2"]`
`keep_columns`	Adds a transformation step to keep the specified columns, and remove all others from the dataset	An array of column names to preserve	`- keep_columns: ["col1", "col2"]`
`extract_columns_from_partition_format`	Adds a transformation step to use the partition information of each path, and then extract them into columns based on the specified partition format.	partition format to use	`- extract_columns_from_partition_format: {column_name:yyyy/MM/dd/HH/mm/ss}` creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract year, month, day, hour, minute, and second values for the datetime type
`filter`	Filter the data, leaving only the records that match the specified expression.	An expression as a string	`- filter: col('temperature') > 32 and col('location') == 'UK'` Only leave rows where the temperature exceeds 32, and UK is the location
`skip`	Adds a transformation step to skip the first count rows of this MLTable.	A count of the number of rows to skip	`- skip: 10` Skip first 10 rows
`take`	Adds a transformation step to select the first count rows of this MLTable.	A count of the number of rows from the top of the table to take	`- take: 5` Take the first five rows.
`take_random_sample`	Adds a transformation step to randomly select each row of this MLTable, with probability chance.	`probability` The probability of selecting an individual row. Must be in the range [0,1]. `seed` Optional random seed	`- take_random_sample: probability: 0.10 seed:123` Take a 10 percent random sample of rows using a random seed of 123

Examples

Examples of MLTable use. Find more examples at:

Quickstart

This quickstart reads the famous iris dataset from a public https server. To proceed, you must place the MLTable files in a folder. First, create the folder and MLTable file with:

mkdir ./iris
cd ./iris
touch ./MLTable

Next, place this content in the MLTable file:

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json

type: mltable
paths:
    - file: https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv

transformations:
    - read_delimited:
        delimiter: ','
        header: all_files_same_headers
        include_path_column: true

You can then materialize into Pandas with:

Important

You must have the mltable Python SDK installed. Install this SDK with:

pip install mltable.

import mltable

tbl = mltable.load("./iris")
df = tbl.to_pandas_dataframe()

Ensure that the data includes a new column named Path. This column contains the https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv data path.

The CLI can create a data asset:

az ml data create --name iris-from-https --version 1 --type mltable --path ./iris

The folder containing the MLTable automatically uploads to cloud storage (the default Azure Machine Learning datastore).

Tip

An Azure Machine Learning data asset is similar to web browser bookmarks (favorites). Instead of remembering long URIs (storage paths) that point to your most frequently-used data, you can create a data asset, and then access that asset with a friendly name.

Delimited text files

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - file: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/ # a specific file on ADLS
  # additional options
  # - folder: ./<folder> a specific folder
  # - pattern: ./*.csv # glob all the csv files in a folder

transformations:
    - read_delimited:
        encoding: ascii
        header: all_files_same_headers
        delimiter: ","
        include_path_column: true
        empty_as_string: false
    - keep_columns: [col1, col2, col3, col4, col5, col6, col7]
    # or you can drop_columns...
    # - drop_columns: [col1, col2, col3, col4, col5, col6, col7]
    - convert_column_types:
        - columns: col1
          column_type: int
        - columns: col2
          column_type:
            datetime:
                formats:
                    - "%d/%m/%Y"
        - columns: [col1, col2, col3] 
          column_type:
            boolean:
                mismatch_as: error
                true_values: ["yes", "true", "1"]
                false_values: ["no", "false", "0"]
      - filter: col('col1') > 32 and col('col7') == 'a_string'
      # create a column called timestamp with the values extracted from the folder information
      - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
      - skip: 10
      - take_random_sample:
          probability: 0.50
          seed: 1394
      # or you can take the first n records
      # - take: 200

Parquet

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - pattern: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>/*.parquet
  
transformations:
  - read_parquet:
        include_path_column: false
  - filter: col('temperature') > 32 and col('location') == 'UK'
  - skip: 1000 # skip first 1000 rows
  # create a column called timestamp with the values extracted from the folder information
  - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}

Delta Lake

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
- folder: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/

# NOTE: for read_delta_lake, you are *required* to provide either
# timestamp_as_of OR version_as_of.
# timestamp should be in RFC-3339/ISO-8601 format (for example:
# "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00",
# "2022-10-01T01:30:00-08:00")
# To get the latest, set the timestamp_as_of at a future point (for example: '2999-08-26T00:00:00Z')

transformations:
 - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'
      # alternative:
      # version_as_of: 1