CLI (v2) MLTable YAML schema

Article
04/08/2024

You can find the source JSON schema at https://azuremlschemas.azureedge.net/latest/MLTable.schema.json.

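Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed to work only with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.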

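How to author `MLTable` files

This article covers only the MLTable YAML schema. For more information about MLTable, including

MLTable file authoring
MLTable artifact creation
Consumption in Pandas and Spark
End-to-end examples

visit Working with tables in Azure Machine Learning.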

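YAML syntax

Key	Type	Description	Allowed values	Default value
`$schema`	string	The YAML schema. If you use the Azure Machine Learning Visual Studio Code extension to author the YAML file, you can invoke schema and resource completions if you include `$schema` at the top of the file.
`type`	const	`mltable` abstracts the schema definition for tabular data. Data consumers can more easily materialize the table into a Pandas/Dask/Spark dataframe	`mltable`	`mltable`
`paths`	array	Paths can be a `file` path, a `folder` path, or a `pattern` for paths. `pattern` supports globbing patterns that specify sets of file names with wildcard characters (`*`, `?`, `[abc]`, `[a-z]`). Supported URI types: `azureml`, `https`, `wasbs`, `abfss`, and `adl`. For more information about the `azureml://` URI format, visit Core yaml syntax	`file` `folder` `pattern`
`transformations`	array	A defined transformation sequence, applied to data loaded from the defined paths. For more information, see Transformations	`read_delimited` `read_parquet` `read_json_lines` `read_delta_lake` `take` `take_random_sample` `drop_columns` `keep_columns` `convert_column_types` `skip` `filter` `extract_columns_from_partition_format`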
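
To get a feel for how the wildcard characters in a `pattern` path select file names, the following minimal sketch uses Python's standard `fnmatch` module as a stand-in. It is an approximation only: the file names are hypothetical, the helper `select` is not part of any MLTable API, and MLTable's own glob matching may differ in details such as how path separators are handled.

```python
from fnmatch import fnmatchcase

# Hypothetical file names in a folder; not taken from any real dataset.
FILES = ["data1.csv", "data2.csv", "dataA.csv", "data10.csv", "notes.txt"]


def select(pattern: str, names=FILES) -> list:
    """Return the names that match a glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # "*" matches any run of characters, "?" exactly one character,
    # "[12]" one character from the set, "[A-Z]" one character in the range.
    for pattern in ["*.csv", "data?.csv", "data[12].csv", "data[A-Z].csv"]:
        print(pattern, "->", select(pattern))
```

Note that `data?.csv` does not select `data10.csv`, because `?` stands for exactly one character.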

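Transformations

Read transformations

Read transformation	Description	Parameters
`read_delimited`	Adds a transformation step to read the delimited text files provided in `paths`	`infer_column_types`: Boolean to infer column data types. Defaults to True. Type inference requires that the current compute can access the data source. Currently, type inference pulls only the first 200 rows. `encoding`: Specifies the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Default encoding: `utf8`. `header`: One of these options: `no_header`, `from_first_file`, `all_files_different_headers`, `all_files_same_headers`. Defaults to `all_files_same_headers`. `delimiter`: The separator that splits the columns. `empty_as_string`: Specifies whether empty field values should load as empty strings. The default value (False) reads empty field values as nulls. Setting it to True reads empty field values as empty strings. For values converted to numeric or datetime data types, this setting has no effect, because empty values are converted to nulls. `include_path_column`: Boolean to keep path information as a column in the table. Defaults to False. This setting helps when you read multiple files and want to know the originating file of a specific record. You can also keep useful information in the file path. `support_multi_line`: By default (`support_multi_line=False`), all line breaks, including line breaks in quoted field values, are interpreted as record breaks. This approach to reading data increases speed and optimizes parallel execution on multiple CPU cores. However, it might silently produce more records with misaligned field values. Set this value to `True` when the delimited files are known to contain quoted line breaks.
`read_parquet`	Adds a transformation step to read the Parquet formatted files provided in `paths`	`include_path_column`: Boolean to keep path information as a table column. Defaults to False. This setting helps when you read multiple files and want to know the originating file of a specific record. You can also keep useful information in the file path. NOTE: MLTable supports reading only Parquet files whose columns consist of primitive types. Columns that contain arrays are not supported.
`read_delta_lake`	Adds a transformation step to read a Delta Lake folder provided in `paths`. You can read the data at a particular timestamp or version	`timestamp_as_of`: String. The timestamp to use for time travel on the specified Delta Lake data. To read data at a specific point in time, the datetime string should have an RFC-3339/ISO-8601 format (for example: "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00", "2022-10-01T01:30:00-08:00"). `version_as_of`: Integer. The version to use for time travel on the specified Delta Lake data. You must provide either `timestamp_as_of` or `version_as_of`
`read_json_lines`	Adds a transformation step to read the JSON files provided in `paths`	`include_path_column`: Boolean to keep path information as an MLTable column. Defaults to False. This setting helps when you read multiple files and want to know the originating file of a specific record. You can also keep useful information in the file path. `invalid_lines`: Determines how to handle lines that contain invalid JSON. Supported values: `error` and `drop`. Defaults to `error`. `encoding`: Specifies the file encoding. Supported encodings: `utf8`, `iso88591`, `latin1`, `ascii`, `utf16`, `utf32`, `utf8bom`, and `windows1252`. Defaults to `utf8`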
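
The effect of `support_multi_line` is easier to see on a small input. The sketch below is a plain-Python illustration, not MLTable's reader: the sample text and both helper functions are hypothetical. The first helper treats every line break as a record break, which mirrors the default behavior described above. The second uses the standard `csv` module, which keeps a quoted line break inside its field.

```python
import csv
import io

# Hypothetical delimited text: the record with id 1 has a quoted line break.
RAW = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'


def split_on_every_newline(text: str) -> list:
    """Treat every line break as a record break (like support_multi_line=False)."""
    return [line.split(",") for line in text.splitlines()]


def split_respecting_quotes(text: str) -> list:
    """Keep quoted line breaks inside the field (like support_multi_line=True)."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print(split_on_every_newline(RAW))   # one extra, misaligned record
    print(split_respecting_quotes(RAW))  # header plus two records
```

With the naive split, the quoted value is cut in two and the second half becomes a record with a single field, which is the kind of silent misalignment the table warns about.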

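Other transformations

Transformation	Description	Parameters	Examples
`convert_column_types`	Adds a transformation step to convert the specified columns into their respective new types	`columns` An array of column names to convert. `column_type` The type to convert to (`int`, `float`, `string`, `boolean`, `datetime`)	`- convert_column_types: - columns: [Age] column_type: int` Converts the Age column to integer. `- convert_column_types: - columns: date column_type: datetime: formats: - "%d/%m/%Y"` Converts the date column, reading values in the format `dd/mm/yyyy`. For more information about datetime conversion, see `to_datetime`. `- convert_column_types: - columns: [is_weekday] column_type: boolean: true_values: ['yes', 'true', '1'] false_values: ['no', 'false', '0']` Converts the is_weekday column to a boolean; yes/true/1 values in the column map to `True`, and no/false/0 values map to `False`. For more information about boolean conversion, see `to_bool`
`drop_columns`	Adds a transformation step to remove specific columns from the dataset	An array of column names to drop	`- drop_columns: ["col1", "col2"]`
`keep_columns`	Adds a transformation step to keep the specified columns and remove all others from the dataset	An array of column names to keep	`- keep_columns: ["col1", "col2"]`
`extract_columns_from_partition_format`	Adds a transformation step that uses the partition information of each path and extracts it into columns, based on the specified partition format.	The partition format to use	`- extract_columns_from_partition_format: {column_name:yyyy/MM/dd/HH/mm/ss}` Creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm', and 'ss' extract the year, month, day, hour, minute, and second values of the datetime type
`filter`	Filters the data, keeping only the records that match the specified expression.	An expression as a string	`- filter: 'col("temperature") > 32 and col("location") == "UK"'` Keeps only the rows where the temperature exceeds 32 and the location is UK
`skip`	Adds a transformation step to skip the first `count` rows of this MLTable.	The number of rows to skip	`- skip: 10` Skips the first 10 rows
`take`	Adds a transformation step to select the first `count` rows of this MLTable.	The number of rows to take from the top of the table	`- take: 5` Takes the first five rows.
`take_random_sample`	Adds a transformation step that randomly selects each row of this MLTable with the given probability.	`probability` The probability of selecting an individual row. Must be in the range [0,1]. `seed` Optional random seed	`- take_random_sample: probability: 0.10 seed: 123` Takes a 10 percent random sample of rows, using a random seed of 123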
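
The `convert_column_types` examples in the table can be read as per-value conversion rules. The following minimal sketch shows those rules in plain Python. It assumes that the `formats` entries follow Python `strptime` conventions (which `"%d/%m/%Y"` suggests) and that a value outside both lists is treated as an error; the function names `parse_datetime` and `parse_bool` are hypothetical and are not the MLTable `to_datetime` and `to_bool` helpers.

```python
from datetime import datetime


def parse_datetime(value: str, formats: list) -> datetime:
    """Try each strptime-style format in order; fail if none matches."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"{value!r} matches none of {formats}")


def parse_bool(value: str, true_values: list, false_values: list) -> bool:
    """Map listed strings to True or False; anything else is an error."""
    if value in true_values:
        return True
    if value in false_values:
        return False
    # Assumption: an unlisted value is an error, not a null.
    raise ValueError(f"{value!r} is not a listed true or false value")


if __name__ == "__main__":
    print(parse_datetime("26/08/2022", ["%d/%m/%Y"]))
    print(parse_bool("yes", ["yes", "true", "1"], ["no", "false", "0"]))
```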

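Examples

Examples of MLTable use follow.

Quickstart

This quickstart reads the well-known iris dataset from a public https server. To proceed, you must place the MLTable file in a folder. First, create the folder and the MLTable file with these commands: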

mkdir ./iris
cd ./iris
touch ./MLTable

Next, place this content in the MLTable file:

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json

type: mltable
paths:
    - file: https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv

transformations:
    - read_delimited:
        delimiter: ','
        header: all_files_same_headers
        include_path_column: true
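
If you prefer to stay in Python rather than use shell commands, the same folder and file can be created with the standard library. This is only a convenience sketch: the helper name `write_mltable` is hypothetical, and the content it writes is the quickstart YAML shown above.

```python
from pathlib import Path

# The quickstart MLTable content, copied from the YAML above.
MLTABLE_CONTENT = """\
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json

type: mltable
paths:
    - file: https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv

transformations:
    - read_delimited:
        delimiter: ','
        header: all_files_same_headers
        include_path_column: true
"""


def write_mltable(folder: str = "./iris") -> Path:
    """Create the folder if needed and write the MLTable file into it."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    mltable_file = target / "MLTable"  # the file name has no extension
    mltable_file.write_text(MLTABLE_CONTENT, encoding="utf-8")
    return mltable_file


if __name__ == "__main__":
    print(write_mltable())
```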

You can then materialize the table into Pandas with:

Important

You must have the mltable Python SDK installed. Install this SDK with:

pip install mltable

import mltable

tbl = mltable.load("./iris")
df = tbl.to_pandas_dataframe()

Verify that the data includes a new column named Path. This column contains the https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv data path.

The CLI can create a data asset:

az ml data create --name iris-from-https --version 1 --type mltable --path ./iris

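The folder that contains the MLTable file uploads automatically to cloud storage (the default Azure Machine Learning datastore).

Tip

An Azure Machine Learning data asset is similar to a web browser bookmark (favorite). Instead of remembering long URIs (storage paths) that point to your most frequently used data, you can create a data asset and then access that asset with a friendly name.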

Delimited text files

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - file: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/ # a specific file on ADLS
  # additional options
  # - folder: ./<folder> # a specific folder
  # - pattern: ./*.csv # glob all the csv files in a folder

transformations:
    - read_delimited:
        encoding: ascii
        header: all_files_same_headers
        delimiter: ","
        include_path_column: true
        empty_as_string: false
    - keep_columns: [col1, col2, col3, col4, col5, col6, col7]
    # or you can drop_columns...
    # - drop_columns: [col1, col2, col3, col4, col5, col6, col7]
    - convert_column_types:
        - columns: col1
          column_type: int
        - columns: col2
          column_type:
            datetime:
                formats:
                    - "%d/%m/%Y"
        - columns: [col1, col2, col3] 
          column_type:
            boolean:
                mismatch_as: error
                true_values: ["yes", "true", "1"]
                false_values: ["no", "false", "0"]
    - filter: 'col("col1") > 32 and col("col7") == "a_string"'
    # create a column called timestamp with the values extracted from the folder information
    - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
    - skip: 10
    - take_random_sample:
        probability: 0.50
        seed: 1394
    # or you can take the first n records
    # - take: 200
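
The row-selection steps at the end of this file (`skip`, `take_random_sample`, and the commented-out `take`) apply in the order listed. The sketch below mimics that behavior on an in-memory sequence. It is illustrative only: the three generator functions are hypothetical stand-ins, and because MLTable uses its own random number generator, the same seed will not select the same rows as this code.

```python
import random
from typing import Iterable, Iterator, Optional


def skip(rows: Iterable, count: int) -> Iterator:
    """Yield every row after the first `count` rows."""
    for index, row in enumerate(rows):
        if index >= count:
            yield row


def take(rows: Iterable, count: int) -> Iterator:
    """Yield only the first `count` rows."""
    for index, row in enumerate(rows):
        if index >= count:
            break
        yield row


def take_random_sample(rows: Iterable, probability: float,
                       seed: Optional[int] = None) -> Iterator:
    """Keep each row independently with the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in the range [0, 1]")
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    for row in rows:
        if rng.random() < probability:
            yield row


if __name__ == "__main__":
    rows = range(100)
    sample = list(take_random_sample(skip(rows, 10), probability=0.50, seed=1394))
    print(len(sample), "rows kept out of 90")
```

Because each row is kept independently, the sample size varies around `probability` times the row count rather than matching it exactly.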

Parquet

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
  - pattern: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>/*.parquet
  
transformations:
  - read_parquet:
        include_path_column: false
  - filter: 'col("temperature") > 32 and col("location") == "UK"'
  - skip: 1000 # skip first 1000 rows
  # create a column called timestamp with the values extracted from the folder information
  - extract_columns_from_partition_format: {timestamp:yyyy/MM/dd}
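
To illustrate what a date partition format such as `yyyy/MM/dd` pulls out of a path, here is a minimal sketch in plain Python. It covers only the date tokens named in this article ('yyyy', 'MM', 'dd', 'HH', 'mm', 'ss'), assumes zero-padded fields of fixed width, and uses a hypothetical path. The function `extract_partition_datetime` is not part of MLTable, and MLTable's real partition parsing may behave differently.

```python
import re
from datetime import datetime

# Map each date token to a named regular-expression group.
_TOKENS = {
    "yyyy": r"(?P<year>\d{4})",
    "MM": r"(?P<month>\d{2})",
    "dd": r"(?P<day>\d{2})",
    "HH": r"(?P<hour>\d{2})",
    "mm": r"(?P<minute>\d{2})",
    "ss": r"(?P<second>\d{2})",
}


def extract_partition_datetime(path: str, date_format: str) -> datetime:
    """Find the first segment of `path` that matches `date_format`."""
    # Escape literal characters first, then swap tokens for regex groups.
    pattern = re.sub(
        "yyyy|MM|dd|HH|mm|ss",
        lambda match: _TOKENS[match.group(0)],
        re.escape(date_format),
    )
    found = re.search(pattern, path)
    if found is None:
        raise ValueError(f"{path!r} has no segment matching {date_format!r}")
    parts = {name: int(value) for name, value in found.groupdict().items()}
    # The format must include at least the year, month, and day tokens.
    return datetime(**parts)


if __name__ == "__main__":
    # Hypothetical partitioned path.
    print(extract_partition_datetime("sales/2022/08/26/part-0.parquet", "yyyy/MM/dd"))
```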

Delta Lake

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
type: mltable

# Supported paths include:
# local: ./<path>
# blob: wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path>
# Public http(s) server: https://<url>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://subscriptions/<subid>/resourcegroups/<rg>/workspaces/<ws>/datastores/<datastore_name>/paths/<path>

paths:
- folder: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/

# NOTE: for read_delta_lake, you are *required* to provide either
# timestamp_as_of OR version_as_of.
# timestamp should be in RFC-3339/ISO-8601 format (for example:
# "2022-10-01T00:00:00Z", "2022-10-01T00:00:00+08:00",
# "2022-10-01T01:30:00-08:00")
# To get the latest, set the timestamp_as_of at a future point (for example: '2999-08-26T00:00:00Z')

transformations:
 - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'
      # alternative:
      # version_as_of: 1
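
Before putting a value into `timestamp_as_of`, it can help to check that the string parses as an RFC-3339/ISO-8601 timestamp with an explicit offset. The sketch below does that with the standard library. The helper name `parse_rfc3339` is hypothetical, and rejecting a timestamp that has no offset is an assumption based on the examples above, not a documented MLTable rule.

```python
from datetime import datetime, timezone


def parse_rfc3339(text: str) -> datetime:
    """Parse a timestamp such as 2022-10-01T00:00:00Z or ...+08:00."""
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is ambiguous, so reject it.
        raise ValueError("timestamp needs a UTC offset or a trailing 'Z'")
    return parsed


if __name__ == "__main__":
    for text in ["2022-10-01T00:00:00Z",
                 "2022-10-01T00:00:00+08:00",
                 "2022-10-01T01:30:00-08:00"]:
        # Show each example as the equivalent UTC instant.
        print(text, "->", parse_rfc3339(text).astimezone(timezone.utc).isoformat())
```

Converting to UTC shows that the three example strings name three different instants, so the offset changes which table snapshot a time-travel read would target.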

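Important

Limitation: mltable doesn't support partition key extraction when reading data from Delta Lake. The mltable transformation extract_columns_from_partition_format doesn't work when you read Delta Lake data via mltable.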

JSON

$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
paths:
  - file: ./order_invalid.jsonl
transformations:
  - read_json_lines:
        encoding: utf8
        invalid_lines: drop
        include_path_column: false
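
The difference between `invalid_lines: drop` and `invalid_lines: error` can be sketched with the standard `json` module. This is an illustration of the documented options, not MLTable's reader: the function below is a hypothetical stand-in, the sample text is made up, and skipping blank lines is an assumption.

```python
import json


def read_json_lines(text: str, invalid_lines: str = "error") -> list:
    """Parse one JSON value per line, dropping or rejecting invalid lines."""
    if invalid_lines not in ("error", "drop"):
        raise ValueError("invalid_lines must be 'error' or 'drop'")
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            if invalid_lines == "error":
                raise ValueError(f"line {number} is not valid JSON") from exc
            # invalid_lines == "drop": ignore the line and keep going
    return records


if __name__ == "__main__":
    # Hypothetical content: the second line is missing its closing brace.
    sample = '{"id": 1}\n{"id": 2\n{"id": 3}\n'
    print(read_json_lines(sample, invalid_lines="drop"))
```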
