Create data assets

APPLIES TO: Python SDK azure-ai-ml v2 (current)

APPLIES TO: Azure CLI ml extension v2 (current)

In this article, you learn how to create a Data asset in Azure Machine Learning. By creating a Data asset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost and don't risk the integrity of your data sources. You can create Data assets from datastores, Azure Storage, public URLs, and local files.

The benefits of creating Data assets are:

  • You can share and reuse data with other members of the team, so they don't need to remember file locations.

  • You can seamlessly access data during model training (on any supported compute type) without worrying about connection strings or data paths.

  • You can version the data.

Prerequisites

To create and work with Data assets, you need:

Supported paths

When you create a data asset in Azure Machine Learning, you need to specify a path parameter that points to the data's location. The following table shows the data locations supported in Azure Machine Learning, with an example path for each:

| Location                          | Examples                                                                          |
|-----------------------------------|-----------------------------------------------------------------------------------|
| A path on your local computer     | ./home/username/data/my_data                                                      |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv     |
| A path on Azure Storage           | https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/path           |
|                                   | abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>             |
| A path on a datastore             | azureml://datastores/<data_store_name>/paths/<path>                               |
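
The path formats in the table can be told apart by their scheme and host name. The following is a minimal sketch of that idea, using only the Python standard library. The function classify_data_path and the labels it returns are hypothetical names chosen for illustration; they are not part of the Azure Machine Learning SDK or CLI.

```python
from urllib.parse import urlparse


def classify_data_path(path: str) -> str:
    """Return a label for the kind of location a data asset path points to.

    Hypothetical helper for illustration; the labels are not Azure ML terms.
    """
    parsed = urlparse(path)
    scheme = parsed.scheme.lower()
    host = parsed.netloc.lower()

    # azureml://datastores/<data_store_name>/paths/<path>
    if scheme == "azureml" and host == "datastores":
        return "datastore"
    # abfss://<file_system>@<account_name>.dfs.core.../<path>
    if scheme == "abfss":
        return "azure_storage_adls_gen2"
    if scheme in ("http", "https"):
        # Blob endpoints in the table above have ".blob.core." in the host name.
        if ".blob.core." in host:
            return "azure_storage_blob"
        return "public_http"
    # No scheme (or a single-letter Windows drive) is treated as a local path.
    if scheme == "" or len(scheme) == 1:
        return "local"
    return "unknown"


print(classify_data_path("./home/username/data/my_data"))
print(classify_data_path("azureml://datastores/my_store/paths/my_data"))
```

A check like this can catch a mistyped scheme before the path is written into a YAML specification, but it does not verify that the location exists or that you have access to it.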

Note

When you create a data asset from a local path, it will be automatically uploaded to the default Azure Machine Learning datastore in the cloud.

Create a uri_folder data asset

The following example shows how to create a data asset from a folder:

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:
# local: ./<path>
# blob:  https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
type: uri_folder
name: <name_of_data>
description: <description goes here>
path: <path>

Next, create the data asset using the CLI:

az ml data create -f <file-name>.yml
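
If you create many data assets, the two steps above (write the YAML file, then run the CLI) can be scripted. The following is a minimal sketch in Python that renders the YAML text and builds the CLI argument list without running it. The helper names render_data_yaml and build_create_command are hypothetical, and the sketch assumes that the name, description, and path values are plain strings that need no YAML quoting.

```python
from typing import List

SCHEMA_URL = "https://azuremlschemas.azureedge.net/latest/data.schema.json"

# The three data asset types covered in this article.
SUPPORTED_TYPES = ("uri_folder", "uri_file", "mltable")


def render_data_yaml(asset_type: str, name: str, description: str, path: str) -> str:
    """Render the text of a data asset YAML specification (hypothetical helper)."""
    if asset_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported data asset type: {asset_type}")
    lines = [
        f"$schema: {SCHEMA_URL}",
        f"type: {asset_type}",
        f"name: {name}",
        f"description: {description}",
        f"path: {path}",
    ]
    return "\n".join(lines) + "\n"


def build_create_command(yaml_file: str) -> List[str]:
    """Build the argument list for 'az ml data create -f <file>' (not executed here)."""
    return ["az", "ml", "data", "create", "-f", yaml_file]


yaml_text = render_data_yaml("uri_folder", "my_folder_asset", "Example folder asset", "./my_data")
print(yaml_text)
print(" ".join(build_create_command("my_folder_asset.yml")))
```

To actually create the asset, you would write yaml_text to the file and pass the argument list to subprocess.run, which requires the Azure CLI with the ml extension to be installed and signed in.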

Create a uri_file data asset

The following example shows how to create a data asset from a specific file:

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# Supported paths include:
# local: ./<path>/<file>
# blob:  https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/<path>/<file>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>/<file>

type: uri_file
name: <name>
description: <description>
path: <uri>

Next, create the data asset using the CLI:

az ml data create -f <file-name>.yml

Create a mltable data asset

mltable is a way to abstract the schema definition for tabular data to make it easier to share data assets (an overview can be found in MLTable).

In this section, we show you how to create a data asset when the type is an mltable.

The MLTable file

The MLTable file is a file that provides the specification of the data's schema so that the mltable engine can materialize the data into an in-memory object (Pandas/Dask/Spark). An example MLTable file is provided below:

type: mltable

paths:
  - pattern: ./*.txt
transformations:
  - read_delimited:
      delimiter: ,
      encoding: ascii
      header: all_files_same_headers
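
As a rough mental model only, the example specification above can be read as: find every file that matches ./*.txt, decode it as ASCII, split each line on commas, and treat the first line of every file as the same header row. The following sketch expresses that reading with the Python standard library. It is not how the mltable engine is implemented, and read_delimited_folder is a hypothetical name used for illustration.

```python
import csv
import glob
import os
import tempfile


def read_delimited_folder(folder, pattern="*.txt", delimiter=",", encoding="ascii"):
    """Read every matching file as delimited text and return the rows as dicts."""
    rows = []
    for file_path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(file_path, newline="", encoding=encoding) as handle:
            # DictReader uses the first line of each file as the header row,
            # which matches the "all files share the same header" setting.
            rows.extend(dict(row) for row in csv.DictReader(handle, delimiter=delimiter))
    return rows


# Small demonstration with two files that share the same header.
with tempfile.TemporaryDirectory() as folder:
    for file_name, body in (("file_1.txt", "a,b\n1,2\n"), ("file_2.txt", "a,b\n3,4\n")):
        with open(os.path.join(folder, file_name), "w", encoding="ascii") as handle:
            handle.write(body)
    rows = read_delimited_folder(folder)

print(rows)
```

The value of the MLTable file is that this reading logic is declared once, next to the data, instead of being repeated in every script that consumes the data.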

Important

We recommend co-locating the MLTable file with the underlying data in storage. For example:

├── my_data
│   ├── MLTable
│   ├── file_1.txt
│   ├── ...
│   └── file_n.txt

Co-locating the MLTable file with the data gives you a self-contained artifact: everything that is needed is stored in that one folder (my_data), regardless of whether the folder is on your local drive, in your cloud store, or on a public http server. Don't specify absolute paths in the MLTable file.
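
The two recommendations above (keep the MLTable file in the data folder, and avoid absolute paths) can be checked before you register the folder. The following is a minimal sketch that uses only the Python standard library and simple line matching rather than a YAML parser. The function check_mltable_folder is a hypothetical helper. The sketch assumes that paths appear as list entries such as "- pattern: ./*.txt"; the keys file and folder are also matched, although only pattern appears in the example in this article. Only file-system absolute paths are flagged.

```python
import os
import re
import tempfile

# Matches list entries such as "  - pattern: ./*.txt" in an MLTable file.
PATH_ENTRY = re.compile(r"\s*-\s*(pattern|file|folder):\s*(\S+)")


def check_mltable_folder(folder):
    """Return a list of problems; an empty list means both checks passed."""
    spec_path = os.path.join(folder, "MLTable")
    if not os.path.isfile(spec_path):
        return ["no MLTable file found in " + folder]

    problems = []
    with open(spec_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            match = PATH_ENTRY.match(line)
            if match is None:
                continue
            value = match.group(2).strip("'\"")
            # Flag POSIX-style (/data/...) and Windows-style (C:\data\...) absolute paths.
            if value.startswith("/") or re.match(r"[A-Za-z]:[\\/]", value):
                problems.append(f"line {number}: absolute path {value}")
    return problems


# Demonstration: the same folder with a relative and then an absolute pattern.
with tempfile.TemporaryDirectory() as folder:
    spec = os.path.join(folder, "MLTable")
    with open(spec, "w", encoding="utf-8") as handle:
        handle.write("type: mltable\npaths:\n  - pattern: ./*.txt\n")
    relative_result = check_mltable_folder(folder)
    with open(spec, "w", encoding="utf-8") as handle:
        handle.write("type: mltable\npaths:\n  - pattern: /mnt/data/*.txt\n")
    absolute_result = check_mltable_folder(folder)

print(relative_result)
print(absolute_result)
```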

In your Python code, you materialize the MLTable artifact into a Pandas dataframe using:

import mltable

tbl = mltable.load(uri="./my_data")
df = tbl.to_pandas_dataframe()

The uri parameter in mltable.load() must be a valid path to a local or cloud folder that contains a valid MLTable file.

Note

You will need the mltable library installed in your Environment (pip install mltable).

The following example shows how to create an mltable data asset. The path can be in any of the supported path formats outlined above.

Create a YAML file (<file-name>.yml):

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json

# path must point to the **folder** containing the MLTable artifact (MLTable file + data)
# Supported paths include:
# local: ./<path>
# blob:  https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>

type: mltable
name: <name_of_data>
description: <description goes here>
path: <path>

Note

The path points to the folder containing the MLTable artifact.

Next, create the data asset using the CLI:

az ml data create -f <file-name>.yml

Next steps