APPLIES TO: Python SDK azure-ai-ml v2 (current) and Azure CLI ml extension v2 (current)
In this article, you learn how to create a Data asset in Azure Machine Learning. By creating a Data asset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and don't risk the integrity of your data sources. You can create Data from Datastores, Azure Storage, public URLs, and local files.
The benefits of creating Data assets are:
You can share and reuse data with other team members, so they don't need to remember file locations.
You can seamlessly access data during model training (on any supported compute type) without worrying about connection strings or data paths.
You can version the data.
Prerequisites
To create and work with Data assets, you need:
An Azure subscription. If you don't have one, create a trial subscription before you begin. Try the free or paid version of Azure Machine Learning.
An Azure Machine Learning workspace. To create one, see Create workspace resources.
The Azure Machine Learning CLI or SDK, and the MLTable package (pip install mltable).
Supported paths
When you create a data asset in Azure Machine Learning, you need to specify a path parameter that points to the data's location. The following table shows the data locations that Azure Machine Learning supports, with example values for the path parameter:
Location | Examples |
---|---|
A path on your local computer | ./home/username/data/my_data |
A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv |
A path on Azure Storage | https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/<path> or abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path> |
A path on a datastore | azureml://datastores/<data_store_name>/paths/<path> |
Note
When you create a data asset from a local path, it will be automatically uploaded to the default Azure Machine Learning datastore in the cloud.
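The four location types in the table can be told apart by the form of the path string. The following is a minimal sketch of that idea, not part of any Azure library: the function name, the returned labels, and the host-name check for Azure Storage endpoints are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def classify_data_path(path: str) -> str:
    """Return a label for the kind of location a data asset path points to.

    The labels are hypothetical names chosen for this sketch.
    """
    parsed = urlparse(path)
    scheme = parsed.scheme.lower()
    if scheme == "azureml":
        return "datastore"
    if scheme == "abfss":
        return "azure_storage"
    if scheme in ("http", "https"):
        host = parsed.netloc.lower()
        # Assumption: Azure Storage hosts contain ".blob.core." or ".dfs.core."
        if ".blob.core." in host or ".dfs.core." in host:
            return "azure_storage"
        return "public_http"
    # Anything without a recognized scheme is treated as a local path.
    return "local"
```

A local result is the case where, as the note above says, the data is uploaded to the default datastore when the asset is created.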
Create a uri_folder data asset
The following steps show how to create a data asset from a folder.
Create a YAML file (<file-name>.yml):
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
# Supported paths include:
# local: ./<path>
# blob: https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
type: uri_folder
name: <name_of_data>
description: <description goes here>
path: <path>
Next, create the data asset using the CLI:
az ml data create -f <file-name>.yml
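The two steps above (write the YAML spec, then run az ml data create) can also be scripted. The sketch below uses only the Python standard library; the function name write_data_spec and its parameters are hypothetical, and the fields it writes are the ones shown in the YAML above. It returns the CLI command as an argument list rather than running it.

```python
import json
from pathlib import Path

SCHEMA_URL = "https://azuremlschemas.azureedge.net/latest/data.schema.json"
ASSET_TYPES = ("uri_folder", "uri_file", "mltable")


def write_data_spec(spec_file, asset_type, name, description, path):
    """Write a data asset YAML spec and return the CLI command that registers it."""
    if asset_type not in ASSET_TYPES:
        raise ValueError(f"unsupported asset type: {asset_type!r}")
    fields = {"name": name, "description": description, "path": path}
    lines = [f"$schema: {SCHEMA_URL}", f"type: {asset_type}"]
    # json.dumps yields double-quoted strings, which YAML also accepts,
    # so values containing ':' or '#' stay intact.
    lines += [f"{key}: {json.dumps(value)}" for key, value in fields.items()]
    Path(spec_file).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return ["az", "ml", "data", "create", "-f", str(spec_file)]
```

Assuming the Azure CLI with the ml extension is installed and you are signed in, the returned list can be passed to subprocess.run(cmd, check=True) to create the asset.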
Create a uri_file data asset
The following steps show how to create a data asset from a specific file.
Create a YAML file (<file-name>.yml):
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
# Supported paths include:
# local: ./<path>/<file>
# blob: https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/<path>/<file>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/<file>
# Datastore: azureml://datastores/<data_store_name>/paths/<path>/<file>
type: uri_file
name: <name>
description: <description>
path: <uri>
Next, create the data asset using the CLI:
az ml data create -f <file-name>.yml
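For readers who use the Python SDK (azure-ai-ml v2) instead of the CLI, the same registration can be sketched roughly as follows. The helper name register_uri_file and the default version "1" are assumptions made for this example. The ml_client argument is expected to be an MLClient that is already connected to your workspace. Sovereign cloud setups may need extra client configuration that is not shown here.

```python
def register_uri_file(ml_client, name, path, description="", version="1"):
    """Register a single file as a uri_file data asset and return the result.

    ml_client is assumed to be an azure.ai.ml.MLClient for your workspace.
    """
    # Imported here so this module can be loaded without the SDK installed.
    from azure.ai.ml.entities import Data

    asset = Data(
        name=name,
        version=version,  # assumption: an explicit version label chosen by you
        description=description,
        path=path,
        type="uri_file",  # same value as the `type` field in the YAML spec
    )
    return ml_client.data.create_or_update(asset)
```

A client is typically built with MLClient(DefaultAzureCredential(), subscription_id, resource_group_name, workspace_name), using DefaultAzureCredential from the azure-identity package. The path accepts the same formats as the YAML path field.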
Create an mltable data asset
mltable is a way to abstract the schema definition for tabular data, which makes it easier to share data assets (for an overview, see MLTable).
This section shows how to create a data asset when the type is mltable.
The MLTable file
The MLTable file specifies the data's schema so that the mltable engine can materialize the data into an in-memory object (Pandas, Dask, or Spark). An example MLTable file follows:
type: mltable
paths:
  - pattern: ./*.txt
transformations:
  - read_delimited:
      delimiter: ,
      encoding: ascii
      header: all_files_same_headers
Important
We recommend co-locating the MLTable file with the underlying data in storage. For example:
├── my_data
│ ├── MLTable
│ ├── file_1.txt
.
.
.
│ ├── file_n.txt
Co-locating the MLTable file with the data produces a self-contained artifact: everything that is needed is stored in one folder (my_data), whether that folder is on your local drive, in your cloud store, or on a public http server. For the same reason, don't specify absolute paths in the MLTable file.
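The advice about absolute paths can be checked mechanically before you share a folder. The following is a rough sketch that scans the text of an MLTable file for absolute local paths. It deliberately avoids a YAML parser and assumes each path entry sits on one line in the form "- pattern: <value>". The function name is hypothetical, and the file and folder keys are included on the assumption that they are written the same way as pattern.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# Assumption: path entries look like "- pattern: <value>" on a single line.
_PATH_ENTRY = re.compile(r"^\s*-\s*(pattern|file|folder)\s*:\s*(.+?)\s*$")


def find_absolute_paths(mltable_text: str) -> list[str]:
    """Return path values in an MLTable file that are absolute local paths."""
    found = []
    for line in mltable_text.splitlines():
        match = _PATH_ENTRY.match(line)
        if not match:
            continue
        value = match.group(2).strip("'\"")
        # Check both POSIX ("/mnt/data") and Windows ("C:\\data") conventions.
        if PurePosixPath(value).is_absolute() or PureWindowsPath(value).is_absolute():
            found.append(value)
    return found
```

An empty result means every entry is relative, such as ./*.txt in the example MLTable file, so the folder can be moved or uploaded as a unit.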
In your Python code, you materialize the MLTable artifact into a Pandas dataframe using:
import mltable
tbl = mltable.load(uri="./my_data")
df = tbl.to_pandas_dataframe()
The uri parameter in mltable.load() should be a valid path to a local or cloud folder that contains a valid MLTable file.
Note
You need the mltable library installed in your environment (pip install mltable).
The following steps show how to create an mltable data asset. The path can be any of the supported path formats outlined above.
Create a YAML file (<file-name>.yml):
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
# path must point to **folder** containing MLTable artifact (MLTable file + data)
# Supported paths include:
# local: ./<path>
# blob: https://<account_name>.blob.core.chinacloudapi.cn/<container_name>/<path>
# ADLS gen2: abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path>/
# Datastore: azureml://datastores/<data_store_name>/paths/<path>
type: mltable
name: <name_of_data>
description: <description goes here>
path: <path>
Note
The path points to the folder containing the MLTable artifact.
Next, create the data asset using the CLI:
az ml data create -f <file-name>.yml
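Because the path must point to the folder that holds the MLTable file, not to the file itself, a quick local check before you run the CLI can catch a wrong path early. This is a small sketch for local folders only; the function name is hypothetical.

```python
from pathlib import Path


def is_mltable_folder(folder: str) -> bool:
    """Return True if the local folder directly contains a file named MLTable."""
    return (Path(folder) / "MLTable").is_file()
```

For the my_data layout shown earlier, is_mltable_folder("./my_data") is expected to return True, while passing the path of the MLTable file itself returns False.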