Transform data in Azure Machine Learning designer

03/03/2023

In this article, you'll learn how to transform and save datasets in the Azure Machine Learning designer, to prepare your own data for machine learning.

You'll use the sample Adult Census Income Binary Classification dataset to prepare two datasets: one dataset that includes adult census information from only the United States, and another dataset that includes census information from non-US adults.

In this article, you'll learn how to:

Transform a dataset to prepare it for training.
Export the resulting datasets to a datastore.
View the results.

This how-to is a prerequisite for the how-to article on retraining designer models. In that article, you'll learn how to use the transformed datasets to train multiple models with pipeline parameters.

Important

If you do not see graphical elements mentioned in this document, such as buttons in studio or designer, you may not have the right level of permissions to the workspace. Please contact your Azure subscription administrator to verify that you have been granted the correct level of access. For more information, see Manage users and roles.

Transform a dataset

In this section, you'll learn how to import the sample dataset, and split the data into US and non-US datasets. See how to import data for more information about how to import your own data into the designer.

Import data

Use these steps to import the sample dataset:

Sign in to studio.ml.azure.cn, and select the workspace you want to use.
Go to the designer. Select Easy-to-use prebuilt components to create a new pipeline.
Select a default compute target to run the pipeline.
To the left of the pipeline canvas, you'll see a palette of datasets and components. Select Datasets. Then view the Samples section.
Drag and drop the Adult Census Income Binary classification dataset onto the canvas.
Right-click the Adult Census Income dataset component, and select Visualize > Dataset output.
Use the data preview window to explore the dataset. Take special note of the "native-country" column values.
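
Outside the designer, the same kind of exploration can be sketched in a few lines of Python. This is only an illustration: the rows below are hypothetical stand-ins for the sample dataset, and count_values is a helper name invented for this example, not part of any Azure Machine Learning API.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset has many more rows and columns.
sample_rows = [
    {"age": 39, "native-country": "United-States"},
    {"age": 50, "native-country": "United-States"},
    {"age": 28, "native-country": "Cuba"},
    {"age": 37, "native-country": "Mexico"},
]


def count_values(rows, column):
    """Count how often each distinct value appears in one column."""
    return Counter(row[column] for row in rows)


country_counts = count_values(sample_rows, "native-country")

# List the values from most to least frequent.
for country, count in country_counts.most_common():
    print(f"{country}: {count}")
```

A tally like this shows which values the "native-country" column holds, which is the information you need before you decide how to split the data.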

Split the data

In this section, you'll use the Split Data component to identify and split rows that contain "United-States" in the "native-country" column.

To the left of the canvas, in the component palette, expand the Data Transformation section, and find the Split Data component.
Drag the Split Data component onto the canvas, and drop that component below the dataset component.
Connect the dataset component to the Split Data component.
Select the Split Data component.
To the right of the canvas in the component details pane, set Splitting mode to Regular Expression.
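For Regular expression, enter: \"native-country" United-States
The quoted name after the backslash is the column to test, and the text that follows it is the pattern to match.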

The Regular expression mode tests a single column for a value. See the related algorithm component reference page for more information on the Split Data component.
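
To make the split logic concrete, here is a minimal Python sketch that mimics it on in-memory rows. It is not the component's implementation. The helper names parse_split_expression and split_by_regex and the sample rows are hypothetical, and the use of re.search (a match anywhere in the value) is an assumption about matching behavior.

```python
import re

# Hypothetical sample rows standing in for the census dataset.
sample_rows = [
    {"age": 39, "native-country": "United-States"},
    {"age": 28, "native-country": "Cuba"},
    {"age": 50, "native-country": "United-States"},
    {"age": 37, "native-country": "Mexico"},
]


def parse_split_expression(expression):
    """Return (column_name, pattern) from a backslash-quoted column expression."""
    match = re.fullmatch(r'\\"([^"]+)"\s+(.+)', expression)
    if match is None:
        raise ValueError(f"unrecognized expression: {expression!r}")
    return match.group(1), match.group(2)


def split_by_regex(rows, column, pattern):
    """Partition rows into (matching, non-matching) by testing one column."""
    compiled = re.compile(pattern)
    matched, unmatched = [], []
    for row in rows:
        # Assumption: a match anywhere in the value counts as a match.
        target = matched if compiled.search(str(row[column])) else unmatched
        target.append(row)
    return matched, unmatched


column, pattern = parse_split_expression(r'\"native-country" United-States')
us_rows, non_us_rows = split_by_regex(sample_rows, column, pattern)
print(len(us_rows), len(non_us_rows))
```

The first list plays the role of the component's first output port (rows that match), and the second list plays the role of the second port (all remaining rows).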

Your pipeline should look like this:

Screenshot that shows how to configure the pipeline and the Split Data component

Save the datasets

Now that you've set up your pipeline to split the data, you must specify where to persist the datasets. For this example, use the Export Data component to save each dataset to a datastore. See Connect to Azure storage services for more information about datastores.

To the left of the canvas in the component palette, expand the Data Input and Output section, and find the Export Data component.
Drag and drop two Export Data components below the Split Data component.
Connect each output port of the Split Data component to a different Export Data component.

Your pipeline should look something like this:

Screenshot that shows the Split Data component connected to two Export Data components
Select the Export Data component connected to the left-most port of the Split Data component.

For the Split Data component, the output port order matters. The first output port contains the rows that match the regular expression, and the second output port contains all remaining rows. In this case, the first port contains rows for US-based income, and the second port contains rows for non-US based income.
In the component details pane to the right of the canvas, set the following options:

Datastore type: Azure Blob Storage

Datastore: Select an existing datastore, or select "New datastore" to create one now.

Path: /data/us-income

File format: csv

Note

This article assumes that you have access to a datastore registered to the current Azure Machine Learning workspace. See Connect to Azure storage services for datastore setup instructions.

You can create a datastore if you don't have one now. For example purposes, this article will save the datasets to the default blob storage account associated with the workspace. It will save the datasets into the azureml container, in a new folder named data.
Select the Export Data component connected to the right-most port of the Split Data component.
To the right of the canvas in the component details pane, set the following options:

Datastore type: Azure Blob Storage

Datastore: Select the same datastore as above

Path: /data/non-us-income

File format: csv
Verify that the Export Data component connected to the left port of the Split Data component has the Path /data/us-income.
Verify that the Export Data component connected to the right port has the Path /data/non-us-income.

Your pipeline and settings should look like this:

Screenshot that shows the pipeline and the Export Data component settings
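
As a rough local analogy for the two Export Data components, the following Python sketch writes two row sets to CSV files under the same relative paths. Everything here is an assumption made for illustration: a temporary folder stands in for the datastore container, export_rows is a hypothetical helper, and the rows are made-up samples. It does not call any Azure storage API.

```python
import csv
import tempfile
from pathlib import Path


def export_rows(rows, root, relative_path):
    """Write rows as CSV under root, creating parent folders as needed."""
    target = Path(root) / relative_path.lstrip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return target


# Hypothetical rows standing in for the two Split Data outputs.
us_rows = [
    {"age": 39, "native-country": "United-States"},
    {"age": 50, "native-country": "United-States"},
]
non_us_rows = [
    {"age": 28, "native-country": "Cuba"},
    {"age": 37, "native-country": "Mexico"},
]

# A temporary folder stands in for the datastore container.
export_root = tempfile.mkdtemp()
us_path = export_rows(us_rows, export_root, "/data/us-income")
non_us_path = export_rows(non_us_rows, export_root, "/data/non-us-income")
print(us_path.read_text(encoding="utf-8"))
```

Keeping the two outputs under one data folder, with distinct names, mirrors the Path settings used in the steps above and makes it harder to mix up the US and non-US files later.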

Submit the job

Now that you've set up your pipeline to split and export the data, submit a pipeline job.

Select Submit at the top of the canvas.
In Set up pipeline job, select Create new to create an experiment.

Experiments logically group related pipeline jobs together. If you run this pipeline in the future, you should use the same experiment for logging and tracking purposes.
Provide a descriptive experiment name, for example "split-census-data".
Select Submit.

View results

After the pipeline finishes running, you can go to your blob storage in the Azure portal to view the results. You can also view the intermediate results of the Split Data component to confirm that your data was split correctly.

Select the Split Data component.
In the component details pane to the right of the canvas, select Outputs + logs.
Select the visualize icon next to Results dataset1.
Verify that the "native-country" column contains only the value "United-States".
Select the visualize icon next to Results dataset2.
Verify that the "native-country" column does not contain the value "United-States".
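
The two verification steps above amount to a simple check that could also be scripted. The sketch below assumes the two outputs are available as lists of rows. The check_split helper and the sample rows are hypothetical and only illustrate the logic.

```python
def check_split(first_rows, second_rows, column, value):
    """Return True when value fills column in first_rows and is absent from second_rows."""
    first_ok = all(row[column] == value for row in first_rows)
    second_ok = all(row[column] != value for row in second_rows)
    return first_ok and second_ok


# Hypothetical rows standing in for Results dataset1 and Results dataset2.
dataset1 = [
    {"age": 39, "native-country": "United-States"},
    {"age": 50, "native-country": "United-States"},
]
dataset2 = [
    {"age": 28, "native-country": "Cuba"},
    {"age": 37, "native-country": "Mexico"},
]

good_split = check_split(dataset1, dataset2, "native-country", "United-States")

# Swapping the two outputs simulates connecting the ports the wrong way around.
swapped_split = check_split(dataset2, dataset1, "native-country", "United-States")

print(good_split, swapped_split)
```

A check like this also catches the case where the output ports were connected in the wrong order, because the swapped arrangement fails both conditions.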

Clean up resources

To continue with part two of this how-to, Retrain models with Azure Machine Learning designer, skip this section.

Important

You can use the resources that you created as prerequisites for other Azure Machine Learning tutorials and how-to articles.

Delete everything

If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges.

In the Azure portal, select Resource groups on the left side of the window.
In the list, select the resource group that you created.
Select Delete resource group.

Deleting the resource group also deletes all resources that you created in the designer.

Delete individual assets

In the designer where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.

The compute target that you created here automatically scales down to zero nodes when it's not being used, to minimize charges. If you want to delete the compute target, take these steps:

Screenshot that shows how to delete assets

You can unregister datasets from your workspace by selecting each dataset and selecting Unregister.

Screenshot that shows how to unregister a dataset

To delete a dataset, go to the storage account by using the Azure portal or Azure Storage Explorer and manually delete those assets.

Next steps

In this article, you learned how to transform a dataset, and save it to a registered datastore.

Continue to the next part of this how-to series with Retrain models with Azure Machine Learning designer, to use your transformed datasets and pipeline parameters to train machine learning models.