Tutorial: Designer - train a no-code regression model

Train a linear regression model that predicts car prices using the Azure Machine Learning designer. This tutorial is part one of a two-part series.

This tutorial uses the Azure Machine Learning designer. For more information, see What is Azure Machine Learning designer?

In part one of the tutorial, you learn how to:

  • Create a new pipeline.
  • Import data.
  • Prepare data.
  • Train a machine learning model.
  • Evaluate a machine learning model.

In part two of the tutorial, you deploy your model as a real-time inferencing endpoint to predict the price of any car based on technical specifications you send it.

Note

A completed version of this tutorial is available as a sample pipeline.

To find it, go to the designer in your workspace. In the New pipeline section, select Sample 1 - Regression: Automobile Price Prediction(Basic).

Important

If you do not see graphical elements mentioned in this document, such as buttons in studio or designer, you may not have the right level of permissions to the workspace. Please contact your Azure subscription administrator to verify that you have been granted the correct level of access. For more information, see Manage users and roles.

Create a new pipeline

Azure Machine Learning pipelines organize multiple machine learning and data processing steps into a single resource. Pipelines let you organize, manage, and reuse complex machine learning workflows across projects and users.

To create an Azure Machine Learning pipeline, you need an Azure Machine Learning workspace. In this section, you learn how to create both these resources.

Create a new workspace

You need an Azure Machine Learning workspace to use the designer. The workspace is the top-level resource for Azure Machine Learning; it provides a centralized place to work with all the artifacts you create in Azure Machine Learning. For instructions on creating a workspace, see Create workspace resources.

Note

If your workspace uses a virtual network, you must complete additional configuration steps to use the designer. For more information, see Use Azure Machine Learning studio in an Azure virtual network.

Create the pipeline

Note

Designer supports two types of components: classic prebuilt components and custom components. These two types of components are not compatible.

Classic prebuilt components are provided mainly for data processing and traditional machine learning tasks like regression and classification. This type of component continues to be supported but will not have any new components added.

Custom components allow you to provide your own code as a component. They support sharing across workspaces and seamless authoring across Studio, CLI, and SDK interfaces.

This article applies to classic prebuilt components.

  1. Sign in to https://studio.ml.azure.cn, and select the workspace you want to work with.

  2. Select Designer -> Classic prebuilt.

    Screenshot of the visual workspace showing how to access the designer.

  3. Select Create a new pipeline using classic prebuilt components.

  4. Select the pencil icon beside the automatically generated pipeline draft name, and rename it to Automobile price prediction. The name doesn't need to be unique.

Screenshot of pencil icon to change pipeline draft name.

Set the default compute target

A pipeline job runs on a compute target, which is a compute resource that's attached to your workspace. After you create a compute target, you can reuse it for future jobs.

Important

Attached compute is not supported. Use compute instances or clusters instead.

You can set a Default compute target for the entire pipeline, which tells every component to use the same compute target by default. However, you can also specify compute targets on a per-component basis.

  1. Select the gear icon (Settings) to the right of the canvas to open the Settings pane.

  2. Select Create Azure Machine Learning compute instance.

    If you already have an available compute target, you can select it from the Select Azure Machine Learning compute instance drop-down to run this pipeline.

  3. Enter a name for the compute resource.

  4. Select Create.

    Note

    It takes approximately five minutes to create a compute resource. After the resource is created, you can reuse it and skip this wait time for future jobs.

    The compute resource autoscales to zero nodes when it's idle to save cost. When you use it again after a delay, you might experience approximately five minutes of wait time while it scales back up.

Import data

There are several sample datasets included in the designer for you to experiment with. For this tutorial, use Automobile price data (Raw).

  1. To the left of the pipeline canvas is a palette of datasets and components. Select Component -> Sample data.

  2. Select the dataset Automobile price data (Raw), and drag it onto the canvas.

    Gif of dragging the Automobile price data to the canvas.

Visualize the data

You can visualize the data to understand the dataset that you'll use.

  1. Right-click the Automobile price data (Raw) and select Preview Data.

  2. Select the different columns in the data window to view information about each one.

    Each row represents an automobile, and the variables associated with each automobile appear as columns. There are 205 rows and 26 columns in this dataset.
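The same kind of inspection can be sketched outside the designer. The following is a minimal illustration, not the designer's implementation: the three sample rows are made up, the "?" marker for missing entries is an assumption, and `summarize` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical three-row excerpt; the values are made up for illustration.
# Assumption: missing entries are written as "?" in the raw text.
SAMPLE_CSV = """make,normalized-losses,horsepower,price
alfa-romero,?,111,13495
audi,164,102,13950
bmw,?,?,16430
"""

MISSING_MARKER = "?"


def summarize(csv_text):
    """Return (row_count, column_count, missing_count_per_column)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    missing = {
        name: sum(1 for row in rows if row[name] == MISSING_MARKER)
        for name in columns
    }
    return len(rows), len(columns), missing


row_count, column_count, missing = summarize(SAMPLE_CSV)
print(row_count, column_count)
print(missing)
```

Counting missing entries per column is what motivates the preparation steps that follow: a column with many gaps is a candidate for exclusion, while sparse gaps can be handled row by row.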

Prepare data

Datasets typically require some preprocessing before analysis. You might have noticed some missing values when you inspected the dataset. These missing values must be cleaned so that the model can analyze the data correctly.

Remove a column

When you train a model, you have to do something about the data that's missing. In this dataset, the normalized-losses column is missing many values, so you'll exclude that column from the model altogether.

  1. In the datasets and component palette to the left of the canvas, click Component and search for the Select Columns in Dataset component.

  2. Drag the Select Columns in Dataset component onto the canvas. Drop the component below the dataset component.

  3. Connect the Automobile price data (Raw) dataset to the Select Columns in Dataset component. Drag from the dataset's output port, which is the small circle at the bottom of the dataset on the canvas, to the input port of Select Columns in Dataset, which is the small circle at the top of the component.

    Tip

    You create a flow of data through your pipeline when you connect the output port of one component to an input port of another.

    Screenshot of connecting Automobile price data component to select columns in dataset component.

  4. Select the Select Columns in Dataset component.

  5. Click on the arrow icon under Settings to the right of the canvas to open the component details pane. Alternatively, you can double-click the Select Columns in Dataset component to open the details pane.

  6. Select Edit column to the right of the pane.

  7. Expand the Column names drop down next to Include, and select All columns.

  8. Select the + to add a new rule.

  9. From the drop-down menus, select Exclude and Column names.

  10. Enter normalized-losses in the text box.

  11. In the lower right, select Save to close the column selector.

    Screenshot of select columns with exclude highlighted.

  12. In the Select Columns in Dataset component details pane, expand Node info.

  13. Select the Comment text box and enter Exclude normalized losses.

    Comments will appear on the graph to help you organize your pipeline.
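Conceptually, the two rules configured above (include all columns, then exclude normalized-losses) amount to a column filter. The sketch below illustrates that idea in plain Python; `select_columns` is a hypothetical helper and the rows are made up, so treat it as an analogy rather than what the component runs.

```python
def select_columns(rows, exclude):
    """Keep every column except the names listed in `exclude`."""
    excluded = set(exclude)
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# Made-up rows for illustration only.
rows = [
    {"make": "audi", "normalized-losses": 164, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "price": 16430},
]

selected = select_columns(rows, exclude=["normalized-losses"])
print(selected)
```

The helper returns new dictionaries and leaves the input rows unchanged, which mirrors how a pipeline step produces a new output rather than editing the source dataset.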

Clean missing data

Your dataset still has missing values after you remove the normalized-losses column. You can remove the remaining missing data by using the Clean Missing Data component.

Tip

Cleaning the missing values from input data is a prerequisite for using most of the components in the designer.

  1. In the datasets and component palette to the left of the canvas, click Component and search for the Clean Missing Data component.

  2. Drag the Clean Missing Data component to the pipeline canvas. Connect it to the Select Columns in Dataset component.

  3. Select the Clean Missing Data component.

  4. Click on the arrow icon under Settings to the right of the canvas to open the component details pane. Alternatively, you can double-click the Clean Missing Data component to open the details pane.

  5. Select Edit column to the right of the pane.

  6. In the Columns to be cleaned window that appears, expand the drop-down menu next to Include. Select All columns.

  7. Select Save.

  8. In the Clean Missing Data component details pane, under Cleaning mode, select Remove entire row.

  9. In the Clean Missing Data component details pane, expand Node info.

  10. Select the Comment text box and enter Remove missing value rows.

    Your pipeline should now look something like this:

    Screenshot of automobile price data connected to select columns in dataset component, which is connected to clean missing data.
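The Remove entire row cleaning mode can be pictured as a simple filter over rows. This is a minimal sketch under assumptions: missing values are represented as `None`, the rows are made up, and `remove_rows_with_missing` is a hypothetical helper rather than the component's code.

```python
def remove_rows_with_missing(rows):
    """Split rows into (complete, dropped) by whether any value is None."""
    complete, dropped = [], []
    for row in rows:
        if any(value is None for value in row.values()):
            dropped.append(row)
        else:
            complete.append(row)
    return complete, dropped


# Made-up rows for illustration only.
rows = [
    {"make": "audi", "horsepower": 102, "price": 13950},
    {"make": "bmw", "horsepower": None, "price": 16430},
    {"make": "mazda", "horsepower": 68, "price": 5195},
]

complete, dropped = remove_rows_with_missing(rows)
print(len(complete), len(dropped))
```

Dropping rows is the simplest strategy and loses data; it is reasonable here because the column with the most gaps was already excluded in the previous step.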

Train a machine learning model

Now that you have the components in place to process the data, you can set up the training components.

Because you want to predict price, which is a number, you can use a regression algorithm. For this example, you use a linear regression model.

Split the data

Splitting data is a common task in machine learning. You'll split your data into two separate datasets. One dataset will train the model and the other will test how well the model performed.

  1. In the datasets and component palette to the left of the canvas, click Component and search for the Split Data component.

  2. Drag the Split Data component to the pipeline canvas.

  3. Connect the left port of the Clean Missing Data component to the Split Data component.

    Important

    Make sure that the left output port of Clean Missing Data connects to Split Data. The left port contains the cleaned data. The right port contains the discarded data.

  4. Select the Split Data component.

  5. Click on the arrow icon under Settings to the right of the canvas to open the component details pane. Alternatively, you can double-click the Split Data component to open the details pane.

  6. In the Split Data details pane, set the Fraction of rows in the first output dataset to 0.7.

    This option splits 70 percent of the data to train the model and 30 percent for testing it. The 70 percent dataset will be accessible through the left output port. The remaining data will be available through the right output port.

  7. In the Split Data details pane, expand Node info.

  8. Select the Comment text box and enter Split the dataset into training set (0.7) and test set (0.3).
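A 70/30 split like the one configured above can be sketched in a few lines. The example assumes a randomized split with a fixed seed for repeatability; `split_rows`, its parameters, and the integer stand-in rows are hypothetical and not taken from the designer.

```python
import random


def split_rows(rows, fraction=0.7, seed=0):
    """Shuffle a copy of `rows`, then cut it at `fraction` of its length."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten stand-in rows; real rows would be records from the cleaned dataset.
rows = list(range(10))
train, test = split_rows(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two outputs, so the test set contains only data the model never saw during training.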

Train the model

Train the model by giving it a dataset that includes the price. The algorithm constructs a model that explains the relationship between the features and the price as presented by the training data.

  1. In the datasets and component palette to the left of the canvas, click Component and search for the Linear Regression component.

  2. Drag the Linear Regression component to the pipeline canvas.

  3. In the datasets and component palette to the left of the canvas, click Component and search for the Train Model component.

  4. Drag the Train Model component to the pipeline canvas.

  5. Connect the output of the Linear Regression component to the left input of the Train Model component.

  6. Connect the training data output (left port) of the Split Data component to the right input of the Train Model component.

    Important

    Make sure that the left output port of Split Data connects to Train Model. The left port contains the training set. The right port contains the test set.

    Screenshot showing that Linear Regression connects to the left port of Train Model and Split Data connects to the right port of Train Model.

  7. Select the Train Model component.

  8. Click on the arrow icon under Settings to the right of the canvas to open the component details pane. Alternatively, you can double-click the Train Model component to open the details pane.

  9. Select Edit column to the right of the pane.

  10. In the Label column window that appears, expand the drop-down menu and select Column names.

  11. In the text box, enter price to specify the value that your model is going to predict.

    Important

    Make sure you enter the column name exactly. Do not capitalize price.

    Your pipeline should look like this:

    Screenshot showing the correct configuration of the pipeline after adding the Train Model component.
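To make the idea of "constructing a model from the training data" concrete, here is a minimal sketch of ordinary least squares with a single feature. It is an illustration only: the designer's Linear Regression component works with many features and its internal method may differ, and the rows, the `horsepower` feature choice, and `fit_line` are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training rows; "price" is the label column, as in the tutorial.
training_rows = [
    {"horsepower": 100, "price": 10000},
    {"horsepower": 150, "price": 15000},
    {"horsepower": 200, "price": 20000},
]
slope, intercept = fit_line(
    [row["horsepower"] for row in training_rows],
    [row["price"] for row in training_rows],
)
print(slope, intercept)
```

The label column plays the role of `ys` here, which is why the column name entered in the Train Model component must match the dataset exactly.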

Add the Score Model component

After you train your model by using 70 percent of the data, you can use it to score the other 30 percent to see how well your model functions.

  1. In the datasets and component palette to the left of the canvas, click Component and search for the Score Model component.

  2. Drag the Score Model component to the pipeline canvas.

  3. Connect the output of the Train Model component to the left input port of Score Model. Connect the test data output (right port) of the Split Data component to the right input port of Score Model.
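Scoring means applying the trained model to rows it has not seen and recording the prediction next to the actual value. The sketch below assumes a one-feature linear model with hypothetical coefficients; `score_rows`, the `horsepower` feature, and the "Scored Labels" key name are illustrative choices, not the component's guaranteed output format.

```python
def score_rows(rows, slope, intercept, feature="horsepower"):
    """Return copies of `rows` with a predicted price added to each."""
    return [
        {**row, "Scored Labels": slope * row[feature] + intercept}
        for row in rows
    ]


# Hypothetical coefficients standing in for a trained model,
# and made-up test rows that still carry the actual price.
test_rows = [
    {"horsepower": 120, "price": 12500},
    {"horsepower": 180, "price": 17000},
]
scored = score_rows(test_rows, slope=100.0, intercept=0.0)
for row in scored:
    print(row["price"], row["Scored Labels"])
```

Keeping the actual price alongside the prediction is what allows a later evaluation step to measure the error.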

Add the Evaluate Model component

Use the Evaluate Model component to evaluate how well your model scored the test dataset.

  1. In the datasets and component palette to the left of the canvas, click Component and search for the Evaluate Model component.

  2. Drag the Evaluate Model component to the pipeline canvas.

  3. Connect the output of the Score Model component to the left input of Evaluate Model.

    The final pipeline should look something like this:

    Screenshot showing the correct configuration of the pipeline.
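The connections made in this tutorial form a directed graph, and a valid execution order is any ordering in which each component runs after the components that feed it. The sketch below models that with Python's standard `graphlib` module (Python 3.9 or later). The dictionary is an assumption that restates the connections described above; it is not an export format used by the designer.

```python
from graphlib import TopologicalSorter

# Each key lists the components whose output it consumes.
pipeline = {
    "Select Columns in Dataset": {"Automobile price data (Raw)"},
    "Clean Missing Data": {"Select Columns in Dataset"},
    "Split Data": {"Clean Missing Data"},
    "Train Model": {"Linear Regression", "Split Data"},
    "Score Model": {"Train Model", "Split Data"},
    "Evaluate Model": {"Score Model"},
}

# static_order() yields each component after all of its inputs.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Because Evaluate Model depends, directly or indirectly, on every other component, it always comes last in the order.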

Submit the pipeline

Now that your pipeline is set up, you can submit a pipeline job to train your machine learning model. You can submit a valid pipeline job at any point, which lets you review changes to your pipeline during development.

  1. At the top of the canvas, select Submit.

  2. In the Set up pipeline job dialog box, select Create new.

    Note

    Experiments group similar pipeline jobs together. If you run a pipeline multiple times, you can select the same experiment for successive jobs.

    1. For New experiment Name, enter Tutorial-CarPrices.

    2. Select Submit.

    3. You'll see a submission list in the left pane of the canvas, and a notification will pop up at the top right corner of the page. You can select the Job detail link to go to the job detail page for debugging.

      Screenshot of the submitted jobs list with a success notification.

    If this is the first job, it may take up to 20 minutes for your pipeline to finish running. The default compute settings have a minimum node size of 0, which means that the designer must allocate resources after being idle. Repeated pipeline jobs will take less time since the compute resources are already allocated. Additionally, the designer uses cached results for each component to further improve efficiency.

View scored labels

In the job detail page, you can check the pipeline job status, results and logs.

Screenshot showing the pipeline job detail page.

After the job completes, you can view the results of the pipeline job. First, look at the predictions generated by the regression model.

  1. Right-click the Score Model component, and select Preview data > Scored dataset to view its output.

    Here you can see the predicted prices and the actual prices from the testing data.

    Screenshot of the output visualization highlighting the Scored Label column.

Evaluate models

Use the Evaluate Model component to see how well the trained model performed on the test dataset.

  1. Right-click the Evaluate Model component and select Preview data > Evaluation results to view its output.

The following statistics are shown for your model:

  • Mean Absolute Error (MAE): The average of absolute errors. An error is the difference between the predicted value and the actual value.
  • Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made on the test dataset.
  • Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual values and the average of all actual values.
  • Relative Squared Error: The average of squared errors relative to the squared difference between the actual values and the average of all actual values.
  • Coefficient of Determination: Also known as the R squared value, this statistical metric indicates how well a model fits the data.

For each of the error statistics, smaller is better. A smaller value indicates that the predictions are closer to the actual values. For the coefficient of determination, the closer its value is to one (1.0), the better the predictions.
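The five statistics can be computed directly from paired actual and predicted values. The following sketch follows the definitions given above; `regression_metrics` and the three sample values are hypothetical, and the designer may differ in rounding or presentation.

```python
import math


def regression_metrics(actual, predicted):
    """Compute the five statistics described above for paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Baselines: how far the actual values sit from their own mean.
    abs_baseline = sum(abs(a - mean_actual) for a in actual)
    sq_baseline = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_errors) / sq_baseline
    return {
        "mae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(sq_errors) / n),
        "rae": sum(abs_errors) / abs_baseline,
        "rse": rse,
        "r2": 1 - rse,  # coefficient of determination
    }


metrics = regression_metrics(
    actual=[10000, 15000, 20000],
    predicted=[11000, 15000, 19000],
)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Note that the coefficient of determination is one minus the relative squared error, which is why a small error and a value close to 1.0 describe the same outcome.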

Clean up resources

Skip this section if you want to continue with part two of the tutorial, deploying models.

Important

You can use the resources that you created as prerequisites for other Azure Machine Learning tutorials and how-to articles.

Delete everything

If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the left side of the window.

    Delete resource group in the Azure portal

  2. In the list, select the resource group that you created.

  3. Select Delete resource group.

Deleting the resource group also deletes all resources that you created in the designer.

Delete individual assets

In the designer where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.

The compute target that you created here automatically scales to zero nodes when it's not being used. This action is taken to minimize charges. If you want to delete the compute target, take these steps:

Delete assets

You can unregister datasets from your workspace by selecting each dataset and selecting Unregister.

Unregister dataset

To delete a dataset, go to the storage account by using the Azure portal or Azure Storage Explorer and manually delete those assets.

Next steps

In part two, you'll learn how to deploy your model as a real-time endpoint.