Tutorial: Predict automobile price with the designer

In this two-part tutorial, you learn how to use the Azure Machine Learning designer to train and deploy a machine learning model that predicts the price of any car. The designer is a drag-and-drop tool that lets you create machine learning models without a single line of code.

In part one of the tutorial, you'll learn how to:

  • Create a new pipeline.
  • Import data.
  • Prepare data.
  • Train a machine learning model.
  • Evaluate a machine learning model.

In part two of the tutorial, you'll deploy your model as a real-time inferencing endpoint to predict the price of any car based on technical specifications you send it.


A completed version of this tutorial is available as a sample pipeline.

To find it, go to the designer in your workspace. In the New pipeline section, select Sample 1 - Regression: Automobile Price Prediction (Basic).


If you don't see graphical elements mentioned in this document, such as buttons in studio or designer, you might not have the right level of permissions to the workspace. Contact your Azure subscription administrator to verify that you have been granted the correct level of access. For more information, see Manage users and roles.

Create a new pipeline

Azure Machine Learning pipelines organize multiple machine learning and data processing steps into a single resource. Pipelines let you organize, manage, and reuse complex machine learning workflows across projects and users.

To create an Azure Machine Learning pipeline, you need an Azure Machine Learning workspace. In this section, you learn how to create both these resources.

Create a new workspace

You need an Azure Machine Learning workspace to use the designer. The workspace is the top-level resource for Azure Machine Learning. It provides a centralized place to work with all the artifacts you create in Azure Machine Learning. For instructions on creating a workspace, see Create and manage Azure Machine Learning workspaces.

If your workspace uses a virtual network, you must complete additional configuration steps before you can use the designer. For more information, see Use Azure Machine Learning studio in an Azure virtual network.

Create the pipeline

  1. Sign in to studio.ml.azure.cn, and select the workspace you want to work with.

  2. Select Designer.

  3. Select Easy-to-use prebuilt modules.

  4. At the top of the canvas, select the default pipeline name Pipeline-Created-on. Rename it to Automobile price prediction. The name doesn't need to be unique.

Set the default compute target

A pipeline runs on a compute target, which is a compute resource that's attached to your workspace. After you create a compute target, you can reuse it for future runs.

You can set a Default compute target for the entire pipeline, which tells every module to use the same compute target by default. You can still override it by specifying a compute target on a per-module basis.

  1. Next to the pipeline name, select the Gear icon at the top of the canvas to open the Settings pane.

  2. In the Settings pane to the right of the canvas, select Select compute target.

    If you already have an available compute target, you can select it to run this pipeline.

    The designer can run training experiments only on Azure Machine Learning Compute, so other compute targets aren't shown.

  3. Enter a name for the compute resource.

  4. Select Save.

    It takes approximately five minutes to create a compute resource. After the resource is created, you can reuse it and skip this wait time for future runs.

    The compute resource autoscales to zero nodes when it's idle to save cost. When you use it again after a delay, you might experience approximately five minutes of wait time while it scales back up.

Import data

There are several sample datasets included in the designer for you to experiment with. For this tutorial, use Automobile price data (Raw).

  1. To the left of the pipeline canvas is a palette of datasets and modules. Select Sample datasets to view the available sample datasets.

  2. Select the dataset Automobile price data (Raw), and drag it onto the canvas.


Visualize the data

You can visualize the data to understand the dataset that you'll use.

  1. Right-click the Automobile price data (Raw) dataset and select Visualize > Dataset output.

  2. Select the different columns in the data window to view information about each one.

    Each row represents an automobile, and the variables associated with each automobile appear as columns. There are 205 rows and 26 columns in this dataset.

Prepare data

Datasets typically require some preprocessing before analysis. You might have noticed some missing values when you inspected the dataset. These missing values must be cleaned so that the model can analyze the data correctly.

Remove a column

When you train a model, you have to do something about the data that's missing. In this dataset, the normalized-losses column is missing many values, so you'll exclude that column from the model altogether.

  1. In the module palette to the left of the canvas, expand the Data Transformation section and find the Select Columns in Dataset module.

  2. Drag the Select Columns in Dataset module onto the canvas. Drop the module below the dataset module.

  3. Connect the Automobile price data (Raw) dataset to the Select Columns in Dataset module. Drag from the dataset's output port, which is the small circle at the bottom of the dataset on the canvas, to the input port of Select Columns in Dataset, which is the small circle at the top of the module.

    You create a flow of data through your pipeline when you connect the output port of one module to an input port of another.

  4. Select the Select Columns in Dataset module.

  5. In the module details pane to the right of the canvas, select Edit column.

  6. Expand the Column names drop-down list next to Include, and select All columns.

  7. Select the + to add a new rule.

  8. From the drop-down menus, select Exclude and Column names.

  9. Enter normalized-losses in the text box.

  10. In the lower right, select Save to close the column selector.

  11. Select the Select Columns in Dataset module.

  12. In the module details pane to the right of the canvas, select the Comment text box and enter Exclude normalized losses.

    Comments appear on the graph to help you organize your pipeline.
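The designer applies this rule without any code, but the combination of "include all columns" and "exclude normalized-losses" can be pictured as a simple filter. The following Python sketch is only an analogy, not the designer's implementation. The sample rows and the select_columns helper are hypothetical.

```python
# Toy rows standing in for the automobile dataset; the values are illustrative only.
rows = [
    {"make": "audi", "normalized-losses": 164, "horsepower": 102, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "horsepower": 101, "price": 16430},
]


def select_columns(rows, exclude):
    """Start from all columns, then drop the ones named in `exclude`."""
    excluded = set(exclude)
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


selected = select_columns(rows, exclude=["normalized-losses"])
print(list(selected[0].keys()))
```

Every other column passes through unchanged, which is why the rule starts from All columns and then adds a single exclusion.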

Clean missing data

Your dataset still has missing values after you remove the normalized-losses column. You can remove the remaining missing data by using the Clean Missing Data module.

Cleaning the missing values from input data is a prerequisite for using most of the modules in the designer.

  1. In the module palette to the left of the canvas, expand the Data Transformation section, and find the Clean Missing Data module.

  2. Drag the Clean Missing Data module to the pipeline canvas. Connect it to the Select Columns in Dataset module.

  3. Select the Clean Missing Data module.

  4. In the module details pane to the right of the canvas, select Edit Column.

  5. In the Columns to be cleaned window that appears, expand the drop-down menu next to Include. Select All columns.

  6. Select Save.

  7. In the module details pane to the right of the canvas, select Remove entire row under Cleaning mode.

  8. In the module details pane to the right of the canvas, select the Comment box, and enter Remove missing value rows.

    Your pipeline should now contain three connected modules: Automobile price data (Raw), Select Columns in Dataset, and Clean Missing Data.
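As a rough picture of what the Remove entire row cleaning mode does, the sketch below drops every row that has at least one missing value. It's a plain-Python analogy under the assumption that a missing value is represented as None. The rows and the remove_rows_with_missing helper are hypothetical and don't come from the designer.

```python
def remove_rows_with_missing(rows):
    """Keep only the rows in which no value is missing (None in this sketch)."""
    return [row for row in rows if all(value is not None for value in row.values())]


# Illustrative rows: the second and third each have one missing value.
rows = [
    {"make": "audi", "horsepower": 102, "price": 13950},
    {"make": "mazda", "horsepower": None, "price": 10945},
    {"make": "bmw", "horsepower": 101, "price": None},
]

cleaned = remove_rows_with_missing(rows)
print(len(cleaned), "of", len(rows), "rows kept")
```

Because whole rows are discarded, this mode trades dataset size for completeness, which is one reason the mostly empty normalized-losses column is removed first.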


Train a machine learning model

Now that you have the modules in place to process the data, you can set up the training modules.

Because you want to predict price, which is a number, you can use a regression algorithm. For this example, you use a linear regression model.

Split the data

Splitting data is a common task in machine learning. You'll split your data into two separate datasets. One dataset trains the model and the other tests how well the model performed.

  1. In the module palette, expand the Data Transformation section and find the Split Data module.

  2. Drag the Split Data module to the pipeline canvas.

  3. Connect the left port of the Clean Missing Data module to the Split Data module.

    Be sure that the left output port of Clean Missing Data connects to Split Data. The left port contains the cleaned data. The right port contains the discarded data.

  4. Select the Split Data module.

  5. In the module details pane to the right of the canvas, set the Fraction of rows in the first output dataset to 0.7.

    This option uses 70 percent of the data to train the model and holds back 30 percent for testing it. The 70 percent dataset is accessible through the left output port. The remaining data is available through the right output port.

  6. In the module details pane to the right of the canvas, select the Comment box, and enter Split the dataset into training set (0.7) and test set (0.3).
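The effect of setting the fraction to 0.7 can be sketched in a few lines of Python. This is an illustration only: the split_rows helper is hypothetical, and the shuffling with a fixed seed is an assumption made for the example rather than a statement about how the Split Data module orders rows.

```python
import random


def split_rows(rows, fraction=0.7, seed=0):
    """Shuffle a copy of `rows`, then split it into (first, second) by `fraction`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-ins for ten cleaned dataset rows
train, test = split_rows(rows, fraction=0.7)
print(len(train), len(test))
```

The first returned list plays the role of the left output port (training data) and the second plays the role of the right output port (test data).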

Train the model

Train the model by giving it a dataset that includes the price. The algorithm constructs a model that explains the relationship between the features and the price as presented by the training data.

  1. In the module palette, expand Machine Learning Algorithms.

    This option displays several categories of modules that you can use to initialize learning algorithms.

  2. Select Regression > Linear Regression, and drag it to the pipeline canvas.

  3. In the module palette, expand the Model Training section, and drag the Train Model module to the canvas.

  4. Connect the output of the Linear Regression module to the left input of the Train Model module.

  5. Connect the training data output (left port) of the Split Data module to the right input of the Train Model module.

    Be sure that the left output port of Split Data connects to Train Model. The left port contains the training set. The right port contains the test set.

  6. Select the Train Model module.

  7. In the module details pane to the right of the canvas, select Edit column selector.

  8. In the Label column dialog box, expand the drop-down menu and select Column names.

  9. In the text box, enter price to specify the value that your model is going to predict.

    Make sure you enter the column name exactly. Don't capitalize price.

    Your pipeline should now have Linear Regression and the left output of Split Data both connected to Train Model.
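Conceptually, Train Model fits the untrained algorithm to the training rows and uses the price column as the label. The sketch below fits ordinary least squares with a single feature in plain Python. It's a simplified illustration, not the designer's implementation: the pipeline trains on all remaining columns, whereas this example uses one hypothetical feature, and the rows and helper names are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training rows: one feature plus the label column "price".
train = [
    {"horsepower": 100, "price": 12000},
    {"horsepower": 150, "price": 17000},
    {"horsepower": 200, "price": 22000},
]

slope, intercept = fit_line(
    [row["horsepower"] for row in train],
    [row["price"] for row in train],
)


def predict(horsepower):
    """Apply the fitted line to a new feature value."""
    return slope * horsepower + intercept


print(predict(120))
```

The label column name is matched exactly, which mirrors the instruction to enter price in lowercase.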


Add the Score Model module

After you train your model by using 70 percent of the data, you can use it to score the other 30 percent to see how well your model functions.

  1. Enter score model in the search box to find the Score Model module. Drag the module to the pipeline canvas.

  2. Connect the output of the Train Model module to the left input port of Score Model. Connect the test data output (right port) of the Split Data module to the right input port of Score Model.

Add the Evaluate Model module

Use the Evaluate Model module to evaluate how well your model scored the test dataset.

  1. Enter evaluate in the search box to find the Evaluate Model module. Drag the module to the pipeline canvas.

  2. Connect the output of the Score Model module to the left input of Evaluate Model.

    The final pipeline now runs from the dataset through Select Columns in Dataset, Clean Missing Data, Split Data, Train Model, and Score Model to Evaluate Model.


Submit the pipeline

Now that your pipeline is all set up, you can submit a pipeline run to train your machine learning model. You can submit a valid pipeline run at any point, which lets you review changes to your pipeline during development.

  1. At the top of the canvas, select Submit.

  2. In the Set up pipeline run dialog box, select Create new.

    Experiments group similar pipeline runs together. If you run a pipeline multiple times, you can select the same experiment for successive runs.

    1. For New experiment Name, enter Tutorial-CarPrices.

    2. Select Submit.

    You can view run status and details at the top right of the canvas.

    If this is the first run, it may take up to 20 minutes for your pipeline to finish running. The default compute settings have a minimum node count of 0, which means that the designer must allocate resources after the compute has been idle. Repeated pipeline runs take less time because the compute resources are already allocated. Additionally, the designer uses cached results for each module to further improve efficiency.

View scored labels

After the run completes, you can view the results of the pipeline run. First, look at the predictions generated by the regression model.

  1. Right-click the Score Model module, and select Visualize > Scored dataset to view its output.

    Here you can see the predicted prices and the actual prices from the testing data.


Evaluate models

Use the Evaluate Model module to see how well the trained model performed on the test dataset.

  1. Right-click the Evaluate Model module and select Visualize > Evaluation results to view its output.

The following statistics are shown for your model:

  • Mean Absolute Error (MAE): The average of absolute errors. An error is the difference between the predicted value and the actual value.
  • Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made on the test dataset.
  • Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual values and the average of all actual values.
  • Relative Squared Error: The average of squared errors relative to the squared difference between the actual values and the average of all actual values.
  • Coefficient of Determination: Also known as the R squared value, this statistical metric indicates how well a model fits the data.

For each of the error statistics, smaller is better. A smaller value indicates that the predictions are closer to the actual values. For the coefficient of determination, the closer its value is to one (1.0), the better the predictions.
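To make these definitions concrete, the following Python sketch computes all five statistics for a tiny set of actual and predicted prices. It follows the textbook formulas that match the descriptions above. The numbers are made up, the regression_metrics helper is hypothetical, and computing R squared as 1 minus the relative squared error is the standard definition, assumed here rather than confirmed for the designer.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, RAE, RSE, and R squared for paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Deviations of the actual values from their own mean (the baseline).
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_errors) / sq_dev
    return {
        "mae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(sq_errors) / n),
        "rae": sum(abs_errors) / abs_dev,
        "rse": rse,
        "r2": 1 - rse,
    }


# Made-up actual and predicted prices for three cars.
actual = [10000, 20000, 30000]
predicted = [11000, 19000, 32000]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The two relative errors compare the model against a baseline that always predicts the mean actual price, so values well below 1 indicate that the model beats that baseline.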

Clean up resources

Skip this section if you want to continue with part 2 of the tutorial, deploying models.

You can use the resources that you created as prerequisites for other Azure Machine Learning tutorials and how-to articles.

Delete everything

If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the left side of the window.


  2. In the list, select the resource group that you created.

  3. Select Delete resource group.

Deleting the resource group also deletes all resources that you created in the designer.

Delete individual assets

In the designer where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.

The compute target that you created here automatically scales down to zero nodes when it's not being used, which minimizes charges. If you want to delete the compute target, take these steps:

You can unregister datasets from your workspace by selecting each dataset and selecting Unregister.

To delete a dataset, go to the storage account by using the Azure portal or Azure Storage Explorer and manually delete those assets.

Next steps

In part two, you'll learn how to deploy your model as a real-time endpoint.