Quickstart: Create your first data science experiment in Azure Machine Learning Studio (classic)

Note

The Notebooks (preview) feature in Studio (classic) will be shut down on April 13, 2020. After April 13, the Notebooks tab will be removed along with any saved notebooks.

Tip

Customers currently using or evaluating Machine Learning Studio (classic) are encouraged to try Azure Machine Learning designer (preview), which provides drag-and-drop ML modules plus scalability, version control, and enterprise security.

In this quickstart, you create a machine learning experiment in Azure Machine Learning Studio (classic) that predicts the price of a car based on different variables such as make and technical specifications.

This quickstart follows the default workflow for an experiment:

  1. Create a model
  2. Train the model
  3. Score and test the model

Get the data

The first thing you need in machine learning is data. Studio (classic) includes several sample datasets that you can use, or you can import data from many sources. For this example, we'll use the sample dataset Automobile price data (Raw), which is included in your workspace. This dataset includes entries for various individual automobiles, including information such as make, model, technical specifications, and price.

Tip

You can find a working copy of the following experiment in the Azure AI Gallery. Go to Your first data science experiment - Automobile price prediction and click Open in Studio to download a copy of the experiment into your Machine Learning Studio (classic) workspace.

Here's how to get the dataset into your experiment.

  1. Create a new experiment by clicking +NEW at the bottom of the Machine Learning Studio (classic) window. Select EXPERIMENT > Blank Experiment.

  2. The experiment is given a default name that you can see at the top of the canvas. Select this text and rename it to something meaningful, for example, Automobile price prediction. The name doesn't need to be unique.

    [Figure: Rename the experiment]

  3. To the left of the experiment canvas is a palette of datasets and modules. Type automobile in the Search box at the top of this palette to find the dataset labeled Automobile price data (Raw). Drag this dataset to the experiment canvas.

    [Figure: Find the automobile dataset and drag it onto the experiment canvas]

To see what this data looks like, click the output port at the bottom of the automobile dataset, and then select Visualize.

[Figure: Click the output port and select Visualize]

Tip

Datasets and modules have input and output ports represented by small circles: input ports at the top, output ports at the bottom. To create a flow of data through your experiment, you'll connect an output port of one module to an input port of another. At any time, you can click the output port of a dataset or module to see what the data looks like at that point in the data flow.

In this dataset, each row represents an automobile, and the variables associated with each automobile appear as columns. We'll predict the price in the far-right column (column 26, titled "price") using the variables for a specific automobile.

[Figure: View the automobile data in the data visualization window]

Close the visualization window by clicking the "x" in the upper-right corner.

Prepare the data

A dataset usually requires some preprocessing before it can be analyzed. You might have noticed the missing values present in the columns of various rows. These missing values need to be cleaned so the model can analyze the data correctly. We'll remove any rows that have missing values. Also, the normalized-losses column has a large proportion of missing values, so we'll exclude that column from the model altogether.

Tip

Cleaning the missing values from input data is a prerequisite for using most of the modules.

First, we add a module that removes the normalized-losses column completely. Then we add another module that removes any row that has missing data.

  1. Type select columns in the search box at the top of the module palette to find the Select Columns in Dataset module. Then drag it to the experiment canvas. This module allows us to select which columns of data we want to include in or exclude from the model.

  2. Connect the output port of the Automobile price data (Raw) dataset to the input port of the Select Columns in Dataset module.

    [Figure: Add the Select Columns in Dataset module to the experiment canvas and connect it]

  3. Click the Select Columns in Dataset module and click Launch column selector in the Properties pane.

    • On the left, click With rules.

    • Under Begin With, click All columns. These rules direct Select Columns in Dataset to pass through all the columns (except those columns we're about to exclude).

    • From the drop-downs, select Exclude and column names, and then click inside the text box. A list of columns is displayed. Select normalized-losses, and it's added to the text box.

    • Click the check mark (OK) button on the lower right to close the column selector.

      [Figure: Launch the column selector and exclude the normalized-losses column]

      Now the properties pane for Select Columns in Dataset indicates that it will pass through all columns from the dataset except normalized-losses.

      [Figure: The properties pane shows that the normalized-losses column is excluded]

      Tip

      You can add a comment to a module by double-clicking the module and entering text. This can help you see at a glance what the module is doing in your experiment. In this case, double-click the Select Columns in Dataset module and type the comment "Exclude normalized losses."

      [Figure: Double-click a module to add a comment]

  4. Drag the Clean Missing Data module to the experiment canvas and connect it to the Select Columns in Dataset module. In the Properties pane, select Remove entire row under Cleaning mode. These options direct Clean Missing Data to clean the data by removing rows that have any missing values. Double-click the module and type the comment "Remove missing value rows."

    [Figure: Set the cleaning mode of the Clean Missing Data module to Remove entire row]

  5. Run the experiment by clicking RUN at the bottom of the page.

    When the experiment has finished running, all the modules have a green check mark to indicate that they finished successfully. Notice also the Finished running status in the upper-right corner.

    [Figure: The experiment after the run]

Tip

Why did we run the experiment now? By running the experiment, the column definitions for our data pass from the dataset, through the Select Columns in Dataset module, and through the Clean Missing Data module. This means that any modules we connect to Clean Missing Data will also have this same information.

Now we have clean data. If you want to view the cleaned dataset, click the left output port of the Clean Missing Data module and select Visualize. Notice that the normalized-losses column is no longer included, and there are no missing values.
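For readers who think in code, the two preparation steps can be sketched in plain Python. This is only an illustration of the logic, not how Studio (classic) implements its modules: the helper names `exclude_columns` and `remove_rows_with_missing` are hypothetical, the rows are made-up values, and `None` stands in for a missing value.

```python
# Made-up rows that mimic a few columns of the automobile dataset.
# None marks a missing value.
rows = [
    {"make": "audi", "normalized-losses": 164, "horsepower": 102, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "horsepower": 101, "price": 16430},
    {"make": "mazda", "normalized-losses": 115, "horsepower": None, "price": 6795},
    {"make": "volvo", "normalized-losses": None, "horsepower": 114, "price": None},
]


def exclude_columns(rows, excluded):
    """Begin with all columns and pass through every one not excluded."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


def remove_rows_with_missing(rows):
    """Keep only the rows in which no value is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Step 1: drop the sparse column. Step 2: drop rows that still have gaps.
without_losses = exclude_columns(rows, {"normalized-losses"})
cleaned = remove_rows_with_missing(without_losses)

for row in cleaned:
    print(row)
```

The order matters in this sketch: because the sparse column is dropped first, the bmw row survives the row filter even though its normalized-losses value was missing.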

Now that the data is clean, we're ready to specify what features we're going to use in the predictive model.

Define features

In machine learning, features are individual measurable properties of something you're interested in. In our dataset, each row represents one automobile, and each column is a feature of that automobile.

Finding a good set of features for creating a predictive model requires experimentation and knowledge about the problem you want to solve. Some features are better for predicting the target than others. Some features have a strong correlation with other features and can be removed. For example, city-mpg and highway-mpg are closely related, so we can keep one and remove the other without significantly affecting the prediction.
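One common way to check whether two numeric features are closely related is the Pearson correlation coefficient. The sketch below computes it by hand; the `pearson` helper is hypothetical and the mpg values are made up for illustration, not taken from the sample dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up fuel economy values for six cars.
city_mpg = [21, 24, 19, 31, 27, 17]
highway_mpg = [27, 30, 25, 38, 33, 22]

r = pearson(city_mpg, highway_mpg)
print(round(r, 3))
```

A value near 1 means the two columns carry largely the same information, which is the situation in which dropping one of them is reasonable.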

Let's build a model that uses a subset of the features in our dataset. You can come back later and select different features, run the experiment again, and see if you get better results. But to start, let's try the following features:

make, body-style, wheel-base, engine-size, horsepower, peak-rpm, highway-mpg, price

  1. Drag another Select Columns in Dataset module to the experiment canvas. Connect the left output port of the Clean Missing Data module to the input of the Select Columns in Dataset module.

    [Figure: Connect the Select Columns in Dataset module to the Clean Missing Data module]

  2. Double-click the module and type "Select features for prediction."

  3. Click Launch column selector in the Properties pane.

  4. Click With rules.

  5. Under Begin With, click No columns. In the filter row, select Include and column names, and then select our list of column names in the text box. This filter directs the module to not pass through any columns (features) except the ones that we specify.

  6. Click the check mark (OK) button.

    [Figure: Select the columns (features) to include in the prediction]

This module produces a filtered dataset containing only the features we want to pass to the learning algorithm we'll use in the next step. Later, you can return and try again with a different selection of features.
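The "begin with no columns, then include" rule amounts to a projection onto a fixed list of names. A minimal sketch follows, assuming each row is a dictionary; the `include_columns` helper and the sample row values are hypothetical.

```python
# The feature list chosen above, including the price column to predict.
FEATURES = ["make", "body-style", "wheel-base", "engine-size",
            "horsepower", "peak-rpm", "highway-mpg", "price"]


def include_columns(rows, included):
    """Start with no columns and pass through only the listed ones."""
    return [{name: row[name] for name in included} for row in rows]


# One made-up row with two extra columns that should be filtered out.
row = {
    "make": "audi", "fuel-type": "gas", "body-style": "sedan",
    "wheel-base": 99.8, "engine-size": 109, "horsepower": 102,
    "peak-rpm": 5500, "city-mpg": 24, "highway-mpg": 30, "price": 13950,
}

selected = include_columns([row], FEATURES)
print(selected[0])
```

Keeping the list in one place, as `FEATURES` does here, makes it easy to return later and try a different selection.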

Choose and apply an algorithm

Now that the data is ready, constructing a predictive model consists of training and testing. We'll use our data to train the model, and then we'll test the model to see how closely it's able to predict prices.

Classification and regression are two types of supervised machine learning algorithms. Classification predicts an answer from a defined set of categories, such as a color (red, blue, or green). Regression is used to predict a number.

Because we want to predict price, which is a number, we'll use a regression algorithm. For this example, we'll use a linear regression model.

We train the model by giving it a set of data that includes the price. The model scans the data and looks for correlations between an automobile's features and its price. Then we'll test the model: we'll give it a set of features for automobiles we're familiar with and see how close the model comes to predicting the known price.

We'll use our data for both training the model and testing it by splitting the data into separate training and testing datasets.
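As a rough sketch of what a seeded random split does, the following plain-Python function shuffles the rows and cuts them at a given fraction. The `split_rows` helper and its `fraction` and `seed` parameters are illustrative names; the Split Data module may shuffle and round differently.

```python
import random


def split_rows(rows, fraction=0.75, seed=0):
    """Shuffle a copy of the rows, then cut it into two parts.

    The first part holds `fraction` of the rows (for training);
    the second part holds the remainder (for testing).
    """
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(20))  # stand-ins for 20 dataset rows
train, test = split_rows(rows, fraction=0.75, seed=42)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same two parts, while a different seed yields a different random sample of the same sizes.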

  1. Select and drag the Split Data module to the experiment canvas and connect it to the last Select Columns in Dataset module.

  2. Click the Split Data module to select it. Find the Fraction of rows in the first output dataset (in the Properties pane to the right of the canvas) and set it to 0.75. This way, we'll use 75 percent of the data to train the model, and hold back 25 percent for testing.

    [Figure: Set the split fraction of the Split Data module to 0.75]

    Tip

    By changing the Random seed parameter, you can produce different random samples for training and testing. This parameter controls the seeding of the pseudo-random number generator.

  3. Run the experiment. When the experiment is run, the Select Columns in Dataset and Split Data modules pass column definitions to the modules we'll be adding next.

  4. To select the learning algorithm, expand the Machine Learning category in the module palette to the left of the canvas, and then expand Initialize Model. This displays several categories of modules that can be used to initialize machine learning algorithms. For this experiment, select the Linear Regression module under the Regression category, and drag it to the experiment canvas. (You can also find the module by typing "linear regression" in the palette Search box.)

  5. Find and drag the Train Model module to the experiment canvas. Connect the output of the Linear Regression module to the left input of the Train Model module, and connect the training data output (left port) of the Split Data module to the right input of the Train Model module.

    [Figure: Connect the Train Model module to the Linear Regression and Split Data modules]

  6. Click the Train Model module, click Launch column selector in the Properties pane, and then select the price column. Price is the value that our model is going to predict.

    You select the price column in the column selector by moving it from the Available columns list to the Selected columns list.

    [Figure: Select the price column for the Train Model module]

  7. Run the experiment.

We now have a trained regression model that can be used to score new automobile data to make price predictions.

[Figure: The experiment after this run]
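To give a feel for what training a linear regression involves, here is a deliberately simplified sketch that fits a straight line to a single feature using ordinary least squares. It is not the Linear Regression module's implementation, which works with all the selected features. The `fit_line` and `predict` helpers are hypothetical, and the engine-size and price values are made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Score one value with a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data: engine size as the feature, price as the label.
engine_size = [90, 110, 130, 150, 180]
price = [6500, 9000, 12500, 15000, 20500]

model = fit_line(engine_size, price)
print(round(predict(model, 120)))
```

The fitted pair plays the role of the trained model: once it exists, it can be applied to feature values it has not seen before.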

Predict new automobile prices

Now that we've trained the model using 75 percent of our data, we can use it to score the other 25 percent of the data to see how well our model functions.

  1. Find and drag the Score Model module to the experiment canvas. Connect the output of the Train Model module to the left input port of Score Model. Connect the test data output (right port) of the Split Data module to the right input port of Score Model.

    [Figure: Connect the Score Model module to the Train Model and Split Data modules]

  2. Run the experiment and view the output from the Score Model module by clicking the output port of Score Model and selecting Visualize. The output shows the predicted values for price and the known values from the test data.

    [Figure: Output of the Score Model module]

  3. Finally, we test the quality of the results. Select and drag the Evaluate Model module to the experiment canvas, and connect the output of the Score Model module to the left input of Evaluate Model. The final experiment should look something like this:

    [Figure: The final experiment]

  4. Run the experiment.

To view the output from the Evaluate Model module, click the output port, and then select Visualize.

[Figure: Evaluation results for the experiment]

The following statistics are shown for our model:

  • Mean Absolute Error (MAE): The average of absolute errors (an error is the difference between the predicted value and the actual value).
  • Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made on the test dataset.
  • Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual values and the average of all actual values.
  • Relative Squared Error: The average of squared errors relative to the squared difference between the actual values and the average of all actual values.
  • Coefficient of Determination: Also known as the R squared value, this is a statistical metric indicating how well a model fits the data.

For each of the error statistics, smaller is better. A smaller value indicates that the predictions more closely match the actual values. For Coefficient of Determination, the closer its value is to one (1.0), the better the predictions.
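The five statistics can be computed by hand from a list of actual and predicted values. The sketch below uses the standard textbook definitions, on the assumption that Evaluate Model follows them; the `evaluate` helper and the price values are made up for illustration.

```python
from math import sqrt


def evaluate(actual, predicted):
    """Compute common regression metrics from paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Relative errors compare the model against always predicting the mean.
    rae = sum(abs_err) / sum(abs(a - mean_actual) for a in actual)
    rse = sum(sq_err) / sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs_err) / n,
        "rmse": sqrt(sum(sq_err) / n),
        "rae": rae,
        "rse": rse,
        "r2": 1 - rse,  # coefficient of determination
    }


# Made-up known prices and model predictions for four cars.
actual = [10000, 12000, 15000, 20000]
predicted = [11000, 11500, 16000, 19000]

metrics = evaluate(actual, predicted)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Under these definitions, the coefficient of determination is one minus the relative squared error, which is why a small error and a value close to 1.0 go together.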

Clean up resources

If you no longer need the resources you created using this article, delete them to avoid incurring any charges. Learn how in the article Export and delete in-product user data.

Next steps

In this quickstart, you created a simple experiment using a sample dataset. To explore the process of creating and deploying a model in more depth, continue to the predictive solution tutorial.