Tutorial 2: Train credit risk models - Azure Machine Learning Studio (classic)

APPLIES TO: Machine Learning Studio (classic). Does not apply to: Azure Machine Learning.

In this tutorial, you take an extended look at the process of developing a predictive analytics solution. You develop a simple model in Machine Learning Studio (classic). You then deploy the model as an Azure Machine Learning web service. This deployed model can make predictions using new data. This tutorial is part two of a three-part tutorial series.

Suppose you need to predict an individual's credit risk based on the information they gave on a credit application.

Credit risk assessment is a complex problem, but this tutorial simplifies it a bit. You'll use it as an example of how to create a predictive analytics solution using Microsoft Azure Machine Learning Studio (classic) and a Machine Learning web service.

In this three-part tutorial, you start with publicly available credit risk data. You then develop and train a predictive model. Finally, you deploy the model as a web service.

In part one of the tutorial, you created a Machine Learning Studio (classic) workspace, uploaded data, and created an experiment.

In this part of the tutorial, you:

  • Train multiple models
  • Score and evaluate the models

In part three of the tutorial, you'll deploy the model as a web service.

Prerequisites

Complete part one of the tutorial.

Train multiple models

One of the benefits of using Azure Machine Learning Studio (classic) for creating machine learning models is the ability to try more than one type of model in a single experiment and compare the results. This type of experimentation helps you find the best solution for your problem.

In the experiment you're developing in this tutorial, you'll create two different types of models and then compare their scoring results to decide which algorithm to use in the final experiment.

There are various models you could choose from. To see the models available, expand the Machine Learning node in the module palette, and then expand Initialize Model and the nodes beneath it. For the purposes of this experiment, you'll select the Two-Class Support Vector Machine (SVM) and the Two-Class Boosted Decision Tree modules.

Tip

To get help deciding which Machine Learning algorithm best suits the particular problem you're trying to solve, see How to choose algorithms for Microsoft Azure Machine Learning Studio (classic).

You'll add both the Two-Class Boosted Decision Tree module and the Two-Class Support Vector Machine module to this experiment.

Two-Class Boosted Decision Tree

First, set up the boosted decision tree model.

  1. Find the Two-Class Boosted Decision Tree module in the module palette and drag it onto the canvas.

  2. Find the Train Model module, drag it onto the canvas, and then connect the output of the Two-Class Boosted Decision Tree module to the left input port of the Train Model module.

    The Two-Class Boosted Decision Tree module initializes the generic model, and Train Model uses training data to train the model.

  3. Connect the left output of the left Execute R Script module to the right input port of the Train Model module (in this tutorial, the data coming from the left side of the Split Data module is used for training).

    Tip

    You don't need two of the inputs and one of the outputs of the Execute R Script module for this experiment, so you can leave them unattached.

This portion of the experiment now looks something like this:

Figure: Training of a model

Now you need to tell the Train Model module that you want the model to predict the Credit Risk value.

  1. Select the Train Model module. In the Properties pane, click Launch column selector.

  2. In the Select a single column dialog, type "credit risk" in the search field under Available Columns, select "Credit risk" below, and click the right arrow button (>) to move "Credit risk" to Selected Columns.

    Figure: Select the Credit Risk column for the Train Model module

  3. Click the OK check mark.

Two-Class Support Vector Machine

Next, you set up the SVM model.

First, a little explanation about SVM. Boosted decision trees work well with features of any type. However, because the SVM module generates a linear classifier, the model that it generates achieves its lowest test error when all numeric features have the same scale. To convert all numeric features to the same scale, you use a "Tanh" transformation (with the Normalize Data module). This transforms the numbers into the [0,1] range. The SVM module converts string features to categorical features and then to binary 0/1 features, so you don't need to manually transform string features. Also, you don't want to transform the Credit Risk column (column 21): it's numeric, but it's the value you're training the model to predict, so you need to leave it alone.
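To make the scaling idea concrete, here is a minimal Python sketch of a tanh-based transformation. It is not the Normalize Data module's implementation: the tutorial doesn't give the exact formula, so this sketch assumes a z-score followed by a rescaled hyperbolic tangent, chosen so that the output lands between 0 and 1 as described above. The function name `tanh_scale` and the sample feature values are hypothetical.

```python
import math
from statistics import mean, pstdev


def tanh_scale(values):
    """Map numbers into (0, 1) with a rescaled hyperbolic tangent.

    Assumption: standardize first, then apply 0.5 * (tanh(z) + 1).
    The formula used by Studio (classic) may differ.
    """
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # constant column: avoid dividing by zero
    return [0.5 * (math.tanh((v - mu) / sigma) + 1.0) for v in values]


# Two hypothetical numeric features on very different scales.
credit_amount = [250.0, 1200.0, 5000.0, 18000.0]
duration_months = [6.0, 12.0, 24.0, 48.0]

scaled_amount = tanh_scale(credit_amount)
scaled_duration = tanh_scale(duration_months)

print(scaled_amount)
print(scaled_duration)
```

After scaling, both features occupy the same bounded range, so neither dominates a linear classifier simply because its raw units are larger.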

To set up the SVM model, do the following:

  1. Find the Two-Class Support Vector Machine module in the module palette and drag it onto the canvas.

  2. Right-click the Train Model module, select Copy, and then right-click the canvas and select Paste. The copy of the Train Model module has the same column selection as the original.

  3. Connect the output of the Two-Class Support Vector Machine module to the left input port of the second Train Model module.

  4. Find the Normalize Data module and drag it onto the canvas.

  5. Connect the left output of the left Execute R Script module to the input of this module (notice that the output port of a module may be connected to more than one other module).

  6. Connect the left output port of the Normalize Data module to the right input port of the second Train Model module.

This portion of the experiment should now look something like this:

Figure: Training the second model

Now configure the Normalize Data module:

  1. Click to select the Normalize Data module. In the Properties pane, select Tanh for the Transformation method parameter.

  2. Click Launch column selector, select "No columns" for Begin With, select Include in the first dropdown, select column type in the second dropdown, and select Numeric in the third dropdown. This specifies that all the numeric columns (and only the numeric columns) are transformed.

  3. Click the plus sign (+) to the right of this row, which creates another row of dropdowns. Select Exclude in the first dropdown, select column names in the second dropdown, and enter "Credit risk" in the text field. This specifies that the Credit Risk column should be ignored (you need to do this because this column is numeric and so would be transformed if you didn't exclude it).

  4. Click the OK check mark.

    Figure: Select columns for the Normalize Data module

The Normalize Data module is now set to perform a Tanh transformation on all numeric columns except for the Credit Risk column.
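The two column-selector rules can be read as a small filter: start with no columns, include every numeric column, then exclude the label by name. The Python sketch below illustrates that logic only; the function `select_columns_to_normalize`, the table layout, and all column names other than "Credit risk" are hypothetical and are not part of Studio (classic).

```python
from numbers import Number


def select_columns_to_normalize(table, label_column):
    """Return the all-numeric column names, minus the label column."""
    selected = []  # Begin With: No columns
    for name, values in table.items():
        # Rule 1: Include / column type / Numeric
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            selected.append(name)
    # Rule 2: Exclude / column names / label_column
    return [name for name in selected if name != label_column]


# A tiny hypothetical table: column name -> list of values.
table = {
    "Credit amount": [1169, 5951, 2096],
    "Purpose": ["A43", "A43", "A46"],
    "Duration in months": [6, 48, 12],
    "Credit risk": [1, 2, 1],
}

columns = select_columns_to_normalize(table, "Credit risk")
print(columns)
```

Without the exclusion rule, "Credit risk" would pass the numeric test and be rescaled along with the features, which would change the very value the model is supposed to predict.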

Score and evaluate the models

You use the testing data that was separated out by the Split Data module to score the trained models. You can then compare the results of the two models to see which generated better results.

Add the Score Model modules

  1. Find the Score Model module and drag it onto the canvas.

  2. Connect the Train Model module that's connected to the Two-Class Boosted Decision Tree module to the left input port of the Score Model module.

  3. Connect the right Execute R Script module (the testing data) to the right input port of the Score Model module.

    Figure: Score Model module connected

    The Score Model module can now take the credit information from the testing data, run it through the model, and compare the predictions the model generates with the actual credit risk column in the testing data.

  4. Copy and paste the Score Model module to create a second copy.

  5. Connect the output of the SVM model (that is, the output port of the Train Model module that's connected to the Two-Class Support Vector Machine module) to the left input port of the second Score Model module.

  6. For the SVM model, you have to do the same transformation to the test data as you did to the training data. So copy and paste the Normalize Data module to create a second copy and connect it to the right Execute R Script module.

  7. Connect the left output of the second Normalize Data module to the right input port of the second Score Model module.

    Figure: Both Score Model modules connected
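As a rough illustration of what scoring produces, the Python sketch below runs each test row through a trained model and appends the prediction next to the actual label so the two can be compared. Everything here is an assumption made for illustration: `toy_model` stands in for a trained model, the labels are coded 0/1, the threshold is 0.5, and the output column names "Scored Probabilities" and "Scored Labels" are illustrative names not taken from this tutorial.

```python
import math


def score_rows(predict_probability, rows, threshold=0.5):
    """Append a probability and a thresholded label to a copy of each row."""
    scored = []
    for row in rows:
        p = predict_probability(row)
        scored.append({
            **row,  # keep the original columns, including the actual label
            "Scored Probabilities": p,
            "Scored Labels": 1 if p >= threshold else 0,
        })
    return scored


def toy_model(row):
    """Hypothetical stand-in for a trained model: longer loans look riskier."""
    return 1.0 / (1.0 + math.exp(-(row["duration"] - 24.0) / 12.0))


# Hypothetical test rows; "credit_risk" is the actual label (0 = low, 1 = high).
test_rows = [
    {"duration": 6.0, "credit_risk": 0},
    {"duration": 48.0, "credit_risk": 1},
]

scored = score_rows(toy_model, test_rows)
for row in scored:
    print(row)
```

Because each scored row carries both the actual label and the prediction, a later evaluation step has everything it needs to count correct and incorrect predictions.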

Add the Evaluate Model module

To evaluate the two scoring results and compare them, you use an Evaluate Model module.

  1. Find the Evaluate Model module and drag it onto the canvas.

  2. Connect the output port of the Score Model module associated with the boosted decision tree model to the left input port of the Evaluate Model module.

  3. Connect the other Score Model module to the right input port.

    Figure: Evaluate Model module connected

Run the experiment and check the results

To run the experiment, click the RUN button below the canvas. It may take a few minutes. A spinning indicator on each module shows that it's running, and a green check mark appears when the module is finished. When all the modules have a check mark, the experiment has finished running.

The experiment should now look something like this:

Figure: Evaluation of both models

To check the results, click the output port of the Evaluate Model module and select Visualize.

The Evaluate Model module produces a pair of curves and metrics that allow you to compare the results of the two scored models. You can view the results as Receiver Operating Characteristic (ROC) curves, Precision/Recall curves, or Lift curves. Additional data displayed includes a confusion matrix, cumulative values for the area under the curve (AUC), and other metrics. You can change the threshold value by moving the slider left or right and see how it affects the set of metrics.
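The relationship between the threshold, the confusion matrix, and AUC can be sketched in a few lines of Python. This is an illustration of the general definitions, not the Evaluate Model module's code; the labels (coded 0/1) and scores are made-up sample data, and the function names are hypothetical.

```python
def confusion_matrix(y_true, scores, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["tn" if actual == 0 else "fn"] += 1
    return counts


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example scores higher than a random negative (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up actual labels and model scores for six test rows.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

at_half = confusion_matrix(y_true, scores, threshold=0.5)
at_lower = confusion_matrix(y_true, scores, threshold=0.35)
area = auc(y_true, scores)

print(at_half)   # {'tp': 2, 'fp': 1, 'tn': 2, 'fn': 1}
print(at_lower)  # {'tp': 3, 'fp': 1, 'tn': 2, 'fn': 0}
print(area)
```

Lowering the threshold from 0.5 to 0.35 turns one false negative into a true positive in this sample, which is the kind of shift the slider shows. AUC does not depend on the threshold, so it summarizes the whole curve in one number.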

To the right of the graph, click "Scored dataset" or "Scored dataset to compare" to highlight the associated curve and to display the associated metrics below. In the legend for the curves, "Scored dataset" corresponds to the left input port of the Evaluate Model module, which in this case is the boosted decision tree model. "Scored dataset to compare" corresponds to the right input port, which in this case is the SVM model. When you click one of these labels, the curve for that model is highlighted and the corresponding metrics are displayed, as shown in the following graphic.

Figure: ROC curves for models

By examining these values, you can decide which model is closest to giving you the results you're looking for. You can go back and iterate on your experiment by changing parameter values in the different models.

The science and art of interpreting these results and tuning the model performance is outside the scope of this tutorial. For additional help, you might read the following articles:

Tip

Each time you run the experiment, a record of that iteration is kept in the Run History. You can view these iterations, and return to any of them, by clicking VIEW RUN HISTORY below the canvas. You can also click Prior Run in the Properties pane to return to the iteration immediately preceding the one you have open.

You can make a copy of any iteration of your experiment by clicking SAVE AS below the canvas. Use the experiment's Summary and Description properties to keep a record of what you've tried in your experiment iterations.

For more information, see Manage experiment iterations in Azure Machine Learning Studio (classic).

Clean up resources

If you no longer need the resources you created using this article, delete them to avoid incurring any charges. Learn how in the article Export and delete in-product user data.

Next steps

In this tutorial, you completed these steps:

  • Create an experiment
  • Train multiple models
  • Score and evaluate the models

You're now ready to deploy models for this data.