Tutorial 1: Predict credit risk - Azure Machine Learning Studio (classic)

Note

The Notebooks (preview) feature in Studio (classic) will be shut down on April 13, 2020. After that date, the Notebooks tab will be removed along with any saved notebooks.

Tip

Customers currently using or evaluating Machine Learning Studio (classic) are encouraged to try Azure Machine Learning designer (preview), which provides drag-and-drop ML modules plus scalability, version control, and enterprise security.

In this tutorial, you take an extended look at the process of developing a predictive analytics solution. You develop a simple model in Machine Learning Studio (classic). You then deploy the model as an Azure Machine Learning web service. The deployed model can make predictions using new data. This tutorial is part one of a three-part tutorial series.

Suppose you need to predict an individual's credit risk based on the information they gave on a credit application.

Credit risk assessment is a complex problem, but this tutorial simplifies it a bit. You'll use it as an example of how to create a predictive analytics solution using Microsoft Azure Machine Learning Studio (classic) and a Machine Learning web service.

In this three-part tutorial, you start with publicly available credit risk data. You then develop and train a predictive model. Finally, you deploy the model as a web service.

In this part of the tutorial you:

  • Create a Machine Learning Studio (classic) workspace
  • Upload existing data
  • Create an experiment

You can then use this experiment to train models in part 2 and deploy them in part 3.

Prerequisites

This tutorial assumes that you've used Machine Learning Studio (classic) at least once before, and that you have some understanding of machine learning concepts. It doesn't assume you're an expert in either.

If you've never used Azure Machine Learning Studio (classic) before, you might want to start with the quickstart, Create your first data science experiment in Azure Machine Learning Studio (classic). The quickstart takes you through Machine Learning Studio (classic) for the first time. It shows you the basics: how to drag and drop modules onto your experiment, connect them together, run the experiment, and look at the results.

Tip

You can find a working copy of the experiment that you develop in this tutorial in the Azure AI Gallery. Go to Tutorial - Predict credit risk and click Open in Studio to download a copy of the experiment into your Machine Learning Studio (classic) workspace.

Create a Machine Learning Studio (classic) workspace

To use Machine Learning Studio (classic), you need a Microsoft Azure Machine Learning Studio (classic) workspace. This workspace contains the tools you need to create, manage, and publish experiments.

To create a workspace, see Create and share an Azure Machine Learning Studio (classic) workspace.

After your workspace is created, open Machine Learning Studio (classic) (https://studio.ml.azure.cn/). If you have more than one workspace, you can select the workspace in the toolbar in the upper-right corner of the window.

[Figure: Select workspace in Studio (classic)]

Tip

If you are the owner of the workspace, you can share the experiments you're working on by inviting others to the workspace. You can do this in Machine Learning Studio (classic) on the SETTINGS page. You just need the Microsoft account or organizational account for each user.

On the SETTINGS page, click USERS, then click INVITE MORE USERS at the bottom of the window.

Upload existing data

To develop a predictive model for credit risk, you need data that you can use to train and then test the model. For this tutorial, you'll use the "UCI Statlog (German Credit Data) Data Set" from the UC Irvine Machine Learning repository. You can find it here:
https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

You'll use the file named german.data. Download this file to your local hard drive.

The german.data dataset contains rows of 20 variables for 1000 past applicants for credit. These 20 variables represent the dataset's set of features (the feature vector), which provides identifying characteristics for each credit applicant. An additional column in each row represents the applicant's calculated credit risk, with 700 applicants identified as a low credit risk and 300 as a high risk.

The UCI website provides a description of the attributes of the feature vector for this data. This data includes financial information, credit history, employment status, and personal information. For each applicant, a binary rating has been given indicating whether they are a low or high credit risk.

You'll use this data to train a predictive analytics model. When you're done, your model should be able to accept a feature vector for a new individual and predict whether they are a low or high credit risk.

Here's an interesting twist.

The description of the dataset on the UCI website mentions what it costs if you misclassify a person's credit risk. If the model predicts a high credit risk for someone who is actually a low credit risk, the model has made a misclassification.

But the reverse misclassification, predicting a low credit risk for someone who is actually a high credit risk, is five times more costly to the financial institution.

So, you want to train your model so that the cost of this latter type of misclassification is five times higher than misclassifying the other way.

One simple way to do this when training the model in your experiment is by duplicating (five times) those entries that represent someone with a high credit risk.

Then, if the model misclassifies someone as a low credit risk when they're actually a high risk, the model makes that same misclassification five times, once for each duplicate. This increases the cost of this error in the training results.
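The equivalence between duplicating rows and weighting errors can be checked with a small sketch. It is not part of the Studio (classic) experiment; the labels "low" and "high", the function names, and the four sample predictions are all hypothetical and chosen only for illustration.

```python
# Hypothetical costs taken from the 5:1 ratio described above.
COST_FALSE_LOW = 5   # predicted low risk, actually high risk
COST_FALSE_HIGH = 1  # predicted high risk, actually low risk


def weighted_cost(actual, predicted):
    """Total misclassification cost under the 5:1 cost ratio."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == "high" and p == "low":
            cost += COST_FALSE_LOW
        elif a == "low" and p == "high":
            cost += COST_FALSE_HIGH
    return cost


def duplicate_high_risk(actual, predicted, copies=5):
    """Repeat every high-risk example `copies` times, keeping predictions aligned."""
    out_actual, out_predicted = [], []
    for a, p in zip(actual, predicted):
        n = copies if a == "high" else 1
        out_actual.extend([a] * n)
        out_predicted.extend([p] * n)
    return out_actual, out_predicted


def error_count(actual, predicted):
    """Plain, unweighted number of misclassified examples."""
    return sum(a != p for a, p in zip(actual, predicted))


# Made-up example: one error of each kind, two correct predictions.
actual = ["low", "high", "low", "high"]
predicted = ["high", "low", "low", "high"]

dup_actual, dup_predicted = duplicate_high_risk(actual, predicted)
print(weighted_cost(actual, predicted), error_count(dup_actual, dup_predicted))
```

Both numbers agree: counting plain errors on the duplicated data gives the same total as applying the 5:1 weights to the original data, which is why duplication works as a simple stand-in for a cost function.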

Convert the dataset format

The original dataset uses a blank-separated format. Machine Learning Studio (classic) works better with a comma-separated value (CSV) file, so you'll convert the dataset by replacing spaces with commas.

There are many ways to convert this data. One way is by using the following Windows PowerShell command:

cat german.data | %{$_ -replace " ",","} | sc german.csv  

Another way is by using the Unix sed command:

sed 's/ /,/g' german.data > german.csv  

In either case, you have created a comma-separated version of the data in a file named german.csv that you can use in your experiment.
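If neither PowerShell nor sed is available, the same replacement can be sketched in Python. This is an illustrative alternative, not something the tutorial itself prescribes; the function names are hypothetical, and the sample row is made up in the dataset's general style rather than copied from german.data.

```python
def spaces_to_commas(text):
    """Replace every space with a comma, as the sed and PowerShell commands do."""
    return text.replace(" ", ",")


def convert_file(src="german.data", dst="german.csv"):
    """Convert a blank-separated file to CSV, line by line."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(spaces_to_commas(line))


# Illustrative row only; real rows in german.data have 21 fields.
sample = "A11 6 A34 1169 1\n"
print(spaces_to_commas(sample), end="")
```

To convert the downloaded file, call convert_file() from the directory that contains german.data.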

Upload the dataset to Machine Learning Studio (classic)

Once the data has been converted to CSV format, you need to upload it into Machine Learning Studio (classic).

  1. Open the Machine Learning Studio (classic) home page (https://studio.ml.azure.cn/).

  2. Click the menu icon in the upper-left corner of the window, click Azure Machine Learning, select Studio, and sign in.

  3. Click +NEW at the bottom of the window.

  4. Select DATASET.

  5. Select FROM LOCAL FILE.

    [Figure: Add a dataset from a local file]

  6. In the Upload a new dataset dialog, click Browse, and find the german.csv file you created.

  7. Enter a name for the dataset. For this tutorial, call it "UCI German Credit Card Data".

  8. For data type, select Generic CSV File With no header (.nh.csv).

  9. Add a description if you'd like.

  10. Click the OK check mark.

    [Figure: Upload the dataset]

This uploads the data into a dataset module that you can use in an experiment.

You can manage datasets that you've uploaded to Studio (classic) by clicking the DATASETS tab to the left of the Studio (classic) window.

[Figure: Manage datasets]

For more information about importing other types of data into an experiment, see Import your training data into Azure Machine Learning Studio (classic).

Create an experiment

The next step in this tutorial is to create an experiment in Machine Learning Studio (classic) that uses the dataset you uploaded.

  1. In Studio (classic), click +NEW at the bottom of the window.

  2. Select EXPERIMENT, and then select "Blank Experiment".

    [Figure: Create a new experiment]

  3. Select the default experiment name at the top of the canvas and rename it to something meaningful.

    [Figure: Rename experiment]

    Tip

    It's a good practice to fill in Summary and Description for the experiment in the Properties pane. These properties give you the chance to document the experiment so that anyone who looks at it later will understand your goals and methodology.

    [Figure: Experiment properties]

  4. In the module palette to the left of the experiment canvas, expand Saved Datasets.

  5. Find the dataset you created under My Datasets and drag it onto the canvas. You can also find the dataset by entering the name in the Search box above the palette.

    [Figure: Add the dataset to the experiment]

Prepare the data

You can view the first 100 rows of the data and some statistical information for the whole dataset: click the output port of the dataset (the small circle at the bottom) and select Visualize.

Because the data file didn't come with column headings, Studio (classic) has provided generic headings (Col1, Col2, etc.). Good headings aren't essential to creating a model, but they make it easier to work with the data in the experiment. Also, when you eventually publish this model in a web service, the headings help identify the columns to the user of the service.

You can add column headings using the Edit Metadata module.

You use the Edit Metadata module to change metadata associated with a dataset. In this case, you use it to provide friendlier names for the column headings.

To use Edit Metadata, you first specify which columns to modify (in this case, all of them). Next, you specify the action to be performed on those columns (in this case, changing column headings).

  1. In the module palette, type "metadata" in the Search box. The Edit Metadata module appears in the module list.

  2. Click and drag the Edit Metadata module onto the canvas and drop it below the dataset you added earlier.

  3. Connect the dataset to Edit Metadata: click the output port of the dataset (the small circle at the bottom of the dataset), drag to the input port of Edit Metadata (the small circle at the top of the module), then release the mouse button. The dataset and module remain connected even if you move either around on the canvas.

    The experiment should now look something like this:

    [Figure: Adding Edit Metadata]

    The red exclamation mark indicates that you haven't set the properties for this module yet. You'll do that next.

    Tip

    You can add a comment to a module by double-clicking the module and entering text. This can help you see at a glance what the module is doing in your experiment. In this case, double-click the Edit Metadata module and type the comment "Add column headings". Click anywhere else on the canvas to close the text box. To display the comment, click the down-arrow on the module.

    [Figure: Edit Metadata module with comment added]

  4. Select Edit Metadata, and in the Properties pane to the right of the canvas, click Launch column selector.

  5. In the Select columns dialog, select all the rows in Available Columns and click > to move them to Selected Columns. The dialog should look like this:

    [Figure: Column selector with all columns selected]

  6. Click the OK check mark.

  7. Back in the Properties pane, look for the New column names parameter. In this field, enter a list of names for the 21 columns in the dataset, separated by commas and in column order. You can obtain the column names from the dataset documentation on the UCI website, or for convenience you can copy and paste the following list:

    Status of checking account, Duration in months, Credit history, Purpose, Credit amount, Savings account/bond, Present employment since, Installment rate in percentage of disposable income, Personal status and sex, Other debtors, Present residence since, Property, Age in years, Other installment plans, Housing, Number of existing credits, Job, Number of people providing maintenance for, Telephone, Foreign worker, Credit risk  
    

    The Properties pane looks like this:

    [Figure: Properties for Edit Metadata]

    Tip

    If you want to verify the column headings, run the experiment (click RUN below the experiment canvas). When it finishes running (a green check mark appears on Edit Metadata), click the output port of the Edit Metadata module, and select Visualize. You can view the output of any module in the same way to follow the progress of the data through the experiment.
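Outside Studio (classic), the effect of renaming the columns can be approximated with a short Python sketch. It only illustrates the idea of attaching the 21 names, in column order, to header-less CSV rows; it is not how Edit Metadata is implemented, and the helper name label_rows and the placeholder values v1 to v21 are hypothetical.

```python
import csv
import io

# The same comma-separated list entered in the New column names parameter.
NEW_COLUMN_NAMES = (
    "Status of checking account, Duration in months, Credit history, Purpose, "
    "Credit amount, Savings account/bond, Present employment since, "
    "Installment rate in percentage of disposable income, Personal status and sex, "
    "Other debtors, Present residence since, Property, Age in years, "
    "Other installment plans, Housing, Number of existing credits, Job, "
    "Number of people providing maintenance for, Telephone, Foreign worker, "
    "Credit risk"
)

names = [name.strip() for name in NEW_COLUMN_NAMES.split(",")]


def label_rows(csv_text):
    """Pair each value in a header-less CSV row with its column name."""
    reader = csv.reader(io.StringIO(csv_text))
    return [dict(zip(names, row)) for row in reader]


# Placeholder row with 21 fields; not real data from german.csv.
placeholder_row = ",".join(f"v{i}" for i in range(1, 22))
labeled = label_rows(placeholder_row)
print(len(names), labeled[0]["Credit risk"])
```

A quick count like this is a cheap way to confirm that the list really contains 21 names before pasting it into the Properties pane.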

Create training and test datasets

You need some data to train the model and some to test it. So in the next step of the experiment, you split the dataset into two separate datasets: one for training the model and one for testing it.

To do this, you use the Split Data module.

  1. Find the Split Data module, drag it onto the canvas, and connect it to the Edit Metadata module.

  2. By default, the split ratio is 0.5 and the Randomized split parameter is set. This means that a random half of the data is output through one port of the Split Data module, and half through the other. You can adjust these parameters, as well as the Random seed parameter, to change the split between training and testing data. For this example, you leave them as-is.

    Tip

    The property Fraction of rows in the first output dataset determines how much of the data is output through the left output port. For instance, if you set the ratio to 0.7, then 70% of the data is output through the left port and 30% through the right port.

  3. Double-click the Split Data module and enter the comment, "Training/testing data split 50%".

You can use the outputs of the Split Data module however you like, but let's choose to use the left output as training data and the right output as testing data.
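As a rough picture of what a randomized split with a fraction and a seed does, here is a minimal Python sketch. It does not reproduce the Split Data module's actual algorithm or its random number generator; the function split_rows and its parameters are hypothetical names that mirror the module's settings.

```python
import random


def split_rows(rows, fraction=0.5, seed=0):
    """Shuffle rows reproducibly, then cut them into a left and right output."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


# 1000 stand-in row ids, matching the number of applicants in the dataset.
rows = list(range(1000))

train, test = split_rows(rows)  # default 0.5 fraction
print(len(train), len(test))

train70, test30 = split_rows(rows, fraction=0.7)
print(len(train70), len(test30))
```

Keeping the seed fixed makes the split repeatable, which matters when you want to compare models trained on the same training rows.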

As mentioned earlier, the cost of misclassifying a high credit risk as low is five times higher than the cost of misclassifying a low credit risk as high. To account for this, you generate a new dataset that reflects this cost function. In the new dataset, each high-risk example is replicated five times, while each low-risk example is not replicated.

You can do this replication using R code:

  1. Find and drag the Execute R Script module onto the experiment canvas.

  2. Connect the left output port of the Split Data module to the first input port ("Dataset1") of the Execute R Script module.

  3. Double-click the Execute R Script module and enter the comment, "Set cost adjustment".

  4. In the Properties pane, delete the default text in the R Script parameter and enter this script:

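    # Read the data connected to the first input port
    dataset1 <- maml.mapInputPort(1)
    # Column 21 is the credit risk label: keep the rows labeled 1 once
    data.set <- dataset1[dataset1[,21]==1,]
    # Rows labeled 2 are the ones to replicate
    pos <- dataset1[dataset1[,21]==2,]
    # Append five copies of those rows
    for (i in 1:5) data.set <- rbind(data.set,pos)
    # Send the result to the module's output port
    maml.mapOutputPort("data.set")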
    

    [Figure: R script in the Execute R Script module]
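For readers less familiar with R, the replication logic can be sketched in Python. This is only an illustration that runs outside Studio (classic), not a replacement for the Execute R Script module. It assumes, as the script and the surrounding text imply, that the value 2 in column 21 marks a high credit risk and 1 a low credit risk; the function name, its parameters, and the sample rows are hypothetical.

```python
def replicate_high_risk(rows, label_index=20, low_label="1", high_label="2", copies=5):
    """Keep low-risk rows once and include each high-risk row `copies` times.

    label_index=20 is the zero-based position of column 21 (the credit risk label).
    """
    low = [row for row in rows if row[label_index] == low_label]
    high = [row for row in rows if row[label_index] == high_label]
    return low + high * copies


def make_row(label):
    """Build a placeholder 21-field row; only the last field matters here."""
    return ["x"] * 20 + [label]


# Made-up input: two low-risk rows and one high-risk row.
rows = [make_row("1"), make_row("2"), make_row("1")]
adjusted = replicate_high_risk(rows)
print(len(adjusted))
```

As in the R script, the output lists the low-risk rows first, followed by the repeated high-risk rows.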

You need to do this same replication operation for each output of the Split Data module so that the training and testing data have the same cost adjustment. The easiest way to do this is by duplicating the Execute R Script module you just made and connecting it to the other output port of the Split Data module.

  1. Right-click the Execute R Script module and select Copy.

  2. Right-click the experiment canvas and select Paste.

  3. Drag the new module into position, and then connect the right output port of the Split Data module to the first input port of this new Execute R Script module.

  4. At the bottom of the canvas, click RUN.

Tip

The copy of the Execute R Script module contains the same script as the original module. When you copy and paste a module on the canvas, the copy retains all the properties of the original.

Your experiment now looks something like this:

[Figure: Adding Split module and R scripts]

For more information on using R scripts in your experiments, see Extend your experiment with R.

Clean up resources

If you no longer need the resources you created using this article, delete them to avoid incurring any charges. Learn how in the article, Export and delete in-product user data.

Next steps

In this tutorial you completed these steps:

  • Create a Machine Learning Studio (classic) workspace
  • Upload existing data into the workspace
  • Create an experiment

You are now ready to train and evaluate models for this data.