Designer sample pipelines

Use the built-in examples in Azure Machine Learning designer to quickly get started building your own machine learning pipelines. The Azure Machine Learning designer GitHub repository contains detailed documentation to help you understand some common machine learning scenarios.

Prerequisites

  • An Azure subscription. If you don't have an Azure subscription, create a trial account.
  • An Azure Machine Learning workspace with the Enterprise SKU.

Use sample pipelines

The designer saves a copy of the sample pipelines to your studio workspace. You can edit a pipeline to adapt it to your needs and save it as your own. Use the samples as a starting point to jumpstart your projects.

Here's how to use a designer sample:

  1. Sign in to studio.ml.azure.cn, and select the workspace you want to work with.

  2. Select Designer.

  3. Select a sample pipeline under the New pipeline section.

    Select Show more samples for a complete list of samples.

  4. To run a pipeline, you first have to set a default compute target to run the pipeline on.

    1. In the Settings pane to the right of the canvas, select Select compute target.

    2. In the dialog that appears, select an existing compute target or create a new one. Select Save.

    3. Select Submit at the top of the canvas to submit a pipeline run.

    Depending on the sample pipeline and compute settings, runs may take some time to complete. The default compute settings have a minimum node count of 0, which means that the designer must allocate resources after the compute has been idle. Repeated pipeline runs take less time because the compute resources are already allocated. Additionally, the designer uses cached results for each module to further improve efficiency.

  5. After the pipeline finishes running, you can review the pipeline and view the output for each module to learn more. Use the following steps to view module outputs:

    1. Select a module in the canvas.

    2. In the module details pane to the right of the canvas, select Outputs + logs. Select the graph (visualize) icon to see the results of each module.

    Use the samples as starting points for some of the most common machine learning scenarios.

Regression

Explore these built-in regression samples.

Sample title | Description
Sample 1: Regression - Automobile Price Prediction (Basic) | Predict car prices using linear regression.
Sample 2: Regression - Automobile Price Prediction (Advanced) | Predict car prices using decision forest and boosted decision tree regressors. Compare models to find the best algorithm.
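
Sample 1 predicts price with linear regression. As a rough illustration of the underlying idea, and not of the designer's own module, the following sketch fits a one-feature least-squares line in plain Python. The feature (engine size), the data values, and the function name `fit_line` are all hypothetical and do not come from the sample pipeline.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: engine size (litres) and price.
engine_sizes = [1.0, 2.0, 3.0, 4.0]
prices = [7000.0, 9000.0, 11000.0, 13000.0]

slope, intercept = fit_line(engine_sizes, prices)

# Predict the price of a hypothetical car with a 2.5-litre engine.
predicted_price = slope * 2.5 + intercept
```

The real sample uses many more features, so treat this only as a picture of what "fit a line, then predict" means.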

Classification

Explore these built-in classification samples. For samples without documentation links, you can learn more by opening the sample and viewing the module comments.

Sample title | Description
Sample 3: Binary Classification with Feature Selection - Income Prediction | Predict income as high or low, using a two-class boosted decision tree. Use Pearson correlation to select features.
Sample 4: Binary Classification with custom Python script - Credit Risk Prediction | Classify credit applications as high or low risk. Use the Execute Python Script module to weight your data.
Sample 5: Binary Classification - Customer Relationship Prediction | Predict customer churn using two-class boosted decision trees. Use SMOTE to sample biased data.
Sample 7: Text Classification - Wikipedia SP 500 Dataset | Classify company types from Wikipedia articles with multiclass logistic regression.
Sample 12: Multiclass Classification - Letter Recognition | Create an ensemble of binary classifiers to classify written letters.
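
Sample 4 weights the credit data with a custom Python script. The German Credit Card dataset description later in this article gives the misclassification costs (1 for low risk labeled as high, 5 for high risk labeled as low) and the label values (1 = low risk, 2 = high risk). One common way to reflect such costs is to replicate the expensive class in the training data. The sketch below shows that idea in plain Python. It is an assumption about how weighting could be done, not a copy of the sample's script, and the function names and row layout are hypothetical.

```python
# Costs and label values quoted in the German Credit dataset description.
COST_LOW_AS_HIGH = 1   # low-risk example predicted as high risk
COST_HIGH_AS_LOW = 5   # high-risk example predicted as low risk
LOW_RISK, HIGH_RISK = 1, 2


def replicate_high_risk(rows, label_key="label", factor=COST_HIGH_AS_LOW):
    """Return a new list in which each high-risk row appears `factor` times."""
    weighted = []
    for row in rows:
        copies = factor if row[label_key] == HIGH_RISK else 1
        weighted.extend(dict(row) for _ in range(copies))
    return weighted


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == LOW_RISK and p == HIGH_RISK:
            cost += COST_LOW_AS_HIGH
        elif a == HIGH_RISK and p == LOW_RISK:
            cost += COST_HIGH_AS_LOW
    return cost


# Hypothetical rows: two low-risk applicants and one high-risk applicant.
rows = [{"id": 1, "label": 1}, {"id": 2, "label": 2}, {"id": 3, "label": 1}]
weighted_rows = replicate_high_risk(rows)

# One low-as-high error (cost 1) plus one high-as-low error (cost 5).
example_cost = total_cost([1, 2, 2, 1], [2, 1, 2, 1])
```

Replication is simple and works with any learner. A learner that accepts per-row weights could use the cost values directly instead.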

Recommender

Explore these built-in recommender samples. For samples without documentation links, you can learn more by opening the sample and viewing the module comments.

Sample title | Description
Sample 10: Recommendation - Movie Rating Tweets | Build a movie recommender engine from movie titles and ratings.

Utility

Learn more about the samples that demonstrate machine learning utilities and features. For samples without documentation links, you can learn more by opening the sample and viewing the module comments.

Sample title | Description
Sample 6: Use custom R script - Flight Delay Prediction |
Sample 8: Cross Validation for Binary Classification - Adult Income Prediction | Use cross validation to build a binary classifier for adult income.
Sample 9: Permutation Feature Importance | Use permutation feature importance to compute importance scores for the test dataset.
Sample 11: Tune Parameters for Binary Classification - Adult Income Prediction | Use Tune Model Hyperparameters to find optimal hyperparameters to build a binary classifier.
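
Sample 9 relies on permutation feature importance. The general technique shuffles one feature column at a time and measures how much the model's score drops. A minimal plain-Python sketch follows. The toy model, the data, the use of accuracy as the score, and all function names are assumptions made for illustration, and the designer module may differ in its details.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(labels)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only this column; leave the other columns untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that looks only at feature 0."""
    return 1 if row[0] > 0.5 else 0


# Hypothetical test set: feature 0 determines the label, feature 1 is noise.
rows = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
labels = [0, 1, 0, 1, 0, 1]

scores = permutation_importance(toy_model, rows, labels)
```

Because the toy model ignores feature 1, shuffling that column cannot change any prediction, so its importance is exactly zero. Feature 0 should receive a non-negative score.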

Datasets

When you create a new pipeline in Azure Machine Learning designer, a number of sample datasets are included by default. The sample pipelines on the designer homepage use these sample datasets.

The sample datasets are available under the Datasets-Samples category. You can find this category in the module palette to the left of the canvas in the designer. To use any of these datasets in your own pipeline, drag it to the canvas.

Dataset name | Dataset description
Adult Census Income Binary Classification dataset | A subset of the 1994 Census database, using working adults over the age of 16 with an adjusted income index of > 100.
  Usage: Classify people using demographics to predict whether a person earns over 50K a year.
  Related Research: Kohavi, R., Becker, B. (1996). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Automobile price data (Raw) | Information about automobiles by make and model, including the price, features such as the number of cylinders and MPG, as well as an insurance risk score.
  The risk score is initially associated with auto price. It is then adjusted for actual risk in a process known to actuaries as symboling. A value of +3 indicates that the auto is risky, and a value of -3 that it is probably safe.
  Usage: Predict the risk score by features, using regression or multivariate classification.
  Related Research: Schlimmer, J.C. (1987). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
CRM Appetency Labels Shared | Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_appetency.labels).
CRM Churn Labels Shared | Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_churn.labels).
CRM Dataset Shared | This data comes from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train.data.zip).
  The dataset contains 50K customers from the French Telecom company Orange. Each customer has 230 anonymized features, 190 of which are numeric and 40 of which are categorical. The features are very sparse.
CRM Upselling Labels Shared | Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_large_train_upselling.labels).
Flight Delays Data | Passenger flight on-time performance data taken from the TranStats data collection of the U.S. Department of Transportation (On-Time).
  The dataset covers the time period April-October 2013. Before uploading to the designer, the dataset was processed as follows:
  - The dataset was filtered to cover only the 70 busiest airports in the continental US
  - Canceled flights were labeled as delayed by more than 15 minutes
  - Diverted flights were filtered out
  - The following columns were selected: Year, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepDelay, DepDel15, CRSArrTime, ArrDelay, ArrDel15, Canceled
German Credit Card UCI dataset | The UCI Statlog (German Credit Card) dataset (Statlog+German+Credit+Data), using the german.data file.
  The dataset classifies people, described by a set of attributes, as low or high credit risks. Each example represents a person. There are 20 features, both numerical and categorical, and a binary label (the credit risk value). High credit risk entries have label = 2, low credit risk entries have label = 1. The cost of misclassifying a low risk example as high is 1, whereas the cost of misclassifying a high risk example as low is 5.
IMDB Movie Titles | The dataset contains information about movies that were rated in Twitter tweets: IMDB movie ID, movie name, genre, and production year. There are 17K movies in the dataset. The dataset was introduced in the paper "S. Dooms, T. De Pessemier and L. Martens. MovieTweetings: a Movie Rating Dataset Collected From Twitter. Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013."
Movie Ratings | The dataset is an extended version of the Movie Tweetings dataset. The dataset has 170K ratings for movies, extracted from well-structured tweets on Twitter. Each instance represents a tweet and is a tuple: user ID, IMDB movie ID, rating, timestamp, number of favorites for this tweet, and number of retweets of this tweet. The dataset was made available by A. Said, S. Dooms, B. Loni and D. Tikk for Recommender Systems Challenge 2014.
Weather Dataset | Hourly land-based weather observations from NOAA (merged data from 201304 to 201310).
  The weather data covers observations made from airport weather stations, covering the time period April-October 2013. Before uploading to the designer, the dataset was processed as follows:
  - Weather station IDs were mapped to corresponding airport IDs
  - Weather stations not associated with the 70 busiest airports were filtered out
  - The Date column was split into separate Year, Month, and Day columns
  - The following columns were selected: AirportID, Year, Month, Day, Time, TimeZone, SkyCondition, Visibility, WeatherType, DryBulbFarenheit, DryBulbCelsius, WetBulbFarenheit, WetBulbCelsius, DewPointFarenheit, DewPointCelsius, RelativeHumidity, WindSpeed, WindDirection, ValueForWindCharacter, StationPressure, PressureTendency, PressureChange, SeaLevelPressure, RecordType, HourlyPrecip, Altimeter
Wikipedia SP 500 Dataset | Data is derived from Wikipedia (https://www.wikipedia.org/) based on articles of each S&P 500 company, stored as XML data.
  Before uploading to the designer, the dataset was processed as follows:
  - Text content was extracted for each specific company
  - Wiki formatting was removed
  - Non-alphanumeric characters were removed
  - All text was converted to lowercase
  - Known company categories were added
  Note that for some companies an article could not be found, so the number of records is less than 500.
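
The preprocessing steps listed above for the Flight Delays Data and the Wikipedia SP 500 Dataset can be sketched in plain Python. This is only an illustration of the described steps, not the code that produced the datasets. The function names are hypothetical, the `Diverted` field is an assumed name (it is not among the selected columns), setting both `DepDel15` and `ArrDel15` for canceled flights is an assumption, and the regular expressions are a rough approximation of "remove wiki formatting".

```python
import re


def prepare_flights(flights):
    """Drop diverted flights and mark canceled flights as delayed > 15 minutes."""
    prepared = []
    for flight in flights:
        if flight["Diverted"]:       # assumed field name
            continue
        row = dict(flight)           # copy so the input is not modified
        if row["Canceled"]:
            # Assumption: both 15-minute delay indicators are set to 1.
            row["DepDel15"] = 1
            row["ArrDel15"] = 1
        prepared.append(row)
    return prepared


def clean_article_text(text):
    """Strip simple wiki markup, drop non-alphanumerics, and lowercase."""
    # [[target|label]] -> label, and [[label]] -> label
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove the quote runs used for bold and italic markup
    text = re.sub(r"'{2,}", "", text)
    # Replace anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


# Hypothetical flight records.
flights = [
    {"Carrier": "AA", "Canceled": 0, "Diverted": 1, "DepDel15": 0, "ArrDel15": 0},
    {"Carrier": "BB", "Canceled": 1, "Diverted": 0, "DepDel15": 0, "ArrDel15": 0},
    {"Carrier": "CC", "Canceled": 0, "Diverted": 0, "DepDel15": 0, "ArrDel15": 0},
]
prepared = prepare_flights(flights)

# Hypothetical wiki-formatted sentence.
cleaned = clean_article_text("'''Apple Inc.''' is a [[technology company|tech company]].")
```

Non-alphanumeric characters are replaced with spaces rather than deleted so that words separated only by punctuation do not run together. That is a design choice of this sketch, not something the dataset description specifies.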

Clean up resources

Important

You can use the resources that you created as prerequisites for other Azure Machine Learning tutorials and how-to articles.

Delete everything

If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the left side of the window.

    [Screenshot: Delete a resource group in the Azure portal]

  2. In the list, select the resource group that you created.

  3. Select Delete resource group.

Deleting the resource group also deletes all resources that you created in the designer.

Delete individual assets

In the designer where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.

The compute target that you created here automatically scales down to zero nodes when it's not being used. This action is taken to minimize charges. If you want to delete the compute target, take these steps:

[Screenshot: Delete assets]

You can unregister datasets from your workspace by selecting each dataset and selecting Unregister.

[Screenshot: Unregister a dataset]

To delete a dataset, go to the storage account by using the Azure portal or Azure Storage Explorer and manually delete those assets.

Next steps

Learn the basics of predictive analytics and machine learning with Tutorial: Predict automobile price with the designer.