Use the sample datasets in Azure Machine Learning Studio (classic)

APPLIES TO: Machine Learning Studio (classic) (yes), Azure Machine Learning (no)

When you create a new workspace in Azure Machine Learning Studio (classic), a number of sample datasets and experiments are included by default. Many of these sample datasets are used by the sample models in the Azure AI Gallery. Others are included as examples of the types of data typically used in machine learning.

Some of these datasets are available in Azure Blob storage. For these datasets, the following table provides a direct link. You can use them in your experiments through the Import Data module.

The rest of the sample datasets are available in your workspace under Saved Datasets, which you can find in the module palette to the left of the experiment canvas in Machine Learning Studio (classic). To use any of these datasets in your own experiment, drag it onto the experiment canvas.

Datasets

Dataset name: Dataset description
Adult Census Income Binary Classification dataset: A subset of the 1994 Census database, using working adults over the age of 16 with an adjusted income index of > 100.

Usage: Classify people using demographics to predict whether a person earns over 50K a year.

Related Research: Kohavi, R., Becker, B. (1996). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Airport Codes Dataset: U.S. airport codes.

This dataset contains one row for each U.S. airport, providing the airport ID number and name along with the location city and state.
Automobile price data (Raw): Information about automobiles by make and model, including the price, features such as the number of cylinders and MPG, and an insurance risk score.

The risk score is initially associated with auto price. It is then adjusted for actual risk in a process known to actuaries as symboling. A value of +3 indicates that the auto is risky, and a value of -3 indicates that it is probably safe.

Usage: Predict the risk score by features, using regression or multivariate classification.

Related Research: Schlimmer, J.C. (1987). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Bike Rental UCI dataset: The UCI Bike Rental dataset, based on real data from Capital Bikeshare, the company that maintains a bike rental network in Washington DC.

The dataset has one row for each hour of each day in 2011 and 2012, for a total of 17,379 rows. The number of hourly bike rentals ranges from 1 to 977.
Bill Gates RGB Image: A publicly available image file converted to CSV data.

The code for converting the image is provided on the detail page of the Color quantization using K-Means clustering model.
Blood donation data: A subset of data from the blood donor database of the Blood Transfusion Service Center of Hsin-Chu City, Taiwan.

Donor data includes recency (the months since the last donation), frequency (the total number of donations), the time since the first donation, and the amount of blood donated.

Usage: The goal is to predict via classification whether the donor donated blood in March 2007, where 1 indicates a donor during the target period, and 0 a non-donor.

Related Research: Yeh, I.C. (2008). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008, https://dx.doi.org/10.1016/j.eswa.2008.07.018
Breast cancer data: One of three cancer-related datasets provided by the Oncology Institute that appear frequently in machine learning literature. It combines diagnostic information with features from laboratory analysis of about 300 tissue samples.

Usage: Classify the type of cancer, based on 9 attributes, some of which are linear and some of which are categorical.

Related Research: Wohlberg, W.H., Street, W.N., & Mangasarian, O.L. (1995). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Breast Cancer Features: The dataset contains information for 102K suspicious regions (candidates) of X-ray images, each described by 117 features. The features are proprietary, and their meaning is not revealed by the dataset creators (Siemens Healthcare).
Breast Cancer Info: The dataset contains additional information for each suspicious region of an X-ray image. Each example provides information (for example, label, patient ID, coordinates of the patch relative to the whole image) about the corresponding row number in the Breast Cancer Features dataset. Each patient has a number of examples. For patients who have cancer, some examples are positive and some are negative. For patients who don't have cancer, all examples are negative. The dataset has 102K examples. The dataset is biased: 0.6% of the points are positive, and the rest are negative. The dataset was made available by Siemens Healthcare.
CRM Appetency Labels Shared: Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_appetency.labels).
CRM Churn Labels Shared: Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_churn.labels).
CRM Dataset Shared: This data comes from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train.data.zip).

The dataset contains 50K customers from the French Telecom company Orange. Each customer has 230 anonymized features, 190 of which are numeric and 40 of which are categorical. The features are very sparse.
CRM Upselling Labels Shared: Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_large_train_upselling.labels).
Energy-Efficiency Regression data: A collection of simulated energy profiles, based on 12 different building shapes. The buildings are differentiated by eight features, including glazing area, glazing area distribution, and orientation.

Usage: Use either regression or classification to predict the energy-efficiency rating, based on one of two real-valued responses. For multi-class classification, round the response variable to the nearest integer.

Related Research: Xifara, A. & Tsanas, A. (2012). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
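The rounding step mentioned in the usage note can be sketched in Python as follows. This is only an illustration: the function name and the sample response values are hypothetical and are not taken from the dataset. Halves are rounded up explicitly, because Python's built-in round() rounds halves to the nearest even integer.

```python
import math


def to_class_label(response: float) -> int:
    """Round a real-valued response to the nearest integer (halves round up)."""
    return math.floor(response + 0.5)


# Hypothetical response values, for illustration only.
responses = [15.55, 20.84, 28.5, 32.26]
labels = [to_class_label(r) for r in responses]
print(labels)
```

The resulting integer labels can then serve as the class column for a multi-class classifier, while the unrounded values remain available for regression.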
Flight Delays Data: Passenger flight on-time performance data taken from the TranStats data collection of the U.S. Department of Transportation (On-Time).

The dataset covers the time period April-October 2013. Before uploading to Azure Machine Learning Studio (classic), the dataset was processed as follows:
  • The dataset was filtered to cover only the 70 busiest airports in the continental US
  • Canceled flights were labeled as delayed by more than 15 minutes
  • Diverted flights were filtered out
  • The following columns were selected: Year, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepDelay, DepDel15, CRSArrTime, ArrDelay, ArrDel15, Canceled
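Two of the processing steps listed above (labeling canceled flights as delayed and dropping diverted flights) can be sketched in Python. This is a rough illustration rather than the actual preprocessing code: the Diverted column is an assumption (it is not among the selected columns, so it is treated here as a flag present only in the raw data), and the sample rows and carrier codes are invented.

```python
def label_flights(rows):
    """Drop diverted flights and mark canceled flights as delayed > 15 min."""
    kept = []
    for row in rows:
        # "Diverted" is a hypothetical raw-data flag, not a published column.
        if row.get("Diverted"):
            continue
        labeled = dict(row)
        if labeled["Canceled"]:
            # Treat a canceled flight as delayed by more than 15 minutes.
            labeled["DepDel15"] = 1
            labeled["ArrDel15"] = 1
        labeled.pop("Diverted", None)  # keep only the selected columns
        kept.append(labeled)
    return kept


# Invented sample rows, for illustration only.
flights = [
    {"Carrier": "XX", "DepDel15": 0, "ArrDel15": 0, "Canceled": 0, "Diverted": 0},
    {"Carrier": "YY", "DepDel15": None, "ArrDel15": None, "Canceled": 1, "Diverted": 0},
    {"Carrier": "ZZ", "DepDel15": 0, "ArrDel15": None, "Canceled": 0, "Diverted": 1},
]
result = label_flights(flights)
print(result)
```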
Flight on-time performance (Raw): Records of airplane flight arrivals and departures within the United States from October 2011.

Usage: Predict flight delays.

Related Research: From US Dept. of Transportation https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time.
Forest fires data: Contains weather data, such as temperature and humidity indices and wind speed. The data is taken from an area of northeast Portugal and is combined with records of forest fires.

Usage: This is a difficult regression task, where the aim is to predict the burned area of forest fires.

Related Research: Cortez, P., & Morais, A. (2008). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

[Cortez and Morais, 2007] P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf.
German Credit Card UCI dataset: The UCI Statlog (German Credit Card) dataset (Statlog+German+Credit+Data), using the german.data file.

The dataset classifies people, described by a set of attributes, as low or high credit risks. Each example represents a person. There are 20 features, both numerical and categorical, and a binary label (the credit risk value). High credit risk entries have label = 2, and low credit risk entries have label = 1. The cost of misclassifying a low risk example as high is 1, whereas the cost of misclassifying a high risk example as low is 5.
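To make the asymmetric misclassification costs concrete, the following Python sketch scores a set of predictions against the cost values stated above (1 for low predicted as high, 5 for high predicted as low). The function name and the sample labels are hypothetical.

```python
LOW_RISK, HIGH_RISK = 1, 2

# (actual label, predicted label) -> cost, as described for this dataset.
COST = {
    (LOW_RISK, HIGH_RISK): 1,  # low risk misclassified as high
    (HIGH_RISK, LOW_RISK): 5,  # high risk misclassified as low
}


def total_cost(actual, predicted):
    """Sum the misclassification cost; correct predictions cost nothing."""
    return sum(COST.get((a, p), 0) for a, p in zip(actual, predicted))


# Invented labels: one low->high error and one high->low error.
actual = [1, 1, 2, 2, 1]
predicted = [1, 2, 1, 2, 1]
print(total_cost(actual, predicted))
```

Comparing models by total cost rather than plain accuracy penalizes the more expensive error (approving a high-risk applicant) five times as heavily as the cheaper one.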
IMDB Movie Titles: The dataset contains information about movies that were rated in Twitter tweets: IMDB movie ID, movie name, genre, and production year. There are 17K movies in the dataset. The dataset was introduced in the paper "S. Dooms, T. De Pessemier and L. Martens. MovieTweetings: a Movie Rating Dataset Collected From Twitter. Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013."
Iris two class data: This is perhaps the best known database to be found in the pattern recognition literature. The dataset is relatively small, containing 50 examples each of petal measurements from three iris varieties.

Usage: Predict the iris type from the measurements.

Related Research: Fisher, R.A. (1988). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Movie Tweets: The dataset is an extended version of the Movie Tweetings dataset. It has 170K ratings for movies, extracted from well-structured tweets on Twitter. Each instance represents a tweet and is a tuple: user ID, IMDB movie ID, rating, timestamp, number of favorites for this tweet, and number of retweets of this tweet. The dataset was made available by A. Said, S. Dooms, B. Loni and D. Tikk for Recommender Systems Challenge 2014.
MPG data for various automobiles: This dataset is a slightly modified version of the dataset provided by the StatLib library of Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

The data lists fuel consumption for various automobiles in miles per gallon. It also includes information such as the number of cylinders, engine displacement, horsepower, total weight, and acceleration.

Usage: Predict fuel economy based on three multivalued discrete attributes and five continuous attributes.

Related Research: StatLib, Carnegie Mellon University (1993). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Pima Indians Diabetes Binary Classification dataset: A subset of data from the National Institute of Diabetes and Digestive and Kidney Diseases database. The dataset was filtered to focus on female patients of Pima Indian heritage. The data includes medical data such as glucose and insulin levels, as well as lifestyle factors.

Usage: Predict whether the subject has diabetes (binary classification).

Related Research: Sigillito, V. (1990). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Restaurant customer data: A set of metadata about customers, including demographics and preferences.

Usage: Use this dataset, in combination with the other two restaurant datasets, to train and test a recommender system.

Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Restaurant feature data: A set of metadata about restaurants and their features, such as food type, dining style, and location.

Usage: Use this dataset, in combination with the other two restaurant datasets, to train and test a recommender system.

Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Restaurant ratings: Contains ratings given by users to restaurants on a scale from 0 to 2.

Usage: Use this dataset, in combination with the other two restaurant datasets, to train and test a recommender system.

Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Steel Annealing multi-class dataset: This dataset contains a series of records from steel annealing trials. It contains the physical attributes (width, thickness, and type, such as coil or sheet) of the resulting steel types.

Usage: Predict either of two numeric class attributes: hardness or strength. You might also analyze correlations among attributes.

Steel grades follow a set standard, defined by SAE and other organizations. You are looking for a specific 'grade' (the class variable) and want to understand the values needed.

Related Research: Sterling, D. & Buntine, W. (NA). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

A useful guide to steel grades can be found here: https://www.steamforum.com/pictures/Outokumpu-steel-grades-properties-global-standards.pdf
Telescope data: A record of high-energy gamma particle bursts along with background noise, both simulated using a Monte Carlo process.

The intent of the simulation was to improve the accuracy of ground-based atmospheric Cherenkov gamma telescopes. This is done by using statistical methods to differentiate between the desired signal (Cherenkov radiation showers) and background noise (hadronic showers initiated by cosmic rays in the upper atmosphere).

The data has been pre-processed to create an elongated cluster with the long axis oriented toward the camera center. The characteristics of this ellipse (often called Hillas parameters) are among the image parameters that can be used for discrimination.

Usage: Predict whether the image of a shower represents signal or background noise.

Notes: Simple classification accuracy is not meaningful for this data, since classifying a background event as signal is worse than classifying a signal event as background. To compare different classifiers, use the ROC graph. The probability of accepting a background event as signal must be below one of the following thresholds: 0.01, 0.02, 0.05, 0.1, or 0.2.

Also, note that the number of background events (h, for hadronic showers) is underestimated. In real measurements, the h or noise class represents the majority of events.

Related Research: Bock, R.K. (1995). UCI Machine Learning Repository https://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information
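One way to read the evaluation note above is to compare classifiers by the signal acceptance (true-positive rate) they reach while keeping the background acceptance (false-positive rate) at or below each listed threshold. The Python sketch below illustrates that idea on invented scores; the function name and sample data are hypothetical, and the brute-force search is chosen for clarity rather than speed.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Best true-positive rate whose false-positive rate is <= max_fpr.

    labels: 1 = signal, 0 = background; a higher score means more signal-like.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for threshold in sorted(set(scores)):
        # Accept every event whose score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Invented classifier scores and true labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for max_fpr in (0.01, 0.02, 0.05, 0.1, 0.2):
    print(max_fpr, tpr_at_fpr(scores, labels, max_fpr))
```

With only four background events in this toy sample, every listed threshold allows zero false positives, so each one yields the same true-positive rate. On the full dataset the values would differ per threshold.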
Weather Dataset: Hourly land-based weather observations from NOAA (merged data from 201304 to 201310).

The weather data covers observations made from airport weather stations during the time period April-October 2013. Before uploading to Azure Machine Learning Studio (classic), the dataset was processed as follows:
  • Weather station IDs were mapped to corresponding airport IDs
  • Weather stations not associated with the 70 busiest airports were filtered out
  • The Date column was split into separate Year, Month, and Day columns
  • The following columns were selected: AirportID, Year, Month, Day, Time, TimeZone, SkyCondition, Visibility, WeatherType, DryBulbFarenheit, DryBulbCelsius, WetBulbFarenheit, WetBulbCelsius, DewPointFarenheit, DewPointCelsius, RelativeHumidity, WindSpeed, WindDirection, ValueForWindCharacter, StationPressure, PressureTendency, PressureChange, SeaLevelPressure, RecordType, HourlyPrecip, Altimeter
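The first three processing steps above can be sketched in Python as follows. Several details are assumptions made for illustration: the raw column names StationID and Date, the YYYYMMDD date format, the station-to-airport mapping, and the sample rows are all hypothetical, and only a few output columns are shown.

```python
def preprocess_weather(rows, station_to_airport):
    """Map stations to airports, drop unmapped stations, and split the date."""
    processed = []
    for row in rows:
        airport_id = station_to_airport.get(row["StationID"])
        if airport_id is None:
            # Station is not associated with one of the selected airports.
            continue
        date = str(row["Date"])  # assumed to be formatted as YYYYMMDD
        processed.append({
            "AirportID": airport_id,
            "Year": int(date[0:4]),
            "Month": int(date[4:6]),
            "Day": int(date[6:8]),
            "Time": row["Time"],
        })
    return processed


# Hypothetical mapping and observations, for illustration only.
station_to_airport = {"S001": 101}
observations = [
    {"StationID": "S001", "Date": "20130401", "Time": 56},
    {"StationID": "S999", "Date": "20130401", "Time": 56},
]
weather = preprocess_weather(observations, station_to_airport)
print(weather)
```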
Wikipedia SP 500 Dataset: Data is derived from Wikipedia (https://www.wikipedia.org/) based on articles of each S&P 500 company, stored as XML data.

Before uploading to Azure Machine Learning Studio (classic), the dataset was processed as follows:
  • Text content was extracted for each specific company
  • Wiki formatting was removed
  • Non-alphanumeric characters were removed
  • All text was converted to lowercase
  • Known company categories were added

Note that for some companies an article could not be found, so the number of records is less than 500.
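The text-cleaning steps listed above can be approximated with a few regular expressions. The Python sketch below is a simplification and not the code actually used to build the dataset: it handles only simple templates, links, and quote markup, and it replaces non-alphanumeric runs with a single space (an assumption made here so that word boundaries survive). The sample sentence and company name are invented.

```python
import re


def clean_wiki_text(text: str) -> str:
    """Strip basic wiki markup and non-alphanumerics, then lowercase."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # drop simple {{templates}}
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote markup
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # non-alphanumerics -> space
    return text.lower().strip()


# Invented sample text, for illustration only.
sample = "'''Acme Corp.''' is a [[public company|public]] firm {{citation needed}} founded in 1999."
cleaned = clean_wiki_text(sample)
print(cleaned)
```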
direct_marketing.csv: The dataset contains customer data and indications about their response to a direct mailing campaign. Each row represents a customer. The dataset contains nine features about user demographics and past behavior, and three label columns (visit, conversion, and spend). Visit is a binary column that indicates whether a customer visited after the marketing campaign. Conversion indicates that a customer purchased something. Spend is the amount that was spent. The dataset was made available by Kevin Hillstrom for the MineThatData E-Mail Analytics And Data Mining Challenge.
lyrl2004_tokens_test.csv: Features of test examples in the RCV1-V2 Reuters news dataset. The dataset has 781K news articles along with their IDs (first column of the dataset). Each article is tokenized, stripped of stop words, and stemmed. The dataset was made available by David D. Lewis.
lyrl2004_tokens_train.csv: Features of training examples in the RCV1-V2 Reuters news dataset. The dataset has 23K news articles along with their IDs (first column of the dataset). Each article is tokenized, stripped of stop words, and stemmed. The dataset was made available by David D. Lewis.
network_intrusion_detection.csv: Dataset from the KDD Cup 1999 Knowledge Discovery and Data Mining Tools Competition (kddcup99.html).

The dataset was downloaded and stored in Azure Blob storage (network_intrusion_detection.csv) and includes both training and testing datasets. The training dataset has approximately 126K rows and 43 columns, including the labels. Three columns are part of the label information, and 40 columns, consisting of numeric and string/categorical features, are available for training the model. The test data has approximately 22.5K test examples with the same 43 columns as in the training data.
rcv1-v2.topics.qrels.csv: Topic assignments for news articles in the RCV1-V2 Reuters news dataset. A news article can be assigned to several topics. The format of each row is "<topic name> <document id> 1". The dataset contains 2.6M topic assignments. The dataset was made available by David D. Lewis.
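Because an article can carry several topics, the row format above is naturally grouped by document ID. The following Python sketch shows one way to do that. It assumes the three fields are separated by whitespace, as the format string suggests; the function name and the sample rows are illustrative and were not read from the file.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Group topic names by document ID from '<topic> <doc id> 1' rows."""
    topics_by_doc = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        topic, doc_id, _flag = parts
        topics_by_doc[doc_id].append(topic)
    return dict(topics_by_doc)


# Illustrative rows in the described format.
sample_rows = ["E11 2286 1", "ECAT 2286 1", "M11 2287 1", ""]
assignments = parse_qrels(sample_rows)
print(assignments)
```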
student_performance.txt: This data comes from the KDD Cup 2010 Student performance evaluation challenge (student performance evaluation). The data used is the Algebra_2008_2009 training set (Stamper, J., Niculescu-Mizil, A., Ritter, S., Gordon, G.J., & Koedinger, K.R. (2010). Algebra I 2008-2009. Challenge dataset from KDD Cup 2010 Educational Data Mining Challenge. Find it at downloads.jsp).

The dataset was downloaded and stored in Azure Blob storage (student_performance.txt) and contains log files from a student tutoring system. The supplied features include the problem ID and its brief description, student ID, timestamp, and how many attempts the student made before solving the problem correctly. The original dataset has 8.9M records; this dataset has been down-sampled to the first 100K rows. The dataset has 23 tab-separated columns of various types: numeric, categorical, and timestamp.
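Down-sampling a tab-separated file to its first rows, as described above, can be sketched with Python's standard csv module. The function name, column names, and sample data below are hypothetical, and the tiny in-memory file stands in for the real 23-column log.

```python
import csv
import io
from itertools import islice


def head_rows(stream, n):
    """Return the header and the first n data rows of a tab-separated stream."""
    # QUOTE_NONE treats quote characters in the log text as ordinary data.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    return header, list(islice(reader, n))


# Hypothetical three-column sample standing in for the real log file.
sample_file = io.StringIO(
    "StudentID\tProblemID\tAttempts\n"
    "s1\tp1\t1\n"
    "s1\tp2\t3\n"
    "s2\tp1\t2\n"
)
header, rows = head_rows(sample_file, 2)
print(header, rows)
```

For the real file, the same function could be called with an open file handle and n set to 100000, which reads only as many lines as needed.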

Next steps