The Team Data Science Process in action - Using an Azure HDInsight Hadoop Cluster on a 1 TB dataset

This walkthrough demonstrates how to use the Team Data Science Process in an end-to-end scenario with an Azure HDInsight Hadoop cluster to store, explore, feature-engineer, and down-sample data from one of the publicly available Criteo datasets. It uses Azure Machine Learning to build a binary classification model on this data, and it shows how to publish one of these models as a Web service.

It is also possible to use an IPython notebook to accomplish the tasks presented in this walkthrough. Users who would like to try this approach should consult the Criteo walkthrough using a Hive ODBC connection topic.

Criteo Dataset Description

The Criteo data is a click-prediction dataset consisting of 370 GB of gzip-compressed TSV files (~1.3 TB uncompressed) and comprising more than 4.3 billion records. It is taken from 24 days of click data made available by Criteo. For the convenience of data scientists, the data has been unzipped and made available for experimentation.

Each record in this dataset contains 40 columns:

  • The first column is a label column that indicates whether a user clicked an ad (value 1) or did not click one (value 0).
  • The next 13 columns are numeric.
  • The last 26 columns are categorical.

The columns are anonymized and use a series of enumerated names: "Col1" (for the label column) through "Col40" (for the last categorical column).

Here is an excerpt of the first 20 columns of two observations (rows) from this dataset:

Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11 Col12 Col13 Col14 Col15 Col16 Col17 Col18 Col19 Col20

0 40 42 2 54 3 0 0 2 16 0 1 4448 4 1acfe1ee 1b2ff61f 2e8b2631 6faef306 c6fc10d3 6fcd6dcb

0 24 27 5 0 2 1 3 10064 9a8cb066 7a06385f 417e6103 2170fc56 acf676aa 6fcd6dcb

There are missing values in both the numeric and categorical columns of this dataset. A simple method for handling the missing values is described, and additional details of the data are explored when storing it in Hive tables.

Definition: Clickthrough rate (CTR): the percentage of clicks in the data. In this Criteo dataset, the CTR is about 3.3%, or 0.033.

Examples of prediction tasks

Two sample prediction problems are addressed in this walkthrough:

  1. Binary classification: predict whether a user clicked an ad:

    • Class 0: no click
    • Class 1: click
  2. Regression: predict the probability of an ad click from user features.

Set Up an HDInsight Hadoop cluster for data science

Note

This step is typically an Admin task.

Set up your Azure Data Science environment for building predictive analytics solutions with HDInsight clusters in three steps:

  1. Create a storage account: This storage account is used to store data in Azure Blob Storage. The data used in HDInsight clusters is stored here.

  2. Customize Azure HDInsight Hadoop clusters for data science: This step creates an Azure HDInsight Hadoop cluster with 64-bit Anaconda Python 2.7 installed on all nodes. There are two important steps (described in this topic) to complete when customizing the HDInsight cluster:

    • Link the storage account created in step 1 with your HDInsight cluster when it is created. This storage account is used for accessing data that can be processed within the cluster.
    • Enable Remote Access to the head node of the cluster after it is created, and remember the remote access credentials you specify here (they are different from the credentials specified at cluster creation).
  3. Create an Azure Machine Learning Studio (classic) workspace: This Azure Machine Learning workspace is used for building machine learning models after an initial data exploration and down-sampling on the HDInsight cluster.

Get and consume data from a public source

The Criteo dataset can be accessed by clicking the link, accepting the terms of use, and providing a name. A snapshot is shown here:

Accept Criteo terms

Click Continue to Download to read more about the dataset and its availability.

The data resides in an Azure Blob Storage location: wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/. The "wasb" prefix refers to an Azure Blob Storage location.

  1. The data in this Azure Blob Storage location consists of three subfolders of unzipped data:

    1. The subfolder raw/count/ contains the first 21 days of data - from day_00 to day_20.
    2. The subfolder raw/train/ consists of a single day of data, day_21.
    3. The subfolder raw/test/ consists of two days of data, day_22 and day_23.
  2. The raw gzip data are also available in the main folder raw/ as day_NN.gz, where NN goes from 00 to 23. A quick way to list these subfolders is shown after this list.
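For a quick look at this layout, you can list the subfolders from the Hadoop Command Line on the cluster head node. This is a convenience check, not a required step of this walkthrough:

hadoop fs -ls wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/
hadoop fs -ls wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/train/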

An alternative approach to accessing, exploring, and modeling this data that does not require any local downloads is explained later in this walkthrough, when we create Hive tables.

Log in to the cluster headnode

To log in to the headnode of the cluster, use the Azure portal to locate the cluster. Click the HDInsight elephant icon on the left and then double-click the name of your cluster. Navigate to the Configuration tab, double-click the CONNECT icon at the bottom of the page, and enter your remote access credentials when prompted. Doing so takes you to the headnode of the cluster.

Here is what a typical first log in to the cluster headnode looks like:

Log in to the cluster

On the left is the "Hadoop Command Line", which is our workhorse for the data exploration. Notice two useful URLs: "Hadoop Yarn Status" and "Hadoop Name Node". The Yarn status URL shows job progress, and the name node URL gives details on the cluster configuration.

Now you are set up and ready to begin the first part of the walkthrough: data exploration using Hive, and getting data ready for Azure Machine Learning.

Create Hive database and tables

To create Hive tables for the Criteo dataset, open the Hadoop Command Line on the desktop of the head node, and enter the Hive directory by entering the command

cd %hive_home%\bin

Note

Run all Hive commands in this walkthrough from the Hive bin/ directory prompt. This takes care of any path issues automatically. The terms "Hive directory prompt", "Hive bin/ directory prompt", and "Hadoop Command Line" are used interchangeably.

Note

To execute any Hive query, you can always use the following commands:

cd %hive_home%\bin
hive

After the Hive REPL appears with a "hive>" prompt, simply cut and paste the query to execute it.

The following code creates a database "criteo" and then generates four tables:

  • a table for generating counts, built on days day_00 to day_20,
  • a table for use as the train dataset, built on day_21, and
  • two tables for use as the test datasets, built on day_22 and day_23 respectively.

The test dataset is split into two different tables because one of the days is a holiday. The objective is to determine whether the model can detect differences between a holiday and a non-holiday from the click-through rate.

The script sample_hive_create_criteo_database_and_tables.hql is displayed here for convenience:

CREATE DATABASE IF NOT EXISTS criteo;
DROP TABLE IF EXISTS criteo.criteo_count;
CREATE TABLE criteo.criteo_count (
col1 string,col2 double,col3 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9 double,col10 double,col11 double,col12 double,col13 double,col14 double,col15 string,col16 string,col17 string,col18 string,col19 string,col20 string,col21 string,col22 string,col23 string,col24 string,col25 string,col26 string,col27 string,col28 string,col29 string,col30 string,col31 string,col32 string,col33 string,col34 string,col35 string,col36 string,col37 string,col38 string,col39 string,col40 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/count';

DROP TABLE IF EXISTS criteo.criteo_train;
CREATE TABLE criteo.criteo_train (
col1 string,col2 double,col3 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9 double,col10 double,col11 double,col12 double,col13 double,col14 double,col15 string,col16 string,col17 string,col18 string,col19 string,col20 string,col21 string,col22 string,col23 string,col24 string,col25 string,col26 string,col27 string,col28 string,col29 string,col30 string,col31 string,col32 string,col33 string,col34 string,col35 string,col36 string,col37 string,col38 string,col39 string,col40 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/train';

DROP TABLE IF EXISTS criteo.criteo_test_day_22;
CREATE TABLE criteo.criteo_test_day_22 (
col1 string,col2 double,col3 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9 double,col10 double,col11 double,col12 double,col13 double,col14 double,col15 string,col16 string,col17 string,col18 string,col19 string,col20 string,col21 string,col22 string,col23 string,col24 string,col25 string,col26 string,col27 string,col28 string,col29 string,col30 string,col31 string,col32 string,col33 string,col34 string,col35 string,col36 string,col37 string,col38 string,col39 string,col40 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/test/day_22';

DROP TABLE IF EXISTS criteo.criteo_test_day_23;
CREATE TABLE criteo.criteo_test_day_23 (
col1 string,col2 double,col3 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9 double,col10 double,col11 double,col12 double,col13 double,col14 double,col15 string,col16 string,col17 string,col18 string,col19 string,col20 string,col21 string,col22 string,col23 string,col24 string,col25 string,col26 string,col27 string,col28 string,col29 string,col30 string,col31 string,col32 string,col33 string,col34 string,col35 string,col36 string,col37 string,col38 string,col39 string,col40 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb://criteo@azuremlsampleexperiments.blob.core.windows.net/raw/test/day_23';

All these tables are external: they simply point to their Azure Blob Storage (wasb) locations.

There are two ways to execute ANY Hive query:

  • Using the Hive REPL command line: The first is to issue a "hive" command and then copy and paste a query at the Hive REPL command line:

    cd %hive_home%\bin
    hive
    

    Now, at the REPL command line, cutting and pasting the query executes it.

  • Saving queries to a file and executing the command: The second is to save the queries to a '.hql' file (sample_hive_create_criteo_database_and_tables.hql) and then issue the following command to execute the query:

    hive -f C:\temp\sample_hive_create_criteo_database_and_tables.hql
    

Confirm database and table creation

Next, confirm the creation of the database with the following command from the Hive bin/ directory prompt:

hive -e "show databases;"

This gives:

criteo
default
Time taken: 1.25 seconds, Fetched: 2 row(s)

This confirms the creation of the new database, "criteo".

To see which tables were created, simply issue the following command from the Hive bin/ directory prompt:

hive -e "show tables in criteo;"

You should then see the following output:

criteo_count
criteo_test_day_22
criteo_test_day_23
criteo_train
Time taken: 1.437 seconds, Fetched: 4 row(s)

Data exploration in Hive

Now you are ready to do some basic data exploration in Hive. Begin by counting the number of examples in the train and test data tables.

Number of train examples

The contents of sample_hive_count_train_table_examples.hql are shown here:

SELECT COUNT(*) FROM criteo.criteo_train;

This yields:

192215183
Time taken: 264.154 seconds, Fetched: 1 row(s)

Alternatively, you may also issue the following command from the Hive bin/ directory prompt:

hive -f C:\temp\sample_hive_count_criteo_train_table_examples.hql

Number of test examples in the two test datasets

Now count the number of examples in the two test datasets. The contents of sample_hive_count_criteo_test_day_22_table_examples.hql are:

SELECT COUNT(*) FROM criteo.criteo_test_day_22;

This yields:

189747893
Time taken: 267.968 seconds, Fetched: 1 row(s)

As usual, you may also call the script from the Hive bin/ directory prompt by issuing the command:

hive -f C:\temp\sample_hive_count_criteo_test_day_22_table_examples.hql

Finally, examine the number of test examples in the test dataset based on day_23.

The command to do this is similar to the one already shown (refer to sample_hive_count_criteo_test_day_23_examples.hql):

SELECT COUNT(*) FROM criteo.criteo_test_day_23;

This gives:

178274637
Time taken: 253.089 seconds, Fetched: 1 row(s)

Label distribution in the train dataset

The label distribution in the train dataset is of interest. To see it, here are the contents of sample_hive_criteo_label_distribution_train_table.hql:

SELECT Col1, COUNT(*) AS CT FROM criteo.criteo_train GROUP BY Col1;

This yields the label distribution:

1       6292903
0       185922280
Time taken: 459.435 seconds, Fetched: 2 row(s)

The percentage of positive labels is about 3.3% (6,292,903 / 192,215,183 ≈ 0.033), consistent with the CTR of the original dataset.
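You can also compute the CTR directly, rather than deriving it from the two label counts. The following query is not one of the walkthrough's scripts, but a quick check that should return approximately 0.033; it assumes that clicks are stored as the string "1" in Col1, as in the excerpts above:

SELECT AVG(CASE WHEN Col1 = '1' THEN 1.0 ELSE 0.0 END) AS ctr FROM criteo.criteo_train;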

Histogram distributions of some numeric variables in the train dataset

You can use Hive's native "histogram_numeric" function to find out what the distribution of the numeric variables looks like. Here are the contents of sample_hive_criteo_histogram_numeric.hql:

SELECT CAST(hist.x as int) as bin_center, CAST(hist.y as bigint) as bin_height FROM
    (SELECT
    histogram_numeric(col2, 20) as col2_hist
    FROM
    criteo.criteo_train
    ) a
    LATERAL VIEW explode(col2_hist) exploded_table as hist;

This yields the following:

26      155878415
2606    92753
6755    22086
11202   6922
14432   4163
17815   2488
21072   1901
24113   1283
27429   1225
30818   906
34512   723
38026   387
41007   290
43417   312
45797   571
49819   428
53505   328
56853   527
61004   160
65510   3446
Time taken: 317.851 seconds, Fetched: 20 row(s)

The LATERAL VIEW - explode combination in Hive produces a SQL-like, row-per-bin output instead of the usual list. In this table, the first column corresponds to the bin center and the second to the bin frequency.

Approximate percentiles of some numeric variables in the train dataset

Also of interest with numeric variables is the computation of approximate percentiles. Hive's native "percentile_approx" does this for us. The contents of sample_hive_criteo_approximate_percentiles.hql are:

SELECT MIN(Col2) AS Col2_min, PERCENTILE_APPROX(Col2, 0.1) AS Col2_01, PERCENTILE_APPROX(Col2, 0.3) AS Col2_03, PERCENTILE_APPROX(Col2, 0.5) AS Col2_median, PERCENTILE_APPROX(Col2, 0.8) AS Col2_08, MAX(Col2) AS Col2_max FROM criteo.criteo_train;

This yields:

1.0     2.1418600917169246      2.1418600917169246    6.21887086390288 27.53454893115633       65535.0
Time taken: 564.953 seconds, Fetched: 1 row(s)

The percentiles summarize the same distribution that the histogram shows. Here they confirm that Col2 is heavily right-skewed, with a median of about 6.2 but a maximum of 65535.

Find number of unique values for some categorical columns in the train dataset

Continuing the data exploration, find the number of unique values that some categorical columns take. To do this, here are the contents of sample_hive_criteo_unique_values_categoricals.hql:

SELECT COUNT(DISTINCT(Col15)) AS num_uniques FROM criteo.criteo_train;

This yields:

19011825
Time taken: 448.116 seconds, Fetched: 1 row(s)

Col15 has 19 million unique values! Using naive techniques like "one-hot encoding" to encode such high-dimensional categorical variables is not feasible. Instead, a powerful, robust technique called Learning With Counts, which tackles this problem efficiently, is explained and demonstrated.

To close this subsection, look at the number of unique values for some other categorical columns as well. The contents of sample_hive_criteo_unique_values_multiple_categoricals.hql are:

SELECT COUNT(DISTINCT(Col16)), COUNT(DISTINCT(Col17)),
COUNT(DISTINCT(Col18)), COUNT(DISTINCT(Col19)), COUNT(DISTINCT(Col20))
FROM criteo.criteo_train;

This yields:

30935   15200   7349    20067   3
Time taken: 1933.883 seconds, Fetched: 1 row(s)

Again, note that except for Col20, all the other columns have many unique values.

Co-occurrence counts of pairs of categorical variables in the train dataset

The count distributions of pairs of categorical variables are also of interest. These can be determined using the code in sample_hive_criteo_paired_categorical_counts.hql:

SELECT Col15, Col16, COUNT(*) AS paired_count FROM criteo.criteo_train GROUP BY Col15, Col16 ORDER BY paired_count DESC LIMIT 15;

The pairs are ordered from most to least frequent, and the top 15 are examined in this case. This gives us:

ad98e872        cea68cd3        8964458
ad98e872        3dbb483e        8444762
ad98e872        43ced263        3082503
ad98e872        420acc05        2694489
ad98e872        ac4c5591        2559535
ad98e872        fb1e95da        2227216
ad98e872        8af1edc8        1794955
ad98e872        e56937ee        1643550
ad98e872        d1fade1c        1348719
ad98e872        977b4431        1115528
e5f3fd8d        a15d1051        959252
ad98e872        dd86c04a        872975
349b3fec        a52ef97d        821062
e5f3fd8d        a0aaffa6        792250
265366bf        6f5c7c41        782142
Time taken: 560.22 seconds, Fetched: 15 row(s)

Down sample the datasets for Azure Machine Learning

Having explored the datasets and demonstrated how to do this type of exploration for any variables (including combinations), down-sample the datasets so that models can be built in Azure Machine Learning. Recall that the focus of the problem is: given a set of example attributes (feature values from Col2 - Col40), predict whether Col1 is a 0 (no click) or a 1 (click).

To down-sample the train and test datasets to 1% of the original size, use Hive's native RAND() function. The next script, sample_hive_criteo_downsample_train_dataset.hql, does this for the train dataset:

CREATE TABLE criteo.criteo_train_downsample_1perc (
col1 string,col2 double,col3 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9 double,col10 double,col11 double,col12 double,col13 double,col14 double,col15 string,col16 string,col17 string,col18 string,col19 string,col20 string,col21 string,col22 string,col23 string,col24 string,col25 string,col26 string,col27 string,col28 string,col29 string,col30 string,col31 string,col32 string,col33 string,col34 string,col35 string,col36 string,col37 string,col38 string,col39 string,col40 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

---Now downsample and store in this table

INSERT OVERWRITE TABLE criteo.criteo_train_downsample_1perc SELECT * FROM criteo.criteo_train WHERE RAND() <= 0.01;

This yields:

Time taken: 12.22 seconds
Time taken: 298.98 seconds
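As a quick sanity check (not part of the original scripts), counting the rows of the down-sampled table should return roughly 1% of the 192,215,183 train examples, that is, about 1.9 million rows:

hive -e "SELECT COUNT(*) FROM criteo.criteo_train_downsample_1perc;"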

The script sample_hive_criteo_downsample_test_day_22_dataset.hql does the same for the test data from day_22:

--- Now for test data (day_22)

CREATE TABLE criteo.criteo_test_day_22_downsample_1perc (
col1 string,col2 double,col3 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9 double,col10 double,col11 double,col12 double,col13 double,col14 double,col15 string,col16 string,col17 string,col18 string,col19 string,col20 string,col21 string,col22 string,col23 string,col24 string,col25 string,col26 string,col27 string,col28 string,col29 string,col30 string,col31 string,col32 string,col33 string,col34 string,col35 string,col36 string,col37 string,col38 string,col39 string,col40 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

INSERT OVERWRITE TABLE criteo.criteo_test_day_22_downsample_1perc SELECT * FROM criteo.criteo_test_day_22 WHERE RAND() <= 0.01;

This yields:

Time taken: 1.22 seconds
Time taken: 317.66 seconds

Finally, the script sample_hive_criteo_downsample_test_day_23_dataset.hql does the same for the test data from day_23:

--- Finally test data day_23
CREATE TABLE criteo.criteo_test_day_23_downsample_1perc (
col1 string,col2 double,col3 double,col4 double,col5 double,col6 double,col7 double,col8 double,col9 double,col10 double,col11 double,col12 double,col13 double,col14 double,col15 string,col16 string,col17 string,col18 string,col19 string,col20 string,col21 string,col22 string,col23 string,col24 string,col25 string,col26 string,col27 string,col28 string,col29 string,col30 string,col31 string,col32 string,col33 string,col34 string,col35 string,col36 string,col37 string,col38 string,col39 string,col40 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

INSERT OVERWRITE TABLE criteo.criteo_test_day_23_downsample_1perc SELECT * FROM criteo.criteo_test_day_23 WHERE RAND() <= 0.01;

This yields:

Time taken: 1.86 seconds
Time taken: 300.02 seconds

With this, you are ready to use the down-sampled train and test datasets for building models in Azure Machine Learning.

There is one final important component before moving on to Azure Machine Learning: the count table. The next subsection discusses it in some detail.

A brief discussion on the count table

As you saw, several categorical variables have very high dimensionality. The walkthrough presents a powerful technique called Learning With Counts to encode these variables in an efficient, robust manner. More information on this technique is in the link provided.

Note

This walkthrough focuses on using count tables to produce compact representations of high-dimensional categorical features. This is not the only way to encode categorical features; for more information on other techniques, interested users can check out one-hot encoding and feature hashing.

To build count tables on the count data, use the data in the folder raw/count. In the modeling section, users are shown how to build these count tables for categorical features from scratch, or alternatively to use a pre-built count table for their explorations. In what follows, "pre-built count tables" means using the count tables that have been provided. Detailed instructions on how to access these tables are given in the next section.
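To make the idea concrete, here is a minimal sketch (not one of the walkthrough's scripts) of the kind of aggregation that underlies a count table, expressed in Hive for a single categorical column. The per-value click and non-click counts are what count-based features such as log-odds are later derived from; the output table name is illustrative:

-- Illustrative only: per-value click/non-click counts for Col15,
-- computed on the data reserved for counts (days day_00 to day_20).
CREATE TABLE criteo.col15_counts AS
SELECT
    Col15,
    SUM(CASE WHEN Col1 = '1' THEN 1 ELSE 0 END) AS click_count,
    SUM(CASE WHEN Col1 = '0' THEN 1 ELSE 0 END) AS nonclick_count
FROM criteo.criteo_count
GROUP BY Col15;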

Build a model with Azure Machine Learning

The model-building process in Azure Machine Learning follows these steps:

  1. Get the data from Hive tables into Azure Machine Learning
  2. Create the experiment: clean the data and featurize it with count tables
  3. Build, train, and score the model
  4. Evaluate the model
  5. Publish the model as a web service

Now you are ready to build models in Azure Machine Learning Studio. The down-sampled data is saved as Hive tables in the cluster. Use the Azure Machine Learning Import Data module to read this data. The credentials for accessing the storage account of this cluster are provided in what follows.

Step 1: Get data from Hive tables into Azure Machine Learning using the Import Data module and select it for a machine learning experiment

Start by selecting +NEW -> EXPERIMENT -> Blank Experiment. Then, from the Search box on the top left, search for "Import Data". Drag and drop the Import Data module onto the experiment canvas (the middle portion of the screen) to use the module for data access.

This is what the Import Data module looks like while getting data from the Hive table:

Import Data getting data

For the Import Data module, the parameter values provided in the graphic are just examples of the sort of values you need to provide. Here is some general guidance on how to fill out the parameter set for the Import Data module:

  1. Choose "Hive query" for Data Source.
  2. In the Hive database query box, a simple SELECT * FROM <your_database_name.your_table_name> is enough.
  3. HCatalog server URI: If your cluster is "abc", then this is simply: https://abc.azurehdinsight.cn
  4. Hadoop user account name: The user name chosen when commissioning the cluster (NOT the Remote Access user name!).
  5. Hadoop user account password: The password for the user name chosen when commissioning the cluster (NOT the Remote Access password!).
  6. Location of output data: Choose "Azure".
  7. Azure Storage account name: The storage account associated with the cluster.
  8. Azure Storage account key: The key of the storage account associated with the cluster.
  9. Azure container name: If the cluster name is "abc", then this is usually simply "abc".

Once Import Data finishes getting the data (you see the green check mark on the module), save this data as a Dataset (with a name of your choice). This is what that looks like:

Import Data saving data

Right-click the output port of the Import Data module. This reveals a Save as dataset option and a Visualize option. The Visualize option, if clicked, displays 100 rows of the data, along with a right panel with some useful summary statistics. To save the data, simply select Save as dataset and follow the instructions.

To select the saved dataset for use in a machine learning experiment, locate it using the Search box shown in the following figure. Then simply type out part of the name you gave the dataset to access it, and drag the dataset onto the main panel. Dropping it onto the main panel selects it for use in machine learning modeling.

Drag and drop the dataset onto the main panel

Note

Do this for both the train and the test datasets. Also, remember to use the database name and table names that you created for this purpose. The values used in the figure are solely for illustration purposes.

Step 2: Create an experiment in Azure Machine Learning to predict clicks / no clicks

The Azure Machine Learning Studio (classic) experiment looks like this:

Machine learning experiment

Now examine the key components of this experiment. First, drag the saved train and test datasets onto the experiment canvas.

Clean Missing Data

The Clean Missing Data module does what its name suggests: it cleans missing data in ways that can be user-specified. Look into this module to see this:

Clean Missing Data

Here, choose to replace all missing values with a 0. There are other options as well, which can be seen in the dropdowns in the module.

Feature engineering on the data

There can be millions of unique values for some categorical features of large datasets. Using naive methods such as one-hot encoding for representing such high-dimensional categorical features is entirely unfeasible. This walkthrough demonstrates how to use count features, built with Azure Machine Learning modules, to generate compact representations of these high-dimensional categorical variables. The end result is a smaller model size, faster training times, and performance metrics comparable to other techniques.

Building counting transforms

To build count features, use the Build Counting Transform module that is available in Azure Machine Learning. The module looks like this:

Build Counting Transform module properties / Build Counting Transform module

Important

In the Count columns box, enter the columns that you wish to perform counts on. Typically, these are (as mentioned) high-dimensional categorical columns. Remember that the Criteo dataset has 26 categorical columns: Col15 through Col40. Here, count on all of them and give their indices (15 through 40, separated by commas, as shown).

To use the module in MapReduce mode (appropriate for large datasets), you need access to an HDInsight Hadoop cluster (the one used for feature exploration can be reused for this purpose) and its credentials. The previous figures illustrate what the filled-in values look like (replace the values provided for illustration with those relevant to your own use case).

Module parameters

The preceding figure shows how to enter the input blob location. This location holds the data reserved for building count tables on.

When this module finishes running, save the transform for later use by right-clicking the module and selecting the Save as Transform option:

Save as Transform option

In the experiment architecture shown above, the dataset "ytransform2" corresponds precisely to a saved count transform. For the remainder of this experiment, it is assumed that the reader used a Build Counting Transform module on some data to generate counts, and can then use those counts to generate count features on the train and test datasets.

Choosing which count features to include in the train and test datasets

Once a count transform is ready, the user can choose which features to include in the train and test datasets using the Modify Count Table Parameters module. For completeness, this module is shown here, but in the interest of simplicity it is not actually used in the experiment.

Modify Count Table Parameters

In this case, as can be seen, the log-odds are used and the backoff column is ignored. You can also set parameters such as the garbage-bin threshold, the number of pseudo-prior examples to add for smoothing, and whether to use any Laplacian noise. All of these are advanced features, and the default values are a good starting point for users who are new to this type of feature generation.
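As a rough sketch of what the smoothing options control (the module's exact formula is not spelled out in this walkthrough, so treat this as an assumption): if N_pos(v) and N_neg(v) are the click and non-click counts for a categorical value v, k is the number of pseudo-prior examples, and p is the prior click probability, a smoothed log-odds feature has the general form

    log_odds(v) = log( (N_pos(v) + k * p) / (N_neg(v) + k * (1 - p)) )

where the pseudo-prior examples keep the feature well-behaved for rare values of v.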

Data transformation before generating the count features

Now for an important point about transforming the train and test data before actually generating the count features: there are two Execute R Script modules used before the count transform is applied to the data.

Execute R Script modules

Here is the first R script:

First R script

This R script renames the columns to "Col1" through "Col40", because the count transform expects names of this format.

The second R script balances the distribution between the positive and negative classes (classes 1 and 0, respectively) by down-sampling the negative class. The R script shown here does this:

Second R script

In this simple R script, "pos_neg_ratio" sets the amount of balance between the positive and negative classes. Improving the class balance is important because it usually benefits performance in classification problems where the class distribution is skewed (recall that in this case the data is 3.3% positive class and 96.7% negative class).
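Since the script itself appears only in the figure, here is an illustrative rendering of the same balancing idea, written in HiveQL for consistency with the rest of this walkthrough. The experiment performs the equivalent in the Execute R Script module, and the 0.1 sampling rate below is an assumed stand-in for whatever ratio "pos_neg_ratio" implies:

-- Illustrative only: keep all positives, down-sample negatives.
SELECT * FROM (
    SELECT * FROM criteo.criteo_train_downsample_1perc WHERE Col1 = '1'
    UNION ALL
    SELECT * FROM criteo.criteo_train_downsample_1perc
    WHERE Col1 = '0' AND RAND() <= 0.1
) balanced;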

Applying the count transformation on the data

Finally, use the Apply Transformation module to apply the count transforms to the train and test datasets. This module takes the saved count transform as one input and the train or test dataset as the other input, and returns data with count features. It is shown here:

Apply Transformation module

An excerpt of what the count features look like

It is instructive to see what the count features look like in this case. Here is an excerpt:

Count features

This excerpt shows that for the columns counted on, you get the counts and log-odds, in addition to any relevant backoff values.

You are now ready to build an Azure Machine Learning model using these transformed datasets. The next section shows how this can be done.

Step 3: Build, train, and score the model

Choice of learner

First, you need to choose a learner. Use a Two-Class Boosted Decision Tree as the learner. Here are the default options for this learner:

Two-Class Boosted Decision Tree parameters

For the experiment, choose the default values. The defaults are meaningful and a good way to get a quick baseline on performance. Once you have a baseline, you can improve performance by sweeping parameters if you choose.

Train the model

For training, simply invoke a Train Model module. Its two inputs are the Two-Class Boosted Decision Tree learner and the train dataset. This is shown here:

Train Model module

Score the model

Once you have a trained model, you are ready to score the test dataset and evaluate its performance. Do this by using the Score Model module shown in the following figure, along with an Evaluate Model module:

Score Model module

Step 4: Evaluate the model

Finally, analyze model performance. Usually, for two-class (binary) classification problems, a good measure of prediction accuracy is the Area Under the Curve (AUC). To visualize this curve, connect the Score Model module to an Evaluate Model module, and then click Visualize on the Evaluate Model module to produce a graphic like the following one:

Evaluate Model module, BDT model

To see the results of using this model on the test dataset, right-click the output port of the Evaluate Model module and then click Visualize.

Visualize the Evaluate Model module

Step 5: Publish the model as a Web service

The ability to publish an Azure Machine Learning model as a web service with a minimum of fuss is a valuable feature for making it widely available. Once that is done, anyone can call the web service with the input data they need predictions for, and the web service uses the model to return those predictions.

First, save the trained model as a Trained Model object by right-clicking the Train Model module and using the Save as Trained Model option.

Next, create input and output ports for the web service:

  • an input port that takes data in the same form as the data you need predictions for
  • an output port that returns the Scored Labels and the associated probabilities

Select a few rows of data for the input port

It is convenient to use an Apply SQL Transformation module to select just 10 rows to serve as the input port data. Select these rows of data for the input port using the SQL query shown here:

Input port data
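The exact query appears in the figure. As a minimal sketch of the kind of query involved (the Apply SQL Transformation module exposes its first input as a table named t1):

SELECT * FROM t1 LIMIT 10;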

Web service

Now you are ready to run a small experiment that can be used to publish the web service.

Generate input data for the webservice

As a preliminary step, since the count table is large, take a few lines of test data and generate output data from them with count features. This output serves as the input data format for the webservice, as shown here:

Create BDT input data

Note

For the input data format, use the OUTPUT of the Count Featurizer module. Once this experiment finishes running, save the output from the Count Featurizer module as a Dataset. This Dataset is used as the input data in the webservice.

Scoring experiment for publishing the webservice

The essential structure is a Score Model module that accepts the trained model object and the few lines of input data generated in the previous steps with the Count Featurizer module. Use "Select Columns in Dataset" to project out the Scored Labels and the Score Probabilities.

Select Columns in Dataset

Notice how the Select Columns in Dataset module can be used to 'filter' data out of a dataset. Its contents are shown here:

Filtering with Select Columns in Dataset

To get the blue input and output ports, simply click prepare webservice at the bottom right. Running this experiment also allows you to publish the web service: click the PUBLISH WEB SERVICE icon at the bottom right, shown here:

Publish web service

Once the webservice is published, you are redirected to a page that looks like this:

Web service dashboard

Notice the two links for webservices on the left side:

  • The REQUEST/RESPONSE Service (RRS) is meant for single predictions and is what is used in this walkthrough.
  • The BATCH EXECUTION Service (BES) is used for batch predictions and requires that the input data used to make predictions reside in Azure Blob Storage.

Clicking the REQUEST/RESPONSE link takes you to a page that provides pre-canned code in C#, Python, and R. This code can conveniently be used for making calls to the webservice. The API key on this page is needed for authentication.

It is convenient to copy this Python code over to a new cell in the IPython notebook.

Here is a segment of the Python code with the correct API key:

Python code

The default API key has been replaced with the webservice's API key. Clicking Run on this cell in an IPython notebook yields the following response:

IPython response

For the two test examples submitted in the Python script's JSON framework, the answers come back in the form "Scored Labels, Scored Probabilities". In this case, the defaults provided by the pre-canned code were used (0 for all numeric columns and the string "value" for all categorical columns).

In conclusion, this walkthrough shows how to handle a large-scale dataset with Azure Machine Learning: you started with a terabyte of data, constructed a prediction model, and deployed it as a web service in the cloud.