The Team Data Science Process in action: Use Azure HDInsight Hadoop clusters

In this walkthrough, we use the Team Data Science Process (TDSP) in an end-to-end scenario. We use an Azure HDInsight Hadoop cluster to store, explore, and feature-engineer data from the publicly available NYC Taxi Trips dataset, and to down-sample the data. To handle binary and multiclass classification and regression predictive tasks, we build models of the data with Azure Machine Learning.

For a walkthrough that shows how to handle a larger dataset, see Team Data Science Process - Using Azure HDInsight Hadoop Clusters on a 1 TB dataset.

You can also use an IPython notebook to accomplish the tasks presented in the walkthrough that uses the 1 TB dataset. For more information, see Criteo walkthrough using a Hive ODBC connection.

NYC Taxi Trips dataset description

The NYC Taxi Trips data is about 20 GB of compressed comma-separated values (CSV) files (about 48 GB uncompressed), comprising more than 173 million individual trips and the fares paid for each trip. Each trip record includes the pick-up and drop-off location and time, an anonymized hack (driver's) license number, and a medallion number (the taxi's unique ID). The data covers all trips in the year 2013, and is provided in the following two datasets for each month:

  • The trip_data CSV files contain trip details: the number of passengers, pick-up and drop-off points, trip duration, and trip length. Here are a few sample records:

    medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude

    89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171

    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066

    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002

    DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388

    DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868

  • The trip_fare CSV files contain details of the fare paid for each trip: payment type, fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here are a few sample records:

    medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount

    89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7

    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7

    0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7

    DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6

    DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5

The unique key used to join trip_data and trip_fare is composed of the fields medallion, hack_license, and pickup_datetime. To get all of the details relevant to a particular trip, it is sufficient to join on these three keys.
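
As a concrete illustration, here is a minimal sketch of such a join, written against the Hive tables nyctaxidb.trip and nyctaxidb.fare that are created later in this walkthrough (so the database and table names are assumptions at this point):

SELECT t.medallion, t.pickup_datetime, t.trip_distance, f.fare_amount, f.tip_amount
FROM nyctaxidb.trip t
JOIN nyctaxidb.fare f
  ON t.medallion = f.medallion
 AND t.hack_license = f.hack_license
 AND t.pickup_datetime = f.pickup_datetime
LIMIT 10;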

Examples of prediction tasks

Determine the kind of predictions you want to make based on data analysis, to help clarify the required process tasks. Here are three examples of prediction problems that we address in this walkthrough, all based on the tip_amount field:

  • Binary classification: Predict whether or not a tip was paid for a trip. That is, a tip_amount greater than $0 is a positive example, while a tip_amount of $0 is a negative example.

    • Class 0: tip_amount = $0
    • Class 1: tip_amount > $0
  • Multiclass classification: Predict the range of tip amounts paid for the trip. We divide the tip_amount into five classes:

    • Class 0: tip_amount = $0
    • Class 1: tip_amount > $0 and tip_amount <= $5
    • Class 2: tip_amount > $5 and tip_amount <= $10
    • Class 3: tip_amount > $10 and tip_amount <= $20
    • Class 4: tip_amount > $20
  • Regression task: Predict the amount of the tip paid for a trip.

Set up an HDInsight Hadoop cluster for advanced analytics

Note

This is typically an admin task.

You can set up an Azure environment for advanced analytics that employs an HDInsight cluster in three steps:

  1. Create a storage account: This storage account is used to store data in Azure Blob storage. The data used in HDInsight clusters also resides here.

  2. Customize Azure HDInsight Hadoop clusters for the Advanced Analytics Process and Technology: This step creates an HDInsight Hadoop cluster with 64-bit Anaconda Python 2.7 installed on all nodes. There are two important steps to remember while customizing your HDInsight cluster:

    • Remember to link the storage account created in step 1 with your HDInsight cluster when you create it. This storage account accesses the data that is processed within the cluster.
    • After you create the cluster, enable Remote Access to its head node. Browse to the Configuration tab, and select Enable Remote. This step specifies the user credentials used for remote login.
  3. Create an Azure Machine Learning workspace: You use this workspace to build machine learning models. This task is addressed after you complete an initial data exploration and down-sampling by using the HDInsight cluster.

Get the data from a public source

Note

This is typically an admin task.

To copy the NYC Taxi Trips dataset to your machine from its public location, use any of the methods described in Move data to and from Azure Blob storage.

Here, we describe how to use AzCopy to transfer the files containing the data. To download and install AzCopy, follow the instructions at Getting started with the AzCopy command-line utility.

  1. From a command prompt window, run the following AzCopy command, replacing <path_to_data_folder> with the desired destination:

    "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy" /Source:https://nyctaxitrips.blob.core.chinacloudapi.cn/data /Dest:<path_to_data_folder> /S
    
  2. When the copy completes, a total of 24 zipped files appear in the chosen data folder. Unzip the downloaded files to the same directory on your local machine. Make a note of the folder where the uncompressed files reside. This folder is referred to as <path_to_unzipped_data_files> in what follows.
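
If you prefer to script the extraction rather than unzip the files by hand, the following one-liner (run from an interactive command prompt) is a minimal sketch; it assumes PowerShell 5.0 or later is available for the Expand-Archive cmdlet:

for %f in (<path_to_data_folder>\*.zip) do powershell -NoProfile -Command "Expand-Archive -Path '%f' -DestinationPath '<path_to_data_folder>'"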

Upload the data to the default container of the HDInsight Hadoop cluster

Note

This is typically an admin task.

In the following AzCopy commands, replace the parameters with the actual values that you specified when you created the Hadoop cluster and unzipped the data files:

  • <path_to_unzipped_data_files>: The directory (along with the path) on your machine that contains the unzipped data files.
  • <storage account name of Hadoop cluster>: The storage account associated with your HDInsight cluster.
  • <default container of Hadoop cluster>: The default container used by your cluster. The name of the default container is usually the same as the name of the cluster itself. For example, if the cluster is called "abc123.azurehdinsight.net", the default container is abc123.
  • <storage account key>: The key for the storage account used by your cluster.

From a command prompt or a Windows PowerShell window, run the following two AzCopy commands.

This command uploads the trip data to the nyctaxitripraw directory in the default container of the Hadoop cluster:

"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy" /Source:<path_to_unzipped_data_files> /Dest:https://<storage account name of Hadoop cluster>.blob.core.chinacloudapi.cn/<default container of Hadoop cluster>/nyctaxitripraw /DestKey:<storage account key> /S /Pattern:trip_data_*.csv

This command uploads the fare data to the nyctaxifareraw directory in the default container of the Hadoop cluster:

"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy" /Source:<path_to_unzipped_data_files> /Dest:https://<storage account name of Hadoop cluster>.blob.core.chinacloudapi.cn/<default container of Hadoop cluster>/nyctaxifareraw /DestKey:<storage account key> /S /Pattern:trip_fare_*.csv

The data should now be in Blob storage, and ready to be consumed within the HDInsight cluster.

Sign in to the head node of the Hadoop cluster and prepare for exploratory data analysis

Note

This is typically an admin task.

To access the head node of the cluster for exploratory data analysis and down-sampling of the data, follow the procedure outlined in Access the head node of Hadoop cluster.

In this walkthrough, we primarily use queries written in Hive, a SQL-like query language, to perform preliminary data explorations. The Hive queries are stored in '.hql' files. We then down-sample this data for use in building models in Machine Learning.

To prepare the cluster for exploratory data analysis, download the '.hql' files containing the relevant Hive scripts from GitHub to a local directory (C:\temp) on the head node. Open the command prompt from within the head node of the cluster, and run the following two commands:

set script='https://raw.githubusercontent.com/Azure/Azure-MachineLearning-DataScience/master/Misc/DataScienceProcess/DataScienceScripts/Download_DataScience_Scripts.ps1'

@powershell -NoProfile -ExecutionPolicy unrestricted -Command "iex ((new-object net.webclient).DownloadString(%script%))"

These two commands download all the '.hql' files needed in this walkthrough to the local directory C:\temp\ on the head node.

Create a Hive database and tables partitioned by month

Note

This task is typically for an admin.

You are now ready to create Hive tables for the NYC taxi dataset. In the head node of the Hadoop cluster, open the Hadoop command line on the desktop. Enter the Hive directory by running the following command:

cd %hive_home%\bin

Note

Run all Hive commands in this walkthrough from the Hive bin/ directory prompt. This handles any path issues automatically. We use the terms "Hive directory prompt", "Hive bin/ directory prompt", and "Hadoop command line" interchangeably in this walkthrough.

From the Hive directory prompt, run the following command to create the Hive database and tables:

hive -f "C:\temp\sample_hive_create_db_and_tables.hql"

Here is the content of the C:\temp\sample_hive_create_db_and_tables.hql file, which creates the Hive database nyctaxidb and the tables trip and fare:

create database if not exists nyctaxidb;

create external table if not exists nyctaxidb.trip
(
    medallion string,
    hack_license string,
    vendor_id string,
    rate_code string,
    store_and_fwd_flag string,
    pickup_datetime string,
    dropoff_datetime string,
    passenger_count int,
    trip_time_in_secs double,
    trip_distance double,
    pickup_longitude double,
    pickup_latitude double,
    dropoff_longitude double,
    dropoff_latitude double)  
PARTITIONED BY (month int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/trip' TBLPROPERTIES('skip.header.line.count'='1');

create external table if not exists nyctaxidb.fare
(
    medallion string,
    hack_license string,
    vendor_id string,
    pickup_datetime string,
    payment_type string,
    fare_amount double,
    surcharge double,
    mta_tax double,
    tip_amount double,
    tolls_amount double,
    total_amount double)
PARTITIONED BY (month int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 'wasb:///nyctaxidbdata/fare' TBLPROPERTIES('skip.header.line.count'='1');

This Hive script creates two tables:

  • The trip table contains trip details for each ride (driver details, pick-up time, trip distance, and times).
  • The fare table contains fare details (fare amount, tip amount, tolls, and surcharge).
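
After the script runs, you can optionally sanity-check the schemas from the Hadoop command line:

hive -e "describe nyctaxidb.trip;"
hive -e "describe nyctaxidb.fare;"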

If you need any additional assistance with these procedures, or you want to investigate alternative ones, see the section Submit Hive queries directly from the Hadoop command line.

Load data into the Hive tables by partition

Note

This task is typically for an admin.

The NYC taxi dataset has a natural partitioning by month, which we use to enable faster processing and query times. The following loop (issued from the Hive directory prompt in the Hadoop command line) loads the data into the trip and fare Hive tables, partitioned by month:

for /L %i IN (1,1,12) DO (hive -hiveconf MONTH=%i -f "C:\temp\sample_hive_load_data_by_partitions.hql")

The sample_hive_load_data_by_partitions.hql file contains the following LOAD commands:

LOAD DATA INPATH 'wasb:///nyctaxitripraw/trip_data_${hiveconf:MONTH}.csv' INTO TABLE nyctaxidb.trip PARTITION (month=${hiveconf:MONTH});
LOAD DATA INPATH 'wasb:///nyctaxifareraw/trip_fare_${hiveconf:MONTH}.csv' INTO TABLE nyctaxidb.fare PARTITION (month=${hiveconf:MONTH});
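
Note that Hive's LOAD DATA INPATH moves (rather than copies) the source files into each table's storage location, so the nyctaxitripraw and nyctaxifareraw directories are empty after the load completes. Re-upload the raw files if you need to reload a partition.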

A number of the Hive queries used here in the exploration process involve looking at only one or two partitions, but they can also be run across the entire dataset.

Show databases in the HDInsight Hadoop cluster

To show the databases created in the HDInsight Hadoop cluster, run the following command in the Hadoop command line:

hive -e "show databases;"

Show the Hive tables in the nyctaxidb database

To show the tables in the nyctaxidb database, run the following command in the Hadoop command line:

hive -e "show tables in nyctaxidb;"

We can confirm that the trip table is partitioned by running the following command:

hive -e "show partitions nyctaxidb.trip;"

Here is the expected output:

month=1
month=10
month=11
month=12
month=2
month=3
month=4
month=5
month=6
month=7
month=8
month=9
Time taken: 2.075 seconds, Fetched: 12 row(s)

Similarly, we can ensure that the fare table is partitioned by running the following command:

hive -e "show partitions nyctaxidb.fare;"

Here is the expected output:

month=1
month=10
month=11
month=12
month=2
month=3
month=4
month=5
month=6
month=7
month=8
month=9
Time taken: 1.887 seconds, Fetched: 12 row(s)

Data exploration and feature engineering in Hive

Note

This is typically a data scientist task.

You can use Hive queries to accomplish data exploration and feature engineering tasks for the data loaded into the Hive tables. Here are examples of such tasks:

  • View the top 10 records in both tables.
  • Explore data distributions of a few fields in varying time windows.
  • Investigate data quality of the longitude and latitude fields.
  • Generate binary and multiclass classification labels based on the tip_amount.
  • Generate features by computing the direct trip distances.

Exploration: View the top 10 records in the trip table

Note

This is typically a data scientist task.

To see what the data looks like, examine 10 records from each table. To inspect the records, run the following two queries separately from the Hive directory prompt in the Hadoop command-line console.

To get the top 10 records in the trip table from the first month:

hive -e "select * from nyctaxidb.trip where month=1 limit 10;"

To get the top 10 records in the fare table from the first month:

hive -e "select * from nyctaxidb.fare where month=1 limit 10;"

You can save the records to a file for convenient viewing with a small change to the preceding query:

hive -e "select * from nyctaxidb.fare where month=1 limit 10;" > C:\temp\testoutput

Exploration: View the number of records in each of the 12 partitions

Note

This is typically a data scientist task.

Of interest is how the number of trips varies during the calendar year. Grouping by month shows the distribution of trips:

hive -e "select month, count(*) from nyctaxidb.trip group by month;"

This command produces the following output:

1       14776615
2       13990176
3       15749228
4       15100468
5       15285049
6       14385456
7       13823840
8       12597109
9       14107693
10      15004556
11      14388451
12      13971118
Time taken: 283.406 seconds, Fetched: 12 row(s)

Here, the first column is the month, and the second is the number of trips for that month.
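
If you also want each month's share of the annual total in a single pass, you can window over the grouped counts. A sketch, assuming your Hive version (0.11 or later) supports windowing functions:

hive -e "select month, count(*) as trips, count(*)/sum(count(*)) over () as fraction from nyctaxidb.trip group by month;"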

We can also count the total number of records in our trip dataset by running the following command at the Hive directory prompt:

hive -e "select count(*) from nyctaxidb.trip;"

This command yields:

173179759
Time taken: 284.017 seconds, Fetched: 1 row(s)

Using commands similar to those shown for the trip dataset, we can issue Hive queries from the Hive directory prompt for the fare dataset to validate the number of records:

hive -e "select month, count(*) from nyctaxidb.fare group by month;"

This command produces the following output:

1       14776615
2       13990176
3       15749228
4       15100468
5       15285049
6       14385456
7       13823840
8       12597109
9       14107693
10      15004556
11      14388451
12      13971118
Time taken: 253.955 seconds, Fetched: 12 row(s)

The exact same number of trips per month is returned for both datasets, providing a first validation that the data has been loaded correctly.

You can count the total number of records in the fare dataset by running the following command from the Hive directory prompt:

hive -e "select count(*) from nyctaxidb.fare;"

This command yields:

173179759
Time taken: 186.683 seconds, Fetched: 1 row(s)

The total number of records in both tables is also the same, providing a second validation that the data has been loaded correctly.
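
For a stricter third check, you could verify that the two tables actually line up on the three join keys by counting the joined records. This query scans both tables in full, so expect it to take a while:

hive -e "select count(*) from nyctaxidb.trip t join nyctaxidb.fare f on (t.medallion=f.medallion and t.hack_license=f.hack_license and t.pickup_datetime=f.pickup_datetime);"

A result matching the per-table totals above would suggest a one-to-one correspondence between trip and fare records.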

Exploration: Trip distribution by medallion

Note

This analysis is typically a data scientist task.

This example identifies the medallions (taxi numbers) with more than 100 trips within a given time period. The query benefits from partitioned table access, because it is conditioned by the partition variable month. The query results are written to a local file, queryoutput.tsv, in C:\temp on the head node.

hive -f "C:\temp\sample_hive_trip_count_by_medallion.hql" > C:\temp\queryoutput.tsv

下面是要检查的 sample_hive_trip_count_by_medallion.hql 文件的内容。Here is the content of the sample_hive_trip_count_by_medallion.hql file for inspection.

SELECT medallion, COUNT(*) as med_count
FROM nyctaxidb.fare
WHERE month<=3
GROUP BY medallion
HAVING med_count > 100
ORDER BY med_count desc;

NYC 出租车数据集中的牌照标识一辆唯一的出租车。The medallion in the NYC taxi dataset identifies a unique cab. 通过询问特定时间段内,哪些出租车的行程数超过了一定量,可以确定哪些车处于相对忙碌状态。You can identify which cabs are comparatively busy by asking which ones made more than a certain number of trips in a particular time period. 以下示例标识前三个月内行程数超过 100 的出租车,并将查询结果保存到本地文件 C:\temp\queryoutput.tsvThe following example identifies cabs that made more than a hundred trips in the first three months, and saves the query results to a local file, C:\temp\queryoutput.tsv .

下面是要检查的 sample_hive_trip_count_by_medallion.hql 文件的内容。Here is the content of the sample_hive_trip_count_by_medallion.hql file for inspection.

SELECT medallion, COUNT(*) as med_count
FROM nyctaxidb.fare
WHERE month<=3
GROUP BY medallion
HAVING med_count > 100
ORDER BY med_count desc;

From the Hive directory prompt, run the following command:

hive -f "C:\temp\sample_hive_trip_count_by_medallion.hql" > C:\temp\queryoutput.tsv

Exploration: Trip distribution by medallion and hack license

Note

This task is typically for a data scientist.

When exploring a dataset, we frequently want to examine the distributions of groups of values. This section provides an example of how to do this analysis for cabs and drivers.

The sample_hive_trip_count_by_medallion_license.hql file groups the fare dataset on medallion and hack_license, and returns counts of each combination. Here are its contents:

SELECT medallion, hack_license, COUNT(*) as trip_count
FROM nyctaxidb.fare
WHERE month=1
GROUP BY medallion, hack_license
HAVING trip_count > 100
ORDER BY trip_count desc;

This query returns cab and driver combinations, ordered by descending number of trips.

From the Hive directory prompt, run:

hive -f "C:\temp\sample_hive_trip_count_by_medallion_license.hql" > C:\temp\queryoutput.tsv

The query results are written to a local file, C:\temp\queryoutput.tsv.

Exploration: Assess data quality by checking for invalid longitude or latitude records

Note

This is typically a data scientist task.

A common objective of exploratory data analysis is to weed out invalid or bad records. The example in this section determines whether the longitude or latitude fields contain values far outside the NYC area. Because such records are likely to have erroneous longitude-latitude values, we want to eliminate them from any data that is to be used for modeling.

Here is the content of the sample_hive_quality_assessment.hql file for inspection:

    SELECT COUNT(*) FROM nyctaxidb.trip
    WHERE month=1
    AND  (CAST(pickup_longitude AS float) NOT BETWEEN -90 AND -30
    OR    CAST(pickup_latitude AS float) NOT BETWEEN 30 AND 90
    OR    CAST(dropoff_longitude AS float) NOT BETWEEN -90 AND -30
    OR    CAST(dropoff_latitude AS float) NOT BETWEEN 30 AND 90);

From the Hive directory prompt, run:

hive -S -f "C:\temp\sample_hive_quality_assessment.hql"

The -S argument included in this command suppresses the status screen printout of the Hive Map/Reduce jobs. This argument is useful because it makes the screen print of the Hive query output more readable.
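
The script as written inspects only the month=1 partition. To screen all 12 partitions the same way, you could parameterize the month and reuse the loop pattern from the data-loading step. A sketch, assuming you first edit the script to read month=${hiveconf:MONTH} instead of month=1:

for /L %i IN (1,1,12) DO (hive -S -hiveconf MONTH=%i -f "C:\temp\sample_hive_quality_assessment.hql")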

Exploration: Binary class distributions of trip tips

Note

This is typically a data scientist task.

For the binary classification problem outlined in the Examples of prediction tasks section, it is useful to know whether or not a tip was given. The distribution of tips is binary:

  • Tip given (Class 1, tip_amount > $0)
  • No tip (Class 0, tip_amount = $0)

The following sample_hive_tipped_frequencies.hql file shows the command to run:

SELECT tipped, COUNT(*) AS tip_freq
FROM
(
    SELECT if(tip_amount > 0, 1, 0) as tipped, tip_amount
    FROM nyctaxidb.fare
)tc
GROUP BY tipped;

From the Hive directory prompt, run:

hive -f "C:\temp\sample_hive_tipped_frequencies.hql"

Exploration: Class distributions in the multiclass setting

Note

This is typically a data scientist task.

For the multiclass classification problem outlined in the Examples of prediction tasks section, this dataset also lends itself to a natural classification in which we predict the amount of the tips given. We can use bins to define tip ranges in the query. To get the class distributions for the various tip ranges, use the sample_hive_tip_range_frequencies.hql file. Here are its contents:

SELECT tip_class, COUNT(*) AS tip_freq
FROM
(
    SELECT if(tip_amount=0, 0,
        if(tip_amount>0 and tip_amount<=5, 1,
        if(tip_amount>5 and tip_amount<=10, 2,
        if(tip_amount>10 and tip_amount<=20, 3, 4)))) as tip_class, tip_amount
    FROM nyctaxidb.fare
)tc
GROUP BY tip_class;

Run the following command from the Hadoop command-line console:

hive -f "C:\temp\sample_hive_tip_range_frequencies.hql"

Exploration: Compute the direct distance between two longitude-latitude locations

Note

This is typically a data scientist task.

You might want to know whether there is a difference between the direct distance between two locations and the actual trip distance of the taxi. A passenger might be less likely to tip if they figure out that the driver has intentionally taken them by a longer route.

To compare the actual trip distance with the Haversine distance between two longitude-latitude points (the "great circle" distance), you can use the trigonometric functions available within Hive:

set R=3959;
set pi=radians(180);

insert overwrite directory 'wasb:///queryoutputdir'

select pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, trip_distance, trip_time_in_secs,
${hiveconf:R}*2*2*atan((1-sqrt(1-pow(sin((dropoff_latitude-pickup_latitude)
 *${hiveconf:pi}/180/2),2)-cos(pickup_latitude*${hiveconf:pi}/180)
 *cos(dropoff_latitude*${hiveconf:pi}/180)*pow(sin((dropoff_longitude-pickup_longitude)*${hiveconf:pi}/180/2),2)))
 /sqrt(pow(sin((dropoff_latitude-pickup_latitude)*${hiveconf:pi}/180/2),2)
 +cos(pickup_latitude*${hiveconf:pi}/180)*cos(dropoff_latitude*${hiveconf:pi}/180)*
 pow(sin((dropoff_longitude-pickup_longitude)*${hiveconf:pi}/180/2),2))) as direct_distance
from nyctaxidb.trip
where month=1
and pickup_longitude between -90 and -30
and pickup_latitude between 30 and 90
and dropoff_longitude between -90 and -30
and dropoff_latitude between 30 and 90;

In the preceding query, R is the radius of the Earth in miles, and pi is converted to radians. The longitude-latitude points are filtered to remove values that are far from the NYC area.
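
For reference, the expression is an algebraic rearrangement of the standard haversine (great-circle) formula. With h denoting the bracketed haversine term, sqrt(h) = sin(c/2) and sqrt(1-h) = cos(c/2) for central angle c, so the nested atan term, atan((1-cos(c/2))/sin(c/2)), evaluates to c/4 by the half-angle identity, and multiplying by 2*2*R yields the arc length. With angles in radians, the equivalent standard form is:

direct_distance = R*c = 2*R*arcsin( sqrt( sin^2((lat2-lat1)/2) + cos(lat1)*cos(lat2)*sin^2((lon2-lon1)/2) ) )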

In this case, we write the results to a directory called queryoutputdir. The following sequence of commands first creates this output directory, and then runs the Hive command.

From the Hive directory prompt, run:

hdfs dfs -mkdir wasb:///queryoutputdir

hive -f "C:\temp\sample_hive_trip_direct_distance.hql"

The query results are written to nine Azure blobs (queryoutputdir/000000_0 to queryoutputdir/000008_0), under the default container of the Hadoop cluster.

To see the sizes of the individual blobs, run the following command from the Hive directory prompt:

hdfs dfs -ls wasb:///queryoutputdir

To see the contents of a given file, say 000000_0, use Hadoop's copyToLocal command:

hdfs dfs -copyToLocal wasb:///queryoutputdir/000000_0 C:\temp\tempfile

Warning

copyToLocal can be very slow for large files, and is not recommended for use with them.
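
To take a quick look at the beginning of a blob without copying the whole file, you can stream it instead:

hdfs dfs -cat wasb:///queryoutputdir/000000_0 | more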

A key advantage of having this data reside in an Azure blob is that we can explore the data within Machine Learning by using the Import Data module.

Down-sample data and build models in Machine Learning

Note

This is typically a data scientist task.

After the exploratory data analysis phase, we are now ready to down-sample the data for building models in Machine Learning. In this section, we show how to use a Hive query to down-sample the data. Machine Learning then accesses it from the Import Data module.

Down-sample the data

There are two steps in this procedure. First, we join the nyctaxidb.trip and nyctaxidb.fare tables on three keys that are present in all records: medallion, hack_license, and pickup_datetime. We then generate a binary classification label, tipped, and a multiclass classification label, tip_class.

To be able to use the down-sampled data directly from the Import Data module in Machine Learning, you should store the results of the preceding query in an internal Hive table. In what follows, we create an internal Hive table and populate its contents with the joined and down-sampled data.

The query applies standard Hive functions directly to generate the following time parameters from the pickup_datetime field:

  • hour of day
  • week of year
  • weekday ('1' stands for Monday, and '7' stands for Sunday)

The query also generates the direct distance between the pick-up and drop-off locations. For a complete list of such functions, see LanguageManual UDF.

The query then down-samples the data so that the query results can fit into Azure Machine Learning Studio. Only about 1 percent of the original dataset is imported into the studio.

Here are the contents of the sample_hive_prepare_for_aml_full.hql file, which prepares the data for model building in Machine Learning:

set R = 3959;
set pi=radians(180);

create table if not exists nyctaxidb.nyctaxi_downsampled_dataset (

medallion string,
hack_license string,
vendor_id string,
rate_code string,
store_and_fwd_flag string,
pickup_datetime string,
dropoff_datetime string,
pickup_hour string,
pickup_week string,
weekday string,
passenger_count int,
trip_time_in_secs double,
trip_distance double,
pickup_longitude double,
pickup_latitude double,
dropoff_longitude double,
dropoff_latitude double,
direct_distance double,
payment_type string,
fare_amount double,
surcharge double,
mta_tax double,
tip_amount double,
tolls_amount double,
total_amount double,
tipped string,
tip_class string
)
row format delimited fields terminated by ','
lines terminated by '\n'
stored as textfile;

--- now insert contents of the join into the above internal table

insert overwrite table nyctaxidb.nyctaxi_downsampled_dataset
select
t.medallion,
t.hack_license,
t.vendor_id,
t.rate_code,
t.store_and_fwd_flag,
t.pickup_datetime,
t.dropoff_datetime,
hour(t.pickup_datetime) as pickup_hour,
weekofyear(t.pickup_datetime) as pickup_week,
from_unixtime(unix_timestamp(t.pickup_datetime, 'yyyy-MM-dd HH:mm:ss'),'u') as weekday,
t.passenger_count,
t.trip_time_in_secs,
t.trip_distance,
t.pickup_longitude,
t.pickup_latitude,
t.dropoff_longitude,
t.dropoff_latitude,
t.direct_distance,
f.payment_type,
f.fare_amount,
f.surcharge,
f.mta_tax,
f.tip_amount,
f.tolls_amount,
f.total_amount,
if(tip_amount>0,1,0) as tipped,
if(tip_amount=0,0,
if(tip_amount>0 and tip_amount<=5,1,
if(tip_amount>5 and tip_amount<=10,2,
if(tip_amount>10 and tip_amount<=20,3,4)))) as tip_class

from
(
select
medallion,
hack_license,
vendor_id,
rate_code,
store_and_fwd_flag,
pickup_datetime,
dropoff_datetime,
passenger_count,
trip_time_in_secs,
trip_distance,
pickup_longitude,
pickup_latitude,
dropoff_longitude,
dropoff_latitude,
${hiveconf:R}*2*2*atan((1-sqrt(1-pow(sin((dropoff_latitude-pickup_latitude)
*${hiveconf:pi}/180/2),2)-cos(pickup_latitude*${hiveconf:pi}/180)
*cos(dropoff_latitude*${hiveconf:pi}/180)*pow(sin((dropoff_longitude-pickup_longitude)*${hiveconf:pi}/180/2),2)))
/sqrt(pow(sin((dropoff_latitude-pickup_latitude)*${hiveconf:pi}/180/2),2)
+cos(pickup_latitude*${hiveconf:pi}/180)*cos(dropoff_latitude*${hiveconf:pi}/180)*pow(sin((dropoff_longitude-pickup_longitude)*${hiveconf:pi}/180/2),2))) as direct_distance,
rand() as sample_key

from nyctaxidb.trip
where pickup_latitude between 30 and 90
    and pickup_longitude between -90 and -30
    and dropoff_latitude between 30 and 90
    and dropoff_longitude between -90 and -30
)t
join
(
select
medallion,
hack_license,
vendor_id,
pickup_datetime,
payment_type,
fare_amount,
surcharge,
mta_tax,
tip_amount,
tolls_amount,
total_amount
from nyctaxidb.fare
)f
on t.medallion=f.medallion and t.hack_license=f.hack_license and t.pickup_datetime=f.pickup_datetime
where t.sample_key<=0.01;

To run this query from the Hive directory prompt:

hive -f "C:\temp\sample_hive_prepare_for_aml_full.hql"

We now have an internal table, nyctaxidb.nyctaxi_downsampled_dataset, which can be accessed by using the Import Data module from Machine Learning. Furthermore, we can use this dataset for building Machine Learning models.
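
Before switching to the studio, it is worth confirming that the sampling behaved as expected. Because rand() sampling is approximate, expect roughly 1 percent of the ~173 million joined records, on the order of 1.7 million rows:

hive -e "select count(*) from nyctaxidb.nyctaxi_downsampled_dataset;"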

Use the Import Data module in Machine Learning to access the down-sampled data

To issue Hive queries in the Import Data module of Machine Learning, you need access to a Machine Learning workspace. You also need access to the credentials of the cluster and its associated storage account.

Here are some details about the Import Data module and the parameters to input:

HCatalog server URI: If the cluster name is abc123, then use https://abc123.azurehdinsight.cn.

Hadoop user account name: The user name chosen for the cluster (not the remote access user name).

Hadoop user account password: The password chosen for the cluster (not the remote access password).

Location of output data: Chosen to be Azure.

Azure Storage account name: The name of the default storage account associated with the cluster.

Azure container name: The default container name for the cluster, typically the same as the cluster name. For a cluster called abc123, the name is abc123.

Important

Any table that we wish to query by using the Import Data module in Machine Learning must be an internal table.

Here is how to determine whether a table T in a database D.db is an internal table. From the Hive directory prompt, run the following command:

hdfs dfs -ls wasb:///D.db/T

If the table is an internal table and it is populated, its contents show here.

Another way to determine whether a table is an internal table is to use Azure Storage Explorer. Use it to navigate to the default container name of the cluster, and then filter by the table name. If the table and its contents show up, this confirms that it is an internal table.

Here is a screenshot of the Hive query and the Import Data module:

Screenshot of the Hive query in the Import Data module

Because our down-sampled data resides in the default container, the resulting Hive query from Machine Learning is simple. It is just SELECT * FROM nyctaxidb.nyctaxi_downsampled_dataset.

The dataset can now be used as the starting point for building Machine Learning models.

Build models in Machine Learning

You can now proceed to model building and model deployment in Machine Learning. The data is ready for us to use in addressing the prediction problems identified earlier:

  • Binary classification: To predict whether or not a tip was paid for a trip.

    Learner used: Two-class logistic regression

    a. For this problem, the target (or class) label is tipped. The original down-sampled dataset has a few columns that are target leaks for this classification experiment. In particular, tip_class, tip_amount, and total_amount reveal information about the target label that is not available at testing time. We remove these columns from consideration by using the Select Columns in Dataset module.

    The following diagram shows our experiment to predict whether or not a tip was paid for a given trip:

    Diagram of the experiment to predict whether a tip was paid

    b. For this experiment, our target label distributions were roughly 1:1.

    The following chart shows the distribution of tip class labels for the binary classification problem:

    Chart of the distribution of tip class labels

    As a result, we obtain an area under the curve (AUC) of 0.987, as shown in the following figure:

    Chart of the AUC value

  • Multiclass classification: To predict the range of tip amounts paid for the trip, by using the previously defined classes.

    Learner used: Multiclass logistic regression

    a. For this problem, our target (or class) label is tip_class, which can take one of five values (0, 1, 2, 3, 4). As in the binary classification case, we have a few columns that are target leaks for this experiment. In particular, tipped, tip_amount, and total_amount reveal information about the target label that is not available at testing time. We remove these columns by using the Select Columns in Dataset module.

    The following diagram shows the experiment to predict in which bin a tip is likely to fall. The bins are: Class 0: tip = $0; Class 1: tip > $0 and tip <= $5; Class 2: tip > $5 and tip <= $10; Class 3: tip > $10 and tip <= $20; and Class 4: tip > $20.

    Diagram of the experiment to predict the tip bin

    We now show what the actual test class distribution looks like. Class 0 and Class 1 are prevalent, and the other classes are rare.

    Chart of the test class distribution

    b. For this experiment, we use a confusion matrix to look at the prediction accuracies, as shown here:

    Confusion matrix

    While the class accuracies on the prevalent classes are good, the model does not do a good job of "learning" the rarer classes.

  • Regression task: To predict the amount of the tip paid for a trip.

    Learner used: Boosted decision tree

    a. For this problem, the target (or class) label is tip_amount. The target leaks in this case are tipped, tip_class, and total_amount; all these variables reveal information about the tip amount that is typically unavailable at testing time. We remove these columns by using the Select Columns in Dataset module.

    The following diagram shows the experiment to predict the amount of the given tip:

    Diagram of the experiment to predict the tip amount

    b. For regression problems, we measure the accuracy of the prediction by looking at the squared error in the predictions and the coefficient of determination:

    Screenshot of the prediction statistics

    Here, the coefficient of determination is 0.709, implying that about 71 percent of the variance is explained by the model coefficients.

Important

To learn more about Machine Learning and how to access and use it, see What's Machine Learning. In addition, the Azure AI Gallery covers a gamut of experiments and provides a thorough introduction to the range of capabilities of Machine Learning.

License information

This sample walkthrough and its accompanying scripts are shared by Microsoft under the MIT license. For more information, see the LICENSE.txt file in the directory of the sample code on GitHub.

References

Andrés Monroy NYC Taxi Trips Download Page
FOILing NYC's Taxi Trip Data by Chris Whong
NYC Taxi and Limousine Commission Research and Statistics