The Team Data Science Process in action: using SQL Server

In this tutorial, you walk through the process of building and deploying a machine learning model using SQL Server and a publicly available dataset -- the NYC Taxi Trips dataset. The procedure follows a standard data science workflow: ingest and explore the data, engineer features to facilitate learning, then build and deploy a model.

NYC Taxi Trips Dataset Description

The NYC Taxi Trips data is about 20 GB of compressed CSV files (~48 GB uncompressed), comprising more than 173 million individual trips and the fares paid for each trip. Each trip record includes the pickup and dropoff location and time, the anonymized hack (driver's) license number, and the medallion (taxi's unique ID) number. The data covers all trips in the year 2013 and is provided in the following two datasets for each month:

  1. The 'trip_data' CSV contains trip details, such as the number of passengers, pickup and dropoff points, trip duration, and trip length. Here are a few sample records:

     medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
     89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868
    
  2. The 'trip_fare' CSV contains details of the fare paid for each trip, such as payment type, fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here are a few sample records:

     medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
     89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5
    

The unique key for joining trip_data and trip_fare is composed of the fields medallion, hack_license, and pickup_datetime.
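As an aside, the same composite-key join can be sketched in pandas (a minimal illustration with made-up records; column names follow the CSV headers above):

```python
import pandas as pd

# Made-up records sharing the composite key (medallion, hack_license, pickup_datetime)
trip_data = pd.DataFrame({
    "medallion": ["89D2"], "hack_license": ["BA96"],
    "pickup_datetime": ["2013-01-01 15:11:48"], "trip_distance": [1.00],
})
trip_fare = pd.DataFrame({
    "medallion": ["89D2"], "hack_license": ["BA96"],
    "pickup_datetime": ["2013-01-01 15:11:48"], "tip_amount": [0.0],
})

# Inner join on the composite key, mirroring the SQL joins used later in this tutorial
trips = trip_data.merge(trip_fare, on=["medallion", "hack_license", "pickup_datetime"])
print(trips.shape)  # (1, 5): one matched trip with columns from both files
```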

Examples of Prediction Tasks

We will formulate three prediction problems based on the tip_amount, namely:

  1. Binary classification: Predict whether or not a tip was paid for a trip; that is, a tip_amount greater than $0 is a positive example, while a tip_amount of $0 is a negative example.

  2. Multiclass classification: Predict the range of tip paid for the trip. We divide the tip_amount into five bins or classes:

     Class 0 : tip_amount = $0
     Class 1 : tip_amount > $0 and tip_amount <= $5
     Class 2 : tip_amount > $5 and tip_amount <= $10
     Class 3 : tip_amount > $10 and tip_amount <= $20
     Class 4 : tip_amount > $20
    
  3. Regression task: Predict the amount of tip paid for a trip.
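The two classification label definitions above can be sketched as small Python helpers (illustrative only; the tutorial generates these labels later with SQL CASE expressions):

```python
def tipped(tip_amount):
    """Binary label: 1 if any tip was paid, else 0."""
    return 1 if tip_amount > 0 else 0

def tip_class(tip_amount):
    """Multiclass label: map the tip amount into the five classes above."""
    if tip_amount == 0:
        return 0
    elif tip_amount <= 5:
        return 1
    elif tip_amount <= 10:
        return 2
    elif tip_amount <= 20:
        return 3
    return 4

print([tip_class(t) for t in (0, 3.5, 10, 20.5)])  # [0, 1, 2, 4]
```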

Setting Up the Azure data science environment for advanced analytics

As you can see from the Plan Your Environment guide, there are several options for working with the NYC Taxi Trips dataset in Azure:

  • Work with the data in Azure blobs, then model in Azure Machine Learning
  • Load the data into a SQL Server database, then model in Azure Machine Learning

In this tutorial, we will demonstrate parallel bulk import of the data into SQL Server, data exploration, feature engineering, and down-sampling, using SQL Server Management Studio as well as IPython Notebook. Sample scripts and IPython notebooks are shared on GitHub. A sample IPython notebook for working with the data in Azure blobs is also available in the same location.

To set up your Azure Data Science environment:

  1. Create a storage account

  2. Create an Azure Machine Learning workspace

  3. Provision a Data Science Virtual Machine, which provides a SQL Server and an IPython Notebook server.

    Note

    The sample scripts and IPython notebooks are downloaded to your Data Science Virtual Machine during the setup process. When the VM post-installation script completes, the samples will be in your VM's Documents library:

    • Sample Scripts: C:\Users\<user_name>\Documents\Data Science Scripts
    • Sample IPython Notebooks: C:\Users\<user_name>\Documents\IPython Notebooks\DataScienceSamples
      where <user_name> is your VM's Windows login name. We will refer to these sample folders as Sample Scripts and Sample IPython Notebooks.

Based on the dataset size, data source location, and the selected Azure target environment, this scenario is similar to Scenario #5: Large dataset in local files, target SQL Server in an Azure VM.

Get the Data from a Public Source

To get the NYC Taxi Trips dataset from its public location, you can use any of the methods described in Move Data to and from Azure Blob Storage to copy the data to your new virtual machine.

To copy the data using AzCopy:

  1. Log in to your virtual machine (VM).

  2. Create a new directory on the VM's data disk. (Note: do not use the temporary disk that comes with the VM as a data disk.)

  3. In a Command Prompt window, run the following AzCopy command line, replacing <path_to_data_folder> with the data folder created in step 2:

     "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy" /Source:https://nyctaxitrips.blob.core.chinacloudapi.cn/data /Dest:<path_to_data_folder> /S
    

    When AzCopy completes, there should be a total of 24 zipped CSV files (12 for trip_data and 12 for trip_fare) in the data folder.

  4. Unzip the downloaded files and note the folder where the uncompressed files reside. This folder will be referred to as <path_to_data_files>.

Bulk Import Data into SQL Server Database

The performance of loading/transferring large amounts of data into a SQL database, and of subsequent queries, can be improved by using Partitioned Tables and Views. In this section, we will follow the instructions described in Parallel Bulk Data Import Using SQL Partition Tables to create a new database and load the data into partitioned tables in parallel.

  1. While logged in to your VM, start SQL Server Management Studio.

  2. Connect using Windows Authentication.

    SSMS connection

  3. If you have not yet changed the SQL Server authentication mode and created a new SQL login user, open the script file named change_auth.sql in the Sample Scripts folder. Change the default user name and password, then click Execute in the toolbar to run the script.

    Execute script

  4. Verify and/or change the SQL Server default database and log folders to ensure that newly created databases are stored on a data disk. The SQL Server VM image that is optimized for data-warehousing loads is pre-configured with data and log disks. If your VM did not include a data disk and you added new virtual hard disks during the VM setup process, change the default folders as follows:

    • Right-click the SQL Server name in the left panel and click Properties.

      SQL Server properties

    • Select Database Settings from the Select a page list on the left.

    • Verify and/or change the Database default locations to the data disk locations of your choice. This is where new databases reside when created with the default settings.

      SQL database defaults

  5. To create a new database and a set of filegroups to hold the partitioned tables, open the sample script create_db_default.sql. The script creates a new database named TaxiNYC and 12 filegroups in the default data location. Each filegroup will hold one month of trip_data and trip_fare data. Modify the database name if desired, then click Execute to run the script.

  6. Next, create two partitioned tables, one for the trip_data and another for the trip_fare. Open the sample script create_partitioned_table.sql, which will:

    • Create a partition function to split the data by month.

    • Create a partition scheme to map each month's data to a different filegroup.

    • Create two partitioned tables mapped to the partition scheme: nyctaxi_trip will hold the trip_data and nyctaxi_fare will hold the trip_fare data.

      Click Execute to run the script and create the partitioned tables.

  7. In the Sample Scripts folder, two sample PowerShell scripts are provided to demonstrate parallel bulk imports of data into SQL Server tables.

    • bcp_parallel_generic.ps1 is a generic script for parallel bulk importing data into a table. Modify this script to set the input and target variables, as indicated in the comment lines in the script.
    • bcp_parallel_nyctaxi.ps1 is a pre-configured version of the generic script and can be used to load both tables for the NYC Taxi Trips data.
  8. Right-click the bcp_parallel_nyctaxi.ps1 script name and click Edit to open it in PowerShell. Review the preset variables and modify them according to your selected database name, input data folder, target log folder, and the paths to the sample format files nyctaxi_trip.xml and nyctaxi_fare.xml (provided in the Sample Scripts folder).

    Bulk import data

    You may also select the authentication mode; the default is Windows Authentication. Click the green arrow in the toolbar to run. The script launches 24 bulk import operations in parallel, 12 for each partitioned table. You can monitor the data import progress by opening the SQL Server default data folder, as set above.

  9. The PowerShell script reports the starting and ending times; when all bulk imports complete, the ending time is reported. Check the target log folder to verify that the bulk imports were successful, that is, that no errors were reported in the target log folder.

  10. Your database is now ready for exploration, feature engineering, and other operations as desired. Since the tables are partitioned according to the pickup_datetime field, queries that include pickup_datetime conditions in the WHERE clause will benefit from the partition scheme.

  11. In SQL Server Management Studio, explore the provided sample script sample_queries.sql. To run any of the sample queries, highlight the query lines, then click Execute in the toolbar.

  12. The NYC Taxi Trips data is loaded into two separate tables. To improve join operations, it is highly recommended to index the tables. The sample script create_partitioned_index.sql creates partitioned indexes on the composite join key: medallion, hack_license, and pickup_datetime.

Data Exploration and Feature Engineering in SQL Server

In this section, we perform data exploration and feature generation by running SQL queries directly in SQL Server Management Studio, using the SQL Server database created earlier. A sample script named sample_queries.sql is provided in the Sample Scripts folder. Modify the script to change the database name if it differs from the default, TaxiNYC.

In this exercise, we will:

  • Connect to SQL Server Management Studio using either Windows Authentication or SQL Authentication with the SQL login name and password.
  • Explore data distributions of a few fields over varying time windows.
  • Investigate the data quality of the longitude and latitude fields.
  • Generate binary and multiclass classification labels based on the tip_amount.
  • Generate features and compute/compare trip distances.
  • Join the two tables and extract a random sample that will be used to build models.

When you are ready to proceed to Azure Machine Learning, you may either:

  1. Save the final SQL query to extract and sample the data, and copy-paste the query directly into an Import Data module in Azure Machine Learning, or
  2. Persist the sampled and engineered data you plan to use for model building in a new database table, and use the new table in the Import Data module in Azure Machine Learning.

In this section, we will save the final query to extract and sample the data. The second method is demonstrated in the Data Exploration and Feature Engineering in IPython Notebook section.

For a quick verification of the number of rows and columns in the tables populated earlier using parallel bulk import:

-- Report number of rows in table nyctaxi_trip without table scan
SELECT SUM(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('nyctaxi_trip')

-- Report number of columns in table nyctaxi_trip
SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'nyctaxi_trip'

Exploration: Trip distribution by medallion

This example identifies the medallions (taxi numbers) with more than 100 trips within a given time period. The query benefits from partitioned table access, since it is conditioned by the partition scheme of pickup_datetime. Querying the full dataset will also make use of the partitioned table and/or index scan.

SELECT medallion, COUNT(*)
FROM nyctaxi_fare
WHERE pickup_datetime BETWEEN '20130101' AND '20130331'
GROUP BY medallion
HAVING COUNT(*) > 100

Exploration: Trip distribution by medallion and hack_license

SELECT medallion, hack_license, COUNT(*)
FROM nyctaxi_fare
WHERE pickup_datetime BETWEEN '20130101' AND '20130131'
GROUP BY medallion, hack_license
HAVING COUNT(*) > 100

Data Quality Assessment: Verify records with incorrect longitude and/or latitude

This example investigates whether any of the longitude and/or latitude fields either contain an invalid value (degrees should be between -90 and 90), or have (0, 0) coordinates.

SELECT COUNT(*) FROM nyctaxi_trip
WHERE pickup_datetime BETWEEN '20130101' AND '20130331'
AND  (CAST(pickup_longitude AS float) NOT BETWEEN -90 AND 90
OR    CAST(pickup_latitude AS float) NOT BETWEEN -90 AND 90
OR    CAST(dropoff_longitude AS float) NOT BETWEEN -90 AND 90
OR    CAST(dropoff_latitude AS float) NOT BETWEEN -90 AND 90
OR    (pickup_longitude = '0' AND pickup_latitude = '0')
OR    (dropoff_longitude = '0' AND dropoff_latitude = '0'))
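The same validity rules can be sketched for a single record in Python (assuming the fields arrive as strings, as in the raw CSV):

```python
def has_valid_coordinates(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat):
    """Return True if all coordinates parse as floats, fall within [-90, 90],
    and neither endpoint is the (0, 0) placeholder."""
    try:
        plon, plat = float(pickup_lon), float(pickup_lat)
        dlon, dlat = float(dropoff_lon), float(dropoff_lat)
    except ValueError:
        return False
    in_range = all(-90 <= v <= 90 for v in (plon, plat, dlon, dlat))
    not_origin = (plon, plat) != (0.0, 0.0) and (dlon, dlat) != (0.0, 0.0)
    return in_range and not_origin

print(has_valid_coordinates("-73.978165", "40.757977", "-73.989838", "40.751171"))  # True
print(has_valid_coordinates("0", "0", "-73.989838", "40.751171"))                   # False
```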

Exploration: Tipped vs. Not Tipped Trips distribution

This example finds the number of trips that were tipped vs. not tipped in a given time period (or in the full dataset, if covering the full year). This distribution reflects the binary label distribution to be used later for binary classification modeling.

SELECT tipped, COUNT(*) AS tip_freq FROM (
  SELECT CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END AS tipped, tip_amount
  FROM nyctaxi_fare
  WHERE pickup_datetime BETWEEN '20130101' AND '20131231') tc
GROUP BY tipped

Exploration: Tip Class/Range Distribution

This example computes the distribution of tip ranges in a given time period (or in the full dataset, if covering the full year). This distribution of the label classes will be used later for multiclass classification modeling.

SELECT tip_class, COUNT(*) AS tip_freq FROM (
    SELECT CASE
        WHEN (tip_amount = 0) THEN 0
        WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
        WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
        WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
        ELSE 4
    END AS tip_class
FROM nyctaxi_fare
WHERE pickup_datetime BETWEEN '20130101' AND '20131231') tc
GROUP BY tip_class

Exploration: Compute and Compare Trip Distance

This example converts the pickup and dropoff longitudes and latitudes to SQL geography points, computes the trip distance using the SQL geography point difference, and returns a random sample of the results for comparison. The example limits the results to valid coordinates only, using the data quality assessment query covered earlier.

SELECT
pickup_location=geography::STPointFromText('POINT(' + pickup_longitude + ' ' + pickup_latitude + ')', 4326)
,dropoff_location=geography::STPointFromText('POINT(' + dropoff_longitude + ' ' + dropoff_latitude + ')', 4326)
,trip_distance
,computedist=round(geography::STPointFromText('POINT(' + pickup_longitude + ' ' + pickup_latitude + ')', 4326).STDistance(geography::STPointFromText('POINT(' + dropoff_longitude + ' ' + dropoff_latitude + ')', 4326))/1000, 2)
FROM nyctaxi_trip
tablesample(0.01 percent)
WHERE CAST(pickup_latitude AS float) BETWEEN -90 AND 90
AND   CAST(dropoff_latitude AS float) BETWEEN -90 AND 90
AND   pickup_longitude != '0' AND dropoff_longitude != '0'
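Outside the database, the same straight-line distance can be approximated in Python with the haversine formula (a sketch; the SQL geography type above uses a more precise ellipsoidal model, so small differences are expected):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in kilometers between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# First sample trip from the trip_data records above (reported road distance: 1.00 mile)
d = haversine_km(-73.978165, 40.757977, -73.989838, 40.751171)
print(round(d, 2), 'km straight-line')
```

As expected, the straight-line distance is shorter than the metered road distance.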

Feature Engineering in SQL Queries

The label generation and geography conversion exploration queries can also be used to generate labels/features, by removing the counting part. Additional feature engineering SQL examples are provided in the Data Exploration and Feature Engineering in IPython Notebook section. It is more efficient to run the feature generation queries on the full dataset, or a large subset of it, using SQL queries that run directly on the SQL Server database instance. The queries may be executed in SQL Server Management Studio, IPython Notebook, or any development tool or environment that can access the database locally or remotely.
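For instance, time-of-day features are easy to derive from pickup_datetime, either with DATEPART in SQL or in pandas; a small sketch (the derived feature names are hypothetical):

```python
import pandas as pd

# Pickup timestamps taken from the sample records above
df = pd.DataFrame({"pickup_datetime": ["2013-01-01 15:11:48", "2013-01-06 00:18:35"]})
ts = pd.to_datetime(df["pickup_datetime"])

# Hypothetical derived features: hour of day and day of week
df["pickup_hour"] = ts.dt.hour
df["pickup_weekday"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6

print(df[["pickup_hour", "pickup_weekday"]].values.tolist())  # [[15, 1], [0, 6]]
```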

Preparing Data for Model Building

The following query joins the nyctaxi_trip and nyctaxi_fare tables, generates a binary classification label tipped and a multiclass classification label tip_class, and extracts a 1% random sample from the full joined dataset. This query can be copied and pasted directly into the Azure Machine Learning Studio Import Data module for direct data ingestion from the SQL Server database instance in Azure. The query excludes records with incorrect (0, 0) coordinates.

SELECT t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax, f.tolls_amount,     f.total_amount, f.tip_amount,
    CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END AS tipped,
    CASE WHEN (tip_amount = 0) THEN 0
        WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
        WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
        WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
        ELSE 4
    END AS tip_class
FROM nyctaxi_trip t, nyctaxi_fare f
TABLESAMPLE (1 percent)
WHERE t.medallion = f.medallion
AND   t.hack_license = f.hack_license
AND   t.pickup_datetime = f.pickup_datetime
AND   pickup_longitude != '0' AND dropoff_longitude != '0'

Data Exploration and Feature Engineering in IPython Notebook

In this section, we perform data exploration and feature generation using both Python and SQL queries against the SQL Server database created earlier. A sample IPython notebook named machine-Learning-data-science-process-sql-story.ipynb is provided in the Sample IPython Notebooks folder. This notebook is also available on GitHub.

When working with big data, follow this recommended sequence:

  • Read a small sample of the data into an in-memory data frame.
  • Perform some visualizations and explorations using the sampled data.
  • Experiment with feature engineering using the sampled data.
  • For larger data exploration, data manipulation, and feature engineering, use Python to issue SQL queries directly against the SQL Server database in the Azure VM.
  • Decide the sample size to use for Azure Machine Learning model building.

When ready to proceed to Azure Machine Learning, you may either:

  1. Save the final SQL query to extract and sample the data, and copy-paste the query directly into an Import Data module in Azure Machine Learning. This method is demonstrated in the Building Models in Azure Machine Learning section.
  2. Persist the sampled and engineered data you plan to use for model building in a new database table, then use the new table in the Import Data module.

The following are a few data exploration, data visualization, and feature engineering examples. For more examples, see the sample SQL IPython notebook in the Sample IPython Notebooks folder.

Initialize Database Credentials

Initialize your database connection settings in the following variables:

SERVER_NAME = '<server name>'
DATABASE_NAME = '<database name>'
USERID = '<user name>'
PASSWORD = '<password>'
DB_DRIVER = '<database driver>'

Create Database Connection

import pyodbc

CONNECTION_STRING = 'DRIVER={' + DB_DRIVER + '};SERVER=' + SERVER_NAME + ';DATABASE=' + DATABASE_NAME + ';UID=' + USERID + ';PWD=' + PASSWORD
conn = pyodbc.connect(CONNECTION_STRING)

Report the number of rows and columns in table nyctaxi_trip

import pandas as pd

nrows = pd.read_sql('''
    SELECT SUM(rows) FROM sys.partitions
    WHERE object_id = OBJECT_ID('nyctaxi_trip')
''', conn)

print('Total number of rows = %d' % nrows.iloc[0, 0])

ncols = pd.read_sql('''
    SELECT COUNT(*) FROM information_schema.columns
    WHERE table_name = 'nyctaxi_trip'
''', conn)

print('Total number of columns = %d' % ncols.iloc[0, 0])
  • Total number of rows = 173179759
  • Total number of columns = 14

Read in a small data sample from the SQL Server Database

import time

t0 = time.time()

query = '''
    SELECT t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax,
        f.tolls_amount, f.total_amount, f.tip_amount
    FROM nyctaxi_trip t, nyctaxi_fare f
    TABLESAMPLE (0.05 PERCENT)
    WHERE t.medallion = f.medallion
    AND   t.hack_license = f.hack_license
    AND   t.pickup_datetime = f.pickup_datetime
'''

df1 = pd.read_sql(query, conn)

t1 = time.time()
print('Time to read the sample table is %f seconds' % (t1 - t0))
print('Number of rows and columns retrieved = (%d, %d)' % (df1.shape[0], df1.shape[1]))

Time to read the sample table is 6.492000 seconds
Number of rows and columns retrieved = (84952, 21)

Descriptive Statistics

We are now ready to explore the sampled data. We start by looking at descriptive statistics for the trip_distance (or any other) field:

df1['trip_distance'].describe()

Visualization: Box Plot Example

Next, we look at the box plot for the trip distance to visualize the quantiles.

df1.boxplot(column='trip_distance',return_type='dict')

Plot #1

Visualization: Distribution Plot Example

import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
df1['trip_distance'].plot(ax=ax1,kind='kde', style='b-')
df1['trip_distance'].hist(ax=ax2, bins=100, color='k')

Plot #2

Visualization: Bar and Line Plots

In this example, we bin the trip distance into five bins and visualize the binning results.

trip_dist_bins = [0, 1, 2, 4, 10, 1000]
trip_dist_bin_id = pd.cut(df1['trip_distance'], trip_dist_bins)
trip_dist_bin_id

We can plot the above bin distribution in a bar or line plot, as below.

pd.Series(trip_dist_bin_id).value_counts().plot(kind='bar')

Plot #3

pd.Series(trip_dist_bin_id).value_counts().plot(kind='line')

Plot #4

Visualization: Scatterplot Example

We show a scatter plot between trip_time_in_secs and trip_distance to see whether there is any correlation.

plt.scatter(df1['trip_time_in_secs'], df1['trip_distance'])

Plot #6

Similarly, we can check the relationship between passenger_count and trip_distance.

plt.scatter(df1['passenger_count'], df1['trip_distance'])

Plot #8

Sub-Sampling the Data in SQL

When preparing data for model building in Azure Machine Learning Studio, you may either decide on the SQL query to use directly in the Import Data module, or persist the engineered and sampled data in a new table, which you could then use in the Import Data module with a simple SELECT * FROM <your_new_table_name>.

In this section, we will create a new table to hold the sampled and engineered data. An example of a direct SQL query for model building is provided in the Data Exploration and Feature Engineering in SQL Server section.

Create a Sample Table and Populate it with 1% of the Joined Tables. Drop the Table First if it Exists.

In this section, we join the tables nyctaxi_trip and nyctaxi_fare, extract a 1% random sample, and persist the sampled data in a new table named nyctaxi_one_percent:

cursor = conn.cursor()

drop_table_if_exists = '''
    IF OBJECT_ID('nyctaxi_one_percent', 'U') IS NOT NULL DROP TABLE nyctaxi_one_percent
'''

nyctaxi_one_percent_insert = '''
    SELECT t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax, f.tolls_amount, f.total_amount, f.tip_amount
    INTO nyctaxi_one_percent
    FROM nyctaxi_trip t, nyctaxi_fare f
    TABLESAMPLE (1 PERCENT)
    WHERE t.medallion = f.medallion
    AND   t.hack_license = f.hack_license
    AND   t.pickup_datetime = f.pickup_datetime
    AND   pickup_longitude <> '0' AND dropoff_longitude <> '0'
'''

cursor.execute(drop_table_if_exists)
cursor.execute(nyctaxi_one_percent_insert)
cursor.commit()

Data Exploration using SQL Queries in IPython Notebook

In this section, we explore data distributions using the 1% sampled data persisted in the new table created above. Note that similar explorations can be performed using the original tables, optionally using TABLESAMPLE to limit the exploration sample, or by limiting the results to a given time period using the pickup_datetime partitions, as illustrated in the Data Exploration and Feature Engineering in SQL Server section.

Exploration: Daily distribution of trips

query = '''
    SELECT CONVERT(date, dropoff_datetime) AS date, COUNT(*) AS c
    FROM nyctaxi_one_percent
    GROUP BY CONVERT(date, dropoff_datetime)
'''

pd.read_sql(query, conn)

Exploration: Trip distribution per medallion

query = '''
    SELECT medallion, COUNT(*) AS c
    FROM nyctaxi_one_percent
    GROUP BY medallion
'''

pd.read_sql(query, conn)

Feature Generation Using SQL Queries in IPython Notebook

In this section, we generate new labels and features directly with SQL queries, operating on the 1% sample table created in the previous section.

Label Generation: Generate Class Labels

In the following example, we generate two sets of labels to use for modeling:

  1. Binary class labels tipped (predicting whether a tip will be given)

  2. Multiclass labels tip_class (predicting the tip bin or range)

     nyctaxi_one_percent_add_col = '''
         ALTER TABLE nyctaxi_one_percent ADD tipped bit, tip_class int
     '''
    
     cursor.execute(nyctaxi_one_percent_add_col)
     cursor.commit()
    
     nyctaxi_one_percent_update_col = '''
         UPDATE nyctaxi_one_percent
         SET
            tipped = CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END,
            tip_class = CASE WHEN (tip_amount = 0) THEN 0
                             WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
                             WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
                             WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
                             ELSE 4
                         END
     '''
    
     cursor.execute(nyctaxi_one_percent_update_col)
     cursor.commit()
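As a sanity check on the CASE cutoffs, the same labeling rule can be written as a small Python function (an illustrative sketch, not part of the original walkthrough):

```python
def tip_labels(tip_amount):
    """Mirror the T-SQL CASE expressions: return (tipped, tip_class)."""
    tipped = 1 if tip_amount > 0 else 0
    if tip_amount == 0:
        tip_class = 0
    elif tip_amount <= 5:
        tip_class = 1
    elif tip_amount <= 10:
        tip_class = 2
    elif tip_amount <= 20:
        tip_class = 3
    else:
        tip_class = 4
    return tipped, tip_class
```

Note the bin edges are inclusive on the upper end, matching the `<=` comparisons in the SQL.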
    

Feature Engineering: Count Features for Categorical Columns

This example transforms a categorical field into a numeric field by replacing each category with the count of its occurrences in the data.

nyctaxi_one_percent_insert_col = '''
    ALTER TABLE nyctaxi_one_percent ADD cmt_count int, vts_count int
'''

cursor.execute(nyctaxi_one_percent_insert_col)
cursor.commit()

nyctaxi_one_percent_update_col = '''
    WITH B AS
    (
        SELECT medallion, hack_license,
            SUM(CASE WHEN vendor_id = 'cmt' THEN 1 ELSE 0 END) AS cmt_count,
            SUM(CASE WHEN vendor_id = 'vts' THEN 1 ELSE 0 END) AS vts_count
        FROM nyctaxi_one_percent
        GROUP BY medallion, hack_license
    )

    UPDATE A
    SET A.cmt_count = B.cmt_count,
        A.vts_count = B.vts_count
    FROM nyctaxi_one_percent A INNER JOIN B
    ON A.medallion = B.medallion AND A.hack_license = B.hack_license
'''

cursor.execute(nyctaxi_one_percent_update_col)
cursor.commit()
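The count-feature construction can be mirrored in pandas with a grouped transform, which broadcasts each group's per-vendor counts back onto its rows just as the CTE plus UPDATE does above (toy data, illustrative only):

```python
import pandas as pd

# Hypothetical sample: one row per trip with its vendor.
df = pd.DataFrame({
    "medallion": ["A", "A", "A", "B"],
    "hack_license": ["H1", "H1", "H1", "H2"],
    "vendor_id": ["cmt", "vts", "cmt", "vts"],
})

# Count trips per vendor within each (medallion, hack_license) group and
# broadcast the counts to every row of the group.
grp = df.groupby(["medallion", "hack_license"])["vendor_id"]
df["cmt_count"] = grp.transform(lambda s: (s == "cmt").sum())
df["vts_count"] = grp.transform(lambda s: (s == "vts").sum())
```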

Feature Engineering: Bin Features for Numerical Columns

This example transforms a continuous numeric field into preset category ranges, that is, it turns the numeric field into a categorical field.

nyctaxi_one_percent_insert_col = '''
    ALTER TABLE nyctaxi_one_percent ADD trip_time_bin int
'''

cursor.execute(nyctaxi_one_percent_insert_col)
cursor.commit()

nyctaxi_one_percent_update_col = '''
    WITH B(medallion, hack_license, pickup_datetime, trip_time_in_secs, BinNumber) AS
    (
        SELECT medallion, hack_license, pickup_datetime, trip_time_in_secs,
        NTILE(5) OVER (ORDER BY trip_time_in_secs) AS BinNumber FROM nyctaxi_one_percent
    )

    UPDATE A
    SET A.trip_time_bin = B.BinNumber
    FROM nyctaxi_one_percent A INNER JOIN B
    ON A.medallion = B.medallion
    AND A.hack_license = B.hack_license
    AND A.pickup_datetime = B.pickup_datetime
'''

cursor.execute(nyctaxi_one_percent_update_col)
cursor.commit()
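NTILE(5) assigns each row to one of five equally sized buckets ordered by trip_time_in_secs; pandas' qcut reproduces the same bucketing for distinct values (a toy sketch with hypothetical durations):

```python
import pandas as pd

# Ten hypothetical trip durations in seconds.
df = pd.DataFrame({"trip_time_in_secs": [60, 120, 180, 240, 300, 360, 420, 480, 540, 600]})

# Quantile-based discretization into five bins labeled 1..5, like NTILE(5).
df["trip_time_bin"] = pd.qcut(df["trip_time_in_secs"], q=5, labels=[1, 2, 3, 4, 5]).astype(int)
```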

Feature Engineering: Extract Location Features from Decimal Latitude/Longitude

This example breaks down the decimal representation of a latitude and/or longitude field into multiple region fields of different granularity, such as country/region, city, town, and block. Note that the new geo-fields are not mapped to actual locations. For information on mapping geocode locations, see Bing Maps REST Services.

nyctaxi_one_percent_insert_col = '''
    ALTER TABLE nyctaxi_one_percent
    ADD l1 varchar(6), l2 varchar(3), l3 varchar(3), l4 varchar(3),
        l5 varchar(3), l6 varchar(3), l7 varchar(3)
'''

cursor.execute(nyctaxi_one_percent_insert_col)
cursor.commit()

nyctaxi_one_percent_update_col = '''
    UPDATE nyctaxi_one_percent
    SET l1 = ROUND(pickup_longitude, 0)
        , l2 = CASE WHEN LEN (PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1)) >= 1 THEN SUBSTRING(PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1),1,1) ELSE '0' END     
        , l3 = CASE WHEN LEN (PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1)) >= 2 THEN SUBSTRING(PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1),2,1) ELSE '0' END     
        , l4 = CASE WHEN LEN (PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1)) >= 3 THEN SUBSTRING(PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1),3,1) ELSE '0' END     
        , l5 = CASE WHEN LEN (PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1)) >= 4 THEN SUBSTRING(PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1),4,1) ELSE '0' END     
        , l6 = CASE WHEN LEN (PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1)) >= 5 THEN SUBSTRING(PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1),5,1) ELSE '0' END     
        , l7 = CASE WHEN LEN (PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1)) >= 6 THEN SUBSTRING(PARSENAME(ROUND(ABS(pickup_longitude) - FLOOR(ABS(pickup_longitude)),6),1),6,1) ELSE '0' END
'''

cursor.execute(nyctaxi_one_percent_update_col)
cursor.commit()
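The PARSENAME/SUBSTRING chain above simply peels off successive decimal digits of the coordinate. A pure-Python sketch of the same idea (note: for simplicity the first feature here is the absolute integer part, whereas the SQL uses ROUND):

```python
def location_features(longitude, n_digits=6):
    """Split a decimal coordinate into an integer-part feature plus its
    first n_digits fractional digits, one feature per digit."""
    frac = abs(longitude) - int(abs(longitude))   # fractional part, e.g. 0.978165
    digits = f"{frac:.{n_digits}f}"[2:]           # "978165"
    return [str(int(abs(longitude)))] + list(digits)

features = location_features(-73.978165)  # ["73", "9", "7", "8", "1", "6", "5"]
```

Coarser digits correspond to larger regions, so a model can learn location effects at several spatial granularities from a single coordinate.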

Verify the final form of the featurized table

query = '''SELECT TOP 100 * FROM nyctaxi_one_percent'''
pd.read_sql(query, conn)

We are now ready to proceed to model building and model deployment in Azure Machine Learning. The data is ready for any of the prediction problems identified earlier, namely:

  1. Binary classification: to predict whether or not a tip was paid for a trip.
  2. Multiclass classification: to predict the range of the tip paid, according to the previously defined classes.
  3. Regression task: to predict the amount of tip paid for a trip.

Building Models in Azure Machine Learning

To begin the modeling exercise, log in to your Azure Machine Learning workspace. If you have not yet created a machine learning workspace, see Create an Azure Machine Learning workspace.

  1. To get started with Azure Machine Learning, see What is Azure Machine Learning Studio?
  2. Log in to Azure Machine Learning Studio.
  3. The Studio home page provides a wealth of information, videos, tutorials, links to the Modules Reference, and other resources. For more information about Azure Machine Learning, consult the Azure Machine Learning Documentation Center.

A typical training experiment consists of the following steps:

  1. Create a +NEW experiment.
  2. Get the data into Azure Machine Learning.
  3. Pre-process, transform, and manipulate the data as needed.
  4. Generate features as needed.
  5. Split the data into training/validation/testing datasets (or have separate datasets for each).
  6. Select one or more machine learning algorithms depending on the learning problem to solve, for example, binary classification, multiclass classification, or regression.
  7. Train one or more models using the training dataset.
  8. Score the validation dataset using the trained model(s).
  9. Evaluate the model(s) to compute the relevant metrics for the learning problem.
  10. Fine-tune the model(s) and select the best model to deploy.

In this exercise, we have already explored and engineered the data in SQL Server and decided on the sample size to ingest into Azure Machine Learning. To build one or more of the prediction models identified earlier:

  1. Get the data into Azure Machine Learning using the Import Data module, available in the Data Input and Output section. For more information, see the Import Data module reference page.

    Azure Machine Learning Import Data

  2. Select Azure SQL Database as the Data source in the Properties panel.

  3. Enter the database DNS name in the Database server name field. Format: tcp:<your_virtual_machine_DNS_name>,1433

  4. Enter the Database name in the corresponding field.

  5. Enter the SQL user name in the Server user account name field, and the password in the Server user account password field.

  6. In the Database query edit text area, paste the query that extracts the necessary database fields (including any computed fields such as the labels) and down-samples the data to the desired sample size.

An example of a binary classification experiment reading data directly from the SQL Server database is shown in the figure below. Similar experiments can be constructed for multiclass classification and regression problems.

Azure Machine Learning Train

Important

In the modeling data extraction and sampling query examples provided in the previous sections, all labels for the three modeling exercises are included in the query. An important (required) step in each of the modeling exercises is to exclude the labels not needed for that exercise, along with any other target leaks. For example, when using binary classification, use the label tipped and exclude the fields tip_class, tip_amount, and total_amount. The latter are target leaks since they imply the tip paid.

To exclude unnecessary columns and/or target leaks, you can use the Select Columns in Dataset module or the Edit Metadata module. For more information, see the Select Columns in Dataset and Edit Metadata reference pages.

Deploying Models in Azure Machine Learning

When your model is ready, you can easily deploy it as a web service directly from the experiment. For more information about deploying Azure Machine Learning web services, see Deploy an Azure Machine Learning web service.

To deploy a new web service, you need to:

  1. Create a scoring experiment.
  2. Deploy the web service.

To create a scoring experiment from a Finished training experiment, click CREATE SCORING EXPERIMENT in the lower action bar.

Azure Scoring

Azure Machine Learning attempts to create a scoring experiment based on the components of the training experiment. In particular, it will:

  1. Save the trained model and remove the model training modules.
  2. Identify a logical input port to represent the expected input data schema.
  3. Identify a logical output port to represent the expected web service output schema.

When the scoring experiment is created, review it and adjust it as needed. A typical adjustment is to replace the input dataset and/or query with one that excludes label fields, as these are not available when the service is called. It is also a good practice to reduce the size of the input dataset and/or query to a few records, just enough to indicate the input schema. For the output port, it is common to exclude all input fields and include only the Scored Labels and Scored Probabilities in the output, using the Select Columns in Dataset module.

A sample scoring experiment is shown in the figure below. When ready to deploy, click the PUBLISH WEB SERVICE button in the lower action bar.

Azure Machine Learning Publish

To recap, in this walkthrough tutorial you have created an Azure data science environment and worked with a large public dataset, all the way from data acquisition through model training to deployment of an Azure Machine Learning web service.

License Information

This sample walkthrough and its accompanying scripts and IPython notebook(s) are shared by Microsoft under the MIT license. Check the LICENSE.txt file in the directory of the sample code on GitHub for more details.

References

Andrés Monroy NYC Taxi Trips Download Page
FOILing NYC's Taxi Trip Data by Chris Whong
NYC Taxi and Limousine Commission Research and Statistics