The Team Data Science Process in action: using Azure Synapse Analytics

In this tutorial, we walk you through building and deploying a machine learning model using Azure Synapse Analytics for a publicly available dataset -- the NYC Taxi Trips dataset. The models constructed address three tasks: binary classification (whether or not a tip is paid for a trip), multiclass classification (the range of the tip amount paid), and regression (the tip amount itself).

The procedure follows the Team Data Science Process (TDSP) workflow. We show how to set up a data science environment, how to load the data into Azure Synapse Analytics, and how to use either Azure Synapse Analytics or an IPython Notebook to explore the data and engineer features for modeling. We then show how to build and deploy a model with Azure Machine Learning.

The NYC Taxi Trips dataset

The NYC Taxi Trip data consists of about 20 GB of compressed CSV files (~48 GB uncompressed), recording more than 173 million individual trips and the fares paid for each trip. Each trip record includes the pickup and dropoff locations and times, the anonymized hack (driver's) license number, and the medallion (taxi's unique ID) number. The data covers all trips in the year 2013 and is provided in the following two datasets for each month:

  1. The trip_data.csv file contains trip details, such as number of passengers, pickup and dropoff points, trip duration, and trip length. Here are a few sample records:

medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude

89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868

  2. The trip_fare.csv file contains details of the fare paid for each trip, such as payment type, fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here are a few sample records:

medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount

89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7

0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6

DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5

The unique key used to join trip_data and trip_fare is composed of the following three fields (see the sketch after this list):

  • medallion,
  • hack_license, and
  • pickup_datetime.
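As a quick illustration of this key, here is a minimal pandas sketch that joins one month of trip and fare records; the monthly file names are placeholders, and it assumes the two CSV files have been downloaded locally:

import pandas as pd

# Placeholder file names for one month of trip and fare records.
trips = pd.read_csv('trip_data_1.csv', parse_dates=['pickup_datetime'])
fares = pd.read_csv('trip_fare_1.csv')

# The trip_fare header shown above has spaces after the commas; normalize the names first.
fares.columns = [c.strip() for c in fares.columns]
fares['pickup_datetime'] = pd.to_datetime(fares['pickup_datetime'])

# Join trip and fare records on the composite key.
joined = trips.merge(fares, on=['medallion', 'hack_license', 'pickup_datetime'])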

Address three types of prediction tasks

We formulate three prediction problems based on the tip_amount to illustrate three kinds of modeling tasks (a labeling sketch in Python follows the list):

  1. Binary classification: To predict whether or not a tip was paid for a trip; that is, a tip_amount greater than $0 is a positive example, while a tip_amount of $0 is a negative example.
  2. Multiclass classification: To predict the range of tip paid for the trip. We divide the tip_amount into five bins or classes:

Class 0 : tip_amount = $0

Class 1 : tip_amount > $0 and tip_amount <= $5

Class 2 : tip_amount > $5 and tip_amount <= $10

Class 3 : tip_amount > $10 and tip_amount <= $20

Class 4 : tip_amount > $20

  3. Regression task: To predict the amount of tip paid for a trip.
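For concreteness, here is a minimal Python sketch (our own helper, not part of the walkthrough's scripts) deriving both classification labels from tip_amount, mirroring the bins above:

def make_labels(tip_amount):
    """Return (tipped, tip_class) for a tip amount, per the definitions above."""
    tipped = 1 if tip_amount > 0 else 0
    if tip_amount == 0:
        tip_class = 0
    elif tip_amount <= 5:
        tip_class = 1
    elif tip_amount <= 10:
        tip_class = 2
    elif tip_amount <= 20:
        tip_class = 3
    else:
        tip_class = 4
    return tipped, tip_class

print make_labels(3.25)   # (1, 1) -- Python 2.7, as used later in this walkthrough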

Set up the Azure data science environment for advanced analytics

To set up your Azure data science environment, follow these steps.

Create your own Azure Blob storage account

  • When you provision your own Azure Blob storage, choose a geo-location for it in or as close as possible to China East, which is where the NYC Taxi data is stored. The data will be copied using AzCopy from the public blob storage container to a container in your own storage account. The closer your Azure Blob storage is to China East, the faster this task (Step 4) will be completed.

  • To create your own Azure storage account, follow the steps outlined in About Azure storage accounts. Be sure to make a note of the values of the following storage account credentials, as they will be needed later in this walkthrough.

    • Storage Account Name
    • Storage Account Key
    • Container Name (the container in Azure Blob storage where you want the data to be stored)

Provision your Azure Synapse Analytics instance. Follow the documentation at Create and query an Azure Synapse Analytics in the Azure portal to provision an Azure Synapse Analytics instance. Make a note of the following Azure Synapse Analytics credentials, which will be used in later steps.

  • Server Name: <server Name>.database.chinacloudapi.cn
  • SQLDW (Database) Name
  • Username
  • Password

Install Visual Studio and SQL Server Data Tools. For instructions, see Getting started with Visual Studio 2019 for Azure Synapse Analytics.

Connect to your Azure Synapse Analytics with Visual Studio.

Note

Run the following SQL query on the database you created in your Azure Synapse Analytics (instead of the query provided in step 3 of the connect topic) to create a master key.

BEGIN TRY
       --Try to create the master key
    CREATE MASTER KEY
END TRY
BEGIN CATCH
       --If the master key exists, do nothing
END CATCH;

Create an Azure Machine Learning workspace under your Azure subscription. For instructions, see Create an Azure Machine Learning workspace.

Load the data into Azure Synapse Analytics

Open a Windows PowerShell command console. Run the following PowerShell commands to download the example SQL script files that we share with you on GitHub to a local directory specified with the parameter -DestDir. You can change the value of -DestDir to any local directory; if -DestDir does not exist, the PowerShell script creates it.

Note

You might need to Run as Administrator when executing the following PowerShell script if your DestDir directory requires Administrator privileges to create or write to it.

$source = "https://raw.githubusercontent.com/Azure/Azure-MachineLearning-DataScience/master/Misc/SQLDW/Download_Scripts_SQLDW_Walkthrough.ps1"
$ps1_dest = "$pwd\Download_Scripts_SQLDW_Walkthrough.ps1"
$wc = New-Object System.Net.WebClient
$wc.DownloadFile($source, $ps1_dest)
.\Download_Scripts_SQLDW_Walkthrough.ps1 -DestDir 'C:\tempSQLDW'

After successful execution, your current working directory changes to -DestDir. You should see a screen like the following:

(Figure: the current working directory changes)

In your -DestDir, execute the following PowerShell script in administrator mode:

./SQLDW_Data_Import.ps1

When the PowerShell script runs for the first time, you are asked to input the information for your Azure Synapse Analytics and your Azure Blob storage account. When this first run completes, the credentials you entered have been written to a configuration file, SQLDW.conf, in the present working directory. On later runs, the script can read all needed parameters from this configuration file. If you need to change some parameters, you can either delete the configuration file and input the values at the prompts, or change the values by editing the SQLDW.conf file in your -DestDir directory.
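For orientation, SQLDW.conf is a plain-text file pairing each prompted parameter with the value you entered. The sketch below is purely hypothetical -- the actual key names are whatever the script writes -- and simply mirrors the credentials collected earlier:

ServerName=<server Name>.database.chinacloudapi.cn
DatabaseName=<SQLDW name>
Username=<user name>
Password=<password>
StorageAccountName=<storage account name>
StorageAccountKey=<storage account key>
ContainerName=<container name>
SchemaName=<schema name>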

Note

In order to avoid schema name conflicts with names that already exist in your Azure Synapse Analytics, when parameters are read directly from the SQLDW.conf file, a 3-digit random number is appended to the schema name from the SQLDW.conf file as the default schema name for each run. The PowerShell script may prompt you for a schema name: the name may be specified at your discretion.
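A minimal Python sketch of that suffixing behavior (the base name is illustrative):

import random

base_schema = 'nyctaxi'  # hypothetical value read from SQLDW.conf
default_schema = base_schema + str(random.randint(100, 999))  # e.g. 'nyctaxi427'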

This PowerShell script file completes the following tasks:

  • Downloads and installs AzCopy, if AzCopy is not already installed

    $AzCopy_path = SearchAzCopy
    if ($AzCopy_path -eq $null){
        Write-Host "AzCopy.exe is not found in C:\Program Files*. Now, start installing AzCopy..." -ForegroundColor "Yellow"
        InstallAzCopy
        $AzCopy_path = SearchAzCopy
    }
    $env_path = $env:Path
    for ($i=0; $i -lt $AzCopy_path.count; $i++){
        if ($AzCopy_path.count -eq 1){
            $AzCopy_path_i = $AzCopy_path
        } else {
            $AzCopy_path_i = $AzCopy_path[$i]
        }
        # Prepend AzCopy to the machine PATH if it is not already there.
        if ($env_path -notlike ('*' + $AzCopy_path_i + '*')){
            Write-Host $AzCopy_path_i 'not in system path, add it...'
            [Environment]::SetEnvironmentVariable("Path", "$AzCopy_path_i;$env_path", "Machine")
            $env:Path = [System.Environment]::GetEnvironmentVariable("Path","Machine")
            $env_path = $env:Path
        }
    }
  • Copies data to your private blob storage account from the public blob with AzCopy

    Write-Host "AzCopy is copying data from public blob to your storage account. It may take a while..." -ForegroundColor "Yellow"
    $start_time = Get-Date
    AzCopy.exe /Source:$Source /Dest:$DestURL /DestKey:$StorageAccountKey /S
    $end_time = Get-Date
    $time_span = $end_time - $start_time
    $total_seconds = [math]::Round($time_span.TotalSeconds,2)
    Write-Host "AzCopy finished copying data. Please check your storage account to verify." -ForegroundColor "Yellow"
    Write-Host "This step (copying data from public blob to your storage account) takes $total_seconds seconds." -ForegroundColor "Green"
    
  • Loads data using PolyBase (by executing LoadDataToSQLDW.sql) into your Azure Synapse Analytics from your private blob storage account with the following commands.

    • Create a schema

      EXEC (''CREATE SCHEMA {schemaname};'');
      
    • Create a database scoped credential

      CREATE DATABASE SCOPED CREDENTIAL {KeyAlias}
      WITH IDENTITY = ''asbkey'' ,
      Secret = ''{StorageAccountKey}''
      
    • Create an external data source for an Azure Storage blob

      CREATE EXTERNAL DATA SOURCE {nyctaxi_trip_storage}
      WITH
      (
          TYPE = HADOOP,
          LOCATION =''wasbs://{ContainerName}@{StorageAccountName}.blob.core.chinacloudapi.cn'',
          CREDENTIAL = {KeyAlias}
      )
      ;
      
      CREATE EXTERNAL DATA SOURCE {nyctaxi_fare_storage}
      WITH
      (
          TYPE = HADOOP,
          LOCATION =''wasbs://{ContainerName}@{StorageAccountName}.blob.core.chinacloudapi.cn'',
          CREDENTIAL = {KeyAlias}
      )
      ;
      
    • Create an external file format for a CSV file. Data is uncompressed, and fields are separated with the comma character.

      CREATE EXTERNAL FILE FORMAT {csv_file_format}
      WITH
      (
          FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS
          (
              FIELD_TERMINATOR ='','',
              USE_TYPE_DEFAULT = TRUE
          )
      )
      ;
      
    • Create external fare and trip tables for the NYC taxi dataset in Azure Blob storage.

      CREATE EXTERNAL TABLE {external_nyctaxi_fare}
      (
          medallion varchar(50) not null,
          hack_license varchar(50) not null,
          vendor_id char(3),
          pickup_datetime datetime not null,
          payment_type char(3),
          fare_amount float,
          surcharge float,
          mta_tax float,
          tip_amount float,
          tolls_amount float,
          total_amount float
      )
      with (
          LOCATION    = ''/nyctaxifare/'',
          DATA_SOURCE = {nyctaxi_fare_storage},
          FILE_FORMAT = {csv_file_format},
          REJECT_TYPE = VALUE,
          REJECT_VALUE = 12
      )
      
        CREATE EXTERNAL TABLE {external_nyctaxi_trip}
        (
               medallion varchar(50) not null,
               hack_license varchar(50)  not null,
               vendor_id char(3),
               rate_code char(3),
               store_and_fwd_flag char(3),
               pickup_datetime datetime  not null,
               dropoff_datetime datetime,
               passenger_count int,
               trip_time_in_secs bigint,
               trip_distance float,
               pickup_longitude varchar(30),
               pickup_latitude varchar(30),
               dropoff_longitude varchar(30),
               dropoff_latitude varchar(30)
        )
        with (
            LOCATION    = ''/nyctaxitrip/'',
            DATA_SOURCE = {nyctaxi_trip_storage},
            FILE_FORMAT = {csv_file_format},
            REJECT_TYPE = VALUE,
            REJECT_VALUE = 12
        )
      
      • Load data from external tables in Azure Blob storage to Azure Synapse Analytics

        CREATE TABLE {schemaname}.{nyctaxi_fare}
        WITH
        (
            CLUSTERED COLUMNSTORE INDEX,
            DISTRIBUTION = HASH(medallion)
        )
        AS
        SELECT *
        FROM   {external_nyctaxi_fare}
        ;
        
        CREATE TABLE {schemaname}.{nyctaxi_trip}
        WITH
        (
            CLUSTERED COLUMNSTORE INDEX,
            DISTRIBUTION = HASH(medallion)
        )
        AS
        SELECT *
        FROM   {external_nyctaxi_trip}
        ;
        
      • Create a sample data table (NYCTaxi_Sample) and insert data into it by selecting SQL queries on the trip and fare tables. (Some steps of this walkthrough need to use this sample table.)

        CREATE TABLE {schemaname}.{nyctaxi_sample}
        WITH
        (
            CLUSTERED COLUMNSTORE INDEX,
            DISTRIBUTION = HASH(medallion)
        )
        AS
        (
            SELECT t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax, f.tolls_amount, f.total_amount, f.tip_amount,
            tipped = CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END,
            tip_class = CASE
                    WHEN (tip_amount = 0) THEN 0
                    WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
                    WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
                    WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
                    ELSE 4
                END
            FROM {schemaname}.{nyctaxi_trip} t, {schemaname}.{nyctaxi_fare} f
            WHERE datepart("mi",t.pickup_datetime) = 1
            AND t.medallion = f.medallion
            AND   t.hack_license = f.hack_license
            AND   t.pickup_datetime = f.pickup_datetime
            AND   pickup_longitude <> ''0''
            AND   dropoff_longitude <> ''0''
        )
        ;
        

The geographic location of your storage accounts affects load times.

Note

Depending on the geographical location of your private blob storage account, the process of copying data from the public blob to your private storage account can take about 15 minutes or even longer, and the process of loading the data from your storage account into Azure Synapse Analytics could take 20 minutes or longer.

You will have to decide what to do if you have duplicate source and destination files.

Note

If the .csv files to be copied from the public blob storage to your private blob storage account already exist there, AzCopy asks whether you want to overwrite them. If you do not want to overwrite them, input n when prompted. If you want to overwrite all of them, input a when prompted. You can also input y to overwrite .csv files individually.

(Figure: output from AzCopy)

You can use your own data. If your data is on an on-premises machine in your real-life application, you can still use AzCopy to upload the on-premises data to your private Azure Blob storage. You only need to change the Source location, $Source = "http://getgoing.blob.core.windows.net/public/nyctaxidataset", in the AzCopy command of the PowerShell script file to the local directory that contains your data.
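If you prefer Python over AzCopy for such an upload, the legacy azure-storage SDK (the generation contemporary with this walkthrough) can do the same; a minimal sketch, assuming that SDK is installed and using placeholder names throughout:

from azure.storage.blob import BlockBlobService  # legacy azure-storage SDK

# Placeholders; use your own storage account credentials and paths.
blob_service = BlockBlobService(account_name='<StorageAccountName>',
                                account_key='<StorageAccountKey>')
blob_service.create_blob_from_path(
    '<ContainerName>',                 # destination container
    'nyctaxitrip/trip_data_1.csv',     # destination blob name (placeholder)
    r'C:\mydata\trip_data_1.csv')      # local source file (placeholder)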

Tip

If your data is already in your private Azure Blob storage in your real-life application, you can skip the AzCopy step in the PowerShell script and directly upload the data to Azure Synapse Analytics. This requires additional edits of the script to tailor it to the format of your data.

This PowerShell script also plugs the Azure Synapse Analytics information into the data exploration example files SQLDW_Explorations.sql, SQLDW_Explorations.ipynb, and SQLDW_Explorations_Scripts.py so that these three files are ready to be tried out instantly after the PowerShell script completes.

After a successful execution, you will see a screen like the following:

(Figure: output of a successful script execution)

Data exploration and feature engineering in Azure Synapse Analytics

In this section, we perform data exploration and feature generation by running SQL queries against Azure Synapse Analytics directly, using Visual Studio Data Tools. All SQL queries used in this section can be found in the sample script named SQLDW_Explorations.sql. This file has already been downloaded to your local directory by the PowerShell script. You can also retrieve it from GitHub, but the file on GitHub does not have the Azure Synapse Analytics information plugged in.

Connect to your Azure Synapse Analytics using Visual Studio with the Azure Synapse Analytics login name and password, and open the SQL Object Explorer to confirm that the database and tables have been imported. Retrieve the SQLDW_Explorations.sql file.

Note

To open a Parallel Data Warehouse (PDW) query editor, use the New Query command while your PDW is selected in the SQL Object Explorer. The standard SQL query editor is not supported by PDW.

Here are the types of data exploration and feature generation tasks performed in this section:

  • Explore data distributions of a few fields in varying time windows.
  • Investigate data quality of the longitude and latitude fields.
  • Generate binary and multiclass classification labels based on the tip_amount.
  • Generate features and compute/compare trip distances.
  • Join the two tables and extract a random sample that will be used to build models.

Data import verification

These queries provide a quick verification of the number of rows and columns in the tables populated earlier using PolyBase's parallel bulk import.

-- Report number of rows in table <nyctaxi_trip> without table scan

SELECT SUM(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_trip>')

-- Report number of columns in table <nyctaxi_trip>

SELECT COUNT(*) FROM information_schema.columns WHERE table_name = '<nyctaxi_trip>' AND table_schema = '<schemaname>'

Output: You should get 173,179,759 rows and 14 columns.

Exploration: Trip distribution by medallion

This example query identifies the medallions (taxi numbers) that completed more than 100 trips within a specified time period. The query would benefit from the partitioned table access, since it is conditioned by the partition scheme of pickup_datetime. Querying the full dataset will also make use of the partitioned table and/or index scan.

SELECT medallion, COUNT(*)
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20130331'
GROUP BY medallion
HAVING COUNT(*) > 100

Output: The query should return a table with rows specifying the 13,369 medallions (taxis) that qualify and the number of trips each completed in the queried time window. The last column contains the count of trips completed.

Exploration: Trip distribution by medallion and hack_license

This example identifies the medallions (taxi numbers) and hack_license numbers (drivers) that completed more than 100 trips within a specified time period.

SELECT medallion, hack_license, COUNT(*)
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20130131'
GROUP BY medallion, hack_license
HAVING COUNT(*) > 100

Output: The query should return a table with 13,369 rows, specifying the 13,369 car/driver IDs that completed more than 100 trips in the queried time window. The last column contains the count of trips completed.

Data quality assessment: Verify records with incorrect longitude and/or latitude

This example investigates whether any of the longitude and/or latitude fields contain an invalid value (degrees should be between -90 and 90), or have (0, 0) coordinates.

SELECT COUNT(*) FROM <schemaname>.<nyctaxi_trip>
WHERE pickup_datetime BETWEEN '20130101' AND '20130331'
AND  (CAST(pickup_longitude AS float) NOT BETWEEN -90 AND 90
OR    CAST(pickup_latitude AS float) NOT BETWEEN -90 AND 90
OR    CAST(dropoff_longitude AS float) NOT BETWEEN -90 AND 90
OR    CAST(dropoff_latitude AS float) NOT BETWEEN -90 AND 90
OR    (pickup_longitude = '0' AND pickup_latitude = '0')
OR    (dropoff_longitude = '0' AND dropoff_latitude = '0'))

Output: The query returns 837,467 trips that have invalid longitude and/or latitude fields.

Exploration: Tipped vs. not tipped trips distribution

This example finds the number of trips that were tipped vs. the number that were not tipped in a specified time period (or in the full dataset, if covering the full year as it is set up here). This distribution reflects the binary label distribution to be later used for binary classification modeling.

SELECT tipped, COUNT(*) AS tip_freq FROM (
  SELECT CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END AS tipped, tip_amount
  FROM <schemaname>.<nyctaxi_fare>
  WHERE pickup_datetime BETWEEN '20130101' AND '20131231') tc
GROUP BY tipped

Output: The query should return the following tip frequencies for the year 2013: 90,447,622 tipped and 82,264,709 not tipped.

Exploration: Tip class/range distribution

This example computes the distribution of tip ranges in a given time period (or in the full dataset, if covering the full year). This distribution of the label classes will be used later for multiclass classification modeling.

SELECT tip_class, COUNT(*) AS tip_freq FROM (
    SELECT CASE
        WHEN (tip_amount = 0) THEN 0
        WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
        WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
        WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
        ELSE 4
    END AS tip_class
FROM <schemaname>.<nyctaxi_fare>
WHERE pickup_datetime BETWEEN '20130101' AND '20131231') tc
GROUP BY tip_class

Output:

tip_class  tip_freq
1          82230915
2          6198803
3          1932223
0          82264625
4          85765

Exploration: Compute and compare trip distance

This example computes the direct distance between the pickup and dropoff coordinates with a user-defined SQL function and returns a random sample of the results for comparison with the recorded trip_distance. The example limits the results to valid coordinates only, using the data quality assessment query covered earlier.

/****** Object:  UserDefinedFunction [dbo].[fnCalculateDistance] ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name = 'fnCalculateDistance')
  DROP FUNCTION fnCalculateDistance
GO

-- User-defined function to calculate the direct distance in miles between two geographical coordinates.
CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)

RETURNS float
AS
BEGIN
      DECLARE @distance decimal(28, 10)
      -- Convert to radians
      SET @Lat1 = @Lat1 / 57.2958
      SET @Long1 = @Long1 / 57.2958
      SET @Lat2 = @Lat2 / 57.2958
      SET @Long2 = @Long2 / 57.2958
      -- Calculate distance
      SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))
      --Convert to miles
      IF @distance <> 0
      BEGIN
        SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
      END
      RETURN @distance
END
GO

SELECT pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS DirectDistance
FROM <schemaname>.<nyctaxi_trip>
WHERE datepart("mi",pickup_datetime)=1
AND CAST(pickup_latitude AS float) BETWEEN -90 AND 90
AND CAST(dropoff_latitude AS float) BETWEEN -90 AND 90
AND pickup_longitude != '0' AND dropoff_longitude != '0'
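For reference, the UDF implements the spherical law of cosines. With the coordinates converted to radians (dividing by 57.2958, an approximation of $180/\pi$), it computes

$$x = \sin\varphi_1\sin\varphi_2 + \cos\varphi_1\cos\varphi_2\cos(\lambda_2-\lambda_1), \qquad d = R\,\arctan\!\left(\frac{\sqrt{1-x^2}}{x}\right),$$

with $R = 3958.75$ miles (the Earth's radius). Since $x$ is the cosine of the central angle between the two points, the arctangent term recovers that angle, and $d$ is the great-circle distance.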

Feature engineering using SQL functions

Sometimes SQL functions can be an efficient option for feature engineering. In this walkthrough, we defined a SQL function to calculate the direct distance between the pickup and dropoff locations. You can run the following SQL scripts in Visual Studio Data Tools.

Here is the SQL script that defines the distance function.

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name = 'fnCalculateDistance')
  DROP FUNCTION fnCalculateDistance
GO

-- User-defined function to calculate the direct distance between two geographical coordinates.
CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)

RETURNS float
AS
BEGIN
      DECLARE @distance decimal(28, 10)
      -- Convert to radians
      SET @Lat1 = @Lat1 / 57.2958
      SET @Long1 = @Long1 / 57.2958
      SET @Lat2 = @Lat2 / 57.2958
      SET @Long2 = @Long2 / 57.2958
      -- Calculate distance
      SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))
      --Convert to miles
      IF @distance <> 0
      BEGIN
        SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
      END
      RETURN @distance
END
GO

Here is an example of calling this function to generate features in your SQL query:

-- Sample query to call the function to create features

SELECT pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude,
dbo.fnCalculateDistance(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude) AS DirectDistance
FROM <schemaname>.<nyctaxi_trip>
WHERE datepart("mi",pickup_datetime)=1
AND CAST(pickup_latitude AS float) BETWEEN -90 AND 90
AND CAST(dropoff_latitude AS float) BETWEEN -90 AND 90
AND pickup_longitude != '0' AND dropoff_longitude != '0'

Output: This query generates a table (with 2,803,538 rows) with the pickup and dropoff latitudes and longitudes and the corresponding direct distances in miles. Here are the results for the first three rows:

(Row number)  pickup_latitude  pickup_longitude  dropoff_latitude  dropoff_longitude  DirectDistance
1             40.731804        -74.001083        40.736622         -73.988953         .7169601222
2             40.715794        -74.010635        40.725338         -74.00399          .7448343721
3             40.761456        -73.999886        40.766544         -73.988228         0.7037227967
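As a quick sanity check outside the database, here is a minimal Python sketch of the same spherical law of cosines computation (our own helper, not part of the walkthrough's scripts); applied to row 1 above, it should reproduce a DirectDistance of roughly 0.7170 miles:

import math

def direct_distance_miles(lat1, lon1, lat2, lon2):
    # Mirrors the fnCalculateDistance SQL UDF (math.radians is the exact 180/pi conversion).
    lat1, lon1, lat2, lon2 = [math.radians(v) for v in (lat1, lon1, lat2, lon2)]
    x = (math.sin(lat1) * math.sin(lat2) +
         math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    if x >= 1.0:  # guard against floating-point overshoot for near-identical points
        return 0.0
    return 3958.75 * math.atan(math.sqrt(1.0 - x * x) / x)

print(direct_distance_miles(40.731804, -74.001083, 40.736622, -73.988953))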

Prepare data for model building

The following query joins the nyctaxi_trip and nyctaxi_fare tables, generates a binary classification label tipped and a multiclass classification label tip_class, and extracts a sample from the full joined dataset. The sampling is done by retrieving a subset of the trips based on pickup time (only trips whose pickup time falls on minute 1 of the hour, roughly a 1/60 sample). This query can be copied and pasted directly into the Azure Machine Learning Studio (classic) Import Data module for direct data ingestion from the SQL database instance in Azure. The query excludes records with incorrect (0, 0) coordinates.

SELECT t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax, f.tolls_amount,     f.total_amount, f.tip_amount,
    CASE WHEN (tip_amount > 0) THEN 1 ELSE 0 END AS tipped,
    CASE WHEN (tip_amount = 0) THEN 0
        WHEN (tip_amount > 0 AND tip_amount <= 5) THEN 1
        WHEN (tip_amount > 5 AND tip_amount <= 10) THEN 2
        WHEN (tip_amount > 10 AND tip_amount <= 20) THEN 3
        ELSE 4
    END AS tip_class
FROM <schemaname>.<nyctaxi_trip> t, <schemaname>.<nyctaxi_fare> f
WHERE datepart("mi",t.pickup_datetime) = 1
AND   t.medallion = f.medallion
AND   t.hack_license = f.hack_license
AND   t.pickup_datetime = f.pickup_datetime
AND   pickup_longitude != '0' AND dropoff_longitude != '0'

When you are ready to proceed to Azure Machine Learning, you may either:

  1. Save the final SQL query to extract and sample the data, and copy-paste the query directly into an Import Data module in Azure Machine Learning, or
  2. Persist the sampled and engineered data you plan to use for model building in a new Azure Synapse Analytics table, and use the new table in the Import Data module in Azure Machine Learning. The PowerShell script in the earlier step has already done this task for you, so you can read directly from this table in the Import Data module.

Data exploration and feature engineering in IPython notebook

In this section, we perform data exploration and feature generation using both Python and SQL queries against the Azure Synapse Analytics created earlier. A sample IPython notebook named SQLDW_Explorations.ipynb and a Python script file named SQLDW_Explorations_Scripts.py have been downloaded to your local directory. They are also available on GitHub. The two files contain identical Python code; the script file is provided in case you do not have an IPython Notebook server. Both samples are developed for Python 2.7.

The needed Azure Synapse Analytics information in the sample IPython Notebook and the Python script file downloaded to your local machine has already been plugged in by the PowerShell script. They are executable without any modification.

If you have already set up an Azure Machine Learning workspace, you can directly upload the sample IPython Notebook to the AzureML IPython Notebook service and start running it. Here are the steps to upload it:

  1. Log in to your Azure Machine Learning workspace, click Studio at the top, and click NOTEBOOKS on the left side of the web page.

    (Figure: click Studio, then NOTEBOOKS)

  2. Click NEW in the bottom-left corner of the web page, and select Python 2. Then provide a name for the notebook and click the check mark to create the new blank IPython Notebook.

    (Figure: click NEW, then select Python 2)

  3. Click the Jupyter symbol in the top-left corner of the new IPython Notebook.

    (Figure: click the Jupyter symbol)

  4. Drag and drop the sample IPython Notebook onto the tree page of your AzureML IPython Notebook service, and click Upload. The sample IPython Notebook is then uploaded to the AzureML IPython Notebook service.

    (Figure: click Upload)

To run the sample IPython Notebook or the Python script file, the following Python packages are needed. If you are using the AzureML IPython Notebook service, these packages are pre-installed.

  • pandas
  • numpy
  • matplotlib
  • pyodbc
  • PyTables

When building advanced analytical solutions on Azure Machine Learning with large data, here is the recommended sequence:

  • Read a small sample of the data into an in-memory data frame.
  • Perform some visualizations and explorations using the sampled data.
  • Experiment with feature engineering using the sampled data.
  • For larger data exploration, data manipulation, and feature engineering, use Python to issue SQL queries directly against the Azure Synapse Analytics.
  • Decide on a sample size suitable for Azure Machine Learning model building.

The following are a few data exploration, data visualization, and feature engineering examples. More data explorations can be found in the sample IPython Notebook and the sample Python script file.

Initialize database credentials

Initialize your database connection settings in the following variables:

# Imports used throughout the snippets in this section.
import pyodbc
import pandas as pd
import time
import matplotlib.pyplot as plt

SERVER_NAME = '<server name>'
DATABASE_NAME = '<database name>'
USERID = '<user name>'
PASSWORD = '<password>'
DB_DRIVER = '<database driver>'

Create database connection

Here is the connection string that creates the connection to the database.

CONNECTION_STRING = 'DRIVER={'+DB_DRIVER+'};SERVER='+SERVER_NAME+';DATABASE='+DATABASE_NAME+';UID='+USERID+';PWD='+PASSWORD
conn = pyodbc.connect(CONNECTION_STRING)

Report number of rows and columns in table <nyctaxi_trip>

nrows = pd.read_sql('''
    SELECT SUM(rows) FROM sys.partitions
    WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_trip>')
''', conn)

print 'Total number of rows = %d' % nrows.iloc[0,0]

ncols = pd.read_sql('''
    SELECT COUNT(*) FROM information_schema.columns
    WHERE table_name = ('<nyctaxi_trip>') AND table_schema = ('<schemaname>')
''', conn)

print 'Total number of columns = %d' % ncols.iloc[0,0]
  • Total number of rows = 173179759
  • Total number of columns = 14

Report number of rows and columns in table <nyctaxi_fare>

nrows = pd.read_sql('''
    SELECT SUM(rows) FROM sys.partitions
    WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_fare>')
''', conn)

print 'Total number of rows = %d' % nrows.iloc[0,0]

ncols = pd.read_sql('''
    SELECT COUNT(*) FROM information_schema.columns
    WHERE table_name = ('<nyctaxi_fare>') AND table_schema = ('<schemaname>')
''', conn)

print 'Total number of columns = %d' % ncols.iloc[0,0]
  • Total number of rows = 173179759
  • Total number of columns = 11

Read in a small data sample from the Azure Synapse Analytics database

t0 = time.time()

query = '''
    SELECT TOP 10000 t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax,
        f.tolls_amount, f.total_amount, f.tip_amount
    FROM <schemaname>.<nyctaxi_trip> t, <schemaname>.<nyctaxi_fare> f
    WHERE datepart("mi",t.pickup_datetime) = 1
    AND   t.medallion = f.medallion
    AND   t.hack_license = f.hack_license
    AND   t.pickup_datetime = f.pickup_datetime
'''

df1 = pd.read_sql(query, conn)

t1 = time.time()
print 'Time to read the sample table is %f seconds' % (t1-t0)

print 'Number of rows and columns retrieved = (%d, %d)' % (df1.shape[0], df1.shape[1])

Time to read the sample table is 14.096495 seconds. Number of rows and columns retrieved = (1000, 21).

Descriptive statistics

Now you are ready to explore the sampled data. We start by looking at some descriptive statistics for the trip_distance (or any other fields you choose to specify).

df1['trip_distance'].describe()

Visualization: Box plot example

Next, we look at the box plot for the trip distance to visualize the quantiles.

df1.boxplot(column='trip_distance',return_type='dict')

(Figure: box plot output)

Visualization: Distribution plot example

These plots visualize the distribution and a histogram for the sampled trip distances.

fig = plt.figure()
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
df1['trip_distance'].plot(ax=ax1,kind='kde', style='b-')
df1['trip_distance'].hist(ax=ax2, bins=100, color='k')

(Figure: distribution plot output)

Visualization: Bar and line plots

In this example, we bin the trip distance into five bins and visualize the binning results.

trip_dist_bins = [0, 1, 2, 4, 10, 1000]
df1['trip_distance']
trip_dist_bin_id = pd.cut(df1['trip_distance'], trip_dist_bins)
trip_dist_bin_id

We can plot the above bin distribution in a bar or line plot with:

pd.Series(trip_dist_bin_id).value_counts().plot(kind='bar')

(Figure: bar plot output)

and

pd.Series(trip_dist_bin_id).value_counts().plot(kind='line')

(Figure: line plot output)

Visualization: Scatterplot examples

We show a scatter plot between trip_time_in_secs and trip_distance to see if there is any correlation.

plt.scatter(df1['trip_time_in_secs'], df1['trip_distance'])

(Figure: scatter plot of trip time vs. trip distance)

Similarly, we can check the relationship between passenger_count and trip_distance.

plt.scatter(df1['passenger_count'], df1['trip_distance'])

(Figure: scatter plot of passenger count vs. trip distance)

Data exploration on sampled data using SQL queries in IPython notebook

In this section, we explore data distributions using the sampled data that is persisted in the new table we created above. Similar explorations can be performed using the original tables.

Exploration: Report the number of rows and columns in the sampled table

nrows = pd.read_sql('''SELECT SUM(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('<schemaname>.<nyctaxi_sample>')''', conn)
print 'Number of rows in sample = %d' % nrows.iloc[0,0]

ncols = pd.read_sql('''SELECT count(*) FROM information_schema.columns WHERE table_name = ('<nyctaxi_sample>') AND table_schema = '<schemaname>' ''', conn)
print 'Number of columns in sample = %d' % ncols.iloc[0,0]

Exploration: Tipped/not tipped distribution

query = '''
SELECT tipped, count(*) AS tip_freq
    FROM <schemaname>.<nyctaxi_sample>
    GROUP BY tipped
    '''

pd.read_sql(query, conn)

Exploration: Tip class distribution

query = '''
    SELECT tip_class, count(*) AS tip_freq
    FROM <schemaname>.<nyctaxi_sample>
    GROUP BY tip_class
'''

tip_class_dist = pd.read_sql(query, conn)

Exploration: Plot the tip distribution by class

tip_class_dist['tip_freq'].plot(kind='bar')

(Figure: tip distribution by class)

Exploration: Daily distribution of trips

query = '''
    SELECT CONVERT(date, dropoff_datetime) AS date, COUNT(*) AS c
    FROM <schemaname>.<nyctaxi_sample>
    GROUP BY CONVERT(date, dropoff_datetime)
'''

pd.read_sql(query,conn)

Exploration: Trip distribution per medallion

query = '''
    SELECT medallion,count(*) AS c
    FROM <schemaname>.<nyctaxi_sample>
    GROUP BY medallion
'''

pd.read_sql(query,conn)

Exploration: Trip distribution by medallion and hack license

query = '''select medallion, hack_license,count(*) from <schemaname>.<nyctaxi_sample> group by medallion, hack_license'''
pd.read_sql(query,conn)

Exploration: Trip time distribution

query = '''select trip_time_in_secs, count(*) from <schemaname>.<nyctaxi_sample> group by trip_time_in_secs order by count(*) desc'''
pd.read_sql(query,conn)

Exploration: Trip distance distribution

query = '''select floor(trip_distance/5)*5 as tripbin, count(*) from <schemaname>.<nyctaxi_sample> group by floor(trip_distance/5)*5 order by count(*) desc'''
pd.read_sql(query,conn)

Exploration: Payment type distribution

query = '''select payment_type,count(*) from <schemaname>.<nyctaxi_sample> group by payment_type'''
pd.read_sql(query,conn)

Verify the final form of the featurized table

query = '''SELECT TOP 100 * FROM <schemaname>.<nyctaxi_sample>'''
pd.read_sql(query,conn)

Build models in Azure Machine Learning

We are now ready to proceed to model building and model deployment in Azure Machine Learning. The data is ready to be used in any of the prediction problems identified earlier, namely:

  1. Binary classification: To predict whether or not a tip was paid for a trip.
  2. Multiclass classification: To predict the range of tip paid, according to the previously defined classes.
  3. Regression task: To predict the amount of tip paid for a trip.

To begin the modeling exercise, log in to your Azure Machine Learning (classic) workspace. If you have not yet created a machine learning workspace, see Create an Azure Machine Learning Studio (classic) workspace.

  1. To get started with Azure Machine Learning, see What is Azure Machine Learning Studio (classic)?
  2. Log in to Azure Machine Learning Studio (classic).
  3. The Machine Learning Studio (classic) Home page provides a wealth of information, videos, tutorials, links to the Modules Reference, and other resources. For more information about Azure Machine Learning, see the Azure Machine Learning Documentation Center.

A typical training experiment consists of the following steps:

  1. Create a +NEW experiment.
  2. Get the data into Azure Machine Learning Studio (classic).
  3. Pre-process, transform, and manipulate the data as needed.
  4. Generate features as needed.
  5. Split the data into training/validation/testing datasets (or have separate datasets for each).
  6. Select one or more machine learning algorithms depending on the learning problem to solve, for example, binary classification, multiclass classification, or regression.
  7. Train one or more models using the training dataset.
  8. Score the validation dataset using the trained model(s).
  9. Evaluate the model(s) to compute the relevant metrics for the learning problem.
  10. Tune the model(s) and select the best model to deploy.

In this exercise, we have already explored and engineered the data in Azure Synapse Analytics, and decided on the sample size to ingest into Azure Machine Learning Studio (classic). Here is the procedure to build one or more of the prediction models:

  1. Get the data into Azure Machine Learning Studio (classic) using the Import Data module, available in the Data Input and Output section. For more information, see the Import Data module reference page.

    (Figure: Azure ML Import Data module)

  2. Select Azure SQL Database as the Data source in the Properties panel.

  3. Enter the database DNS name in the Database server name field. Format: tcp:<your_virtual_machine_DNS_name>,1433

  4. Enter the Database name in the corresponding field.

  5. Enter the SQL user name in Server user account name, and the password in Server user account password.

  6. In the Database query edit text area, paste the query that extracts the necessary database fields (including any computed fields such as the labels) and down-samples the data to the desired sample size.

An example of a binary classification experiment reading data directly from the Azure Synapse Analytics database is in the figure below (remember to replace the table names nyctaxi_trip and nyctaxi_fare with the schema name and the table names you used in your walkthrough). Similar experiments can be constructed for multiclass classification and regression problems.

(Figure: Azure ML training experiment)

Important

In the modeling data extraction and sampling query examples provided in previous sections, all labels for the three modeling exercises are included in the query. An important (required) step in each of the modeling exercises is to exclude the unnecessary labels for the other two problems, and any other target leaks. For example, when using binary classification, use the label tipped and exclude the fields tip_class, tip_amount, and total_amount. The latter are target leaks since they imply the tip paid.

To exclude any unnecessary columns or target leaks, you may use the Select Columns in Dataset module or the Edit Metadata module. For more information, see the Select Columns in Dataset and Edit Metadata reference pages.
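Equivalently, if you prepare the data in pandas rather than in Studio modules, here is a minimal sketch (assuming a data frame like the df1 sample above, augmented with the computed labels) of dropping the leaks for the binary task:

# Hypothetical: keep 'tipped' as the label and drop the other labels/leaks.
leaks = ['tip_class', 'tip_amount', 'total_amount']
df_binary = df1.drop(leaks, axis=1)  # axis=1 drops columns (works in Python 2.7-era pandas)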

Deploy models in Azure Machine Learning

When your model is ready, you can easily deploy it as a web service directly from the experiment. For more information about deploying Azure ML web services, see Deploy an Azure Machine Learning web service.

To deploy a new web service, you need to:

  1. Create a scoring experiment.
  2. Deploy the web service.

To create a scoring experiment from a Finished training experiment, click CREATE SCORING EXPERIMENT in the lower action bar.

(Figure: create a scoring experiment)

Azure Machine Learning will attempt to create a scoring experiment based on the components of the training experiment. In particular, it will:

  1. Save the trained model and remove the model training modules.
  2. Identify a logical input port to represent the expected input data schema.
  3. Identify a logical output port to represent the expected web service output schema.

When the scoring experiment is created, review the results and make adjustments as needed. A typical adjustment is to replace the input dataset or query with one that excludes the label fields, because these labels will not be mapped to the schema when the service is called. It is also a good practice to reduce the size of the input dataset and/or query to a few records, enough to indicate the input schema. For the output port, it is common to exclude all input fields and include only the Scored Labels and Scored Probabilities in the output, using the Select Columns in Dataset module.

A sample scoring experiment is provided in the figure below. When ready to deploy, click the PUBLISH WEB SERVICE button in the lower action bar.

(Figure: publish the web service)

Summary

To recap what we have done in this walkthrough tutorial: you created an Azure data science environment and worked with a large public dataset, taking it through the Team Data Science Process all the way from data acquisition to model training, and then to the deployment of an Azure Machine Learning web service.

License information

This sample walkthrough and its accompanying scripts and IPython notebook(s) are shared by Microsoft under the MIT license. Check the LICENSE.txt file in the directory of the sample code on GitHub for more details.
