Analyze flight delay data by using Apache Hive in HDInsight

Apache Hive provides a means of running Apache Hadoop MapReduce jobs through a SQL-like scripting language called HiveQL, which can be applied to summarizing, querying, and analyzing large volumes of data.

Important

The steps in this document require a Windows-based HDInsight cluster. Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows. For steps that work with a Linux-based cluster, see Analyze flight delay data by using Apache Hive in HDInsight (Linux).

One of the major benefits of Azure HDInsight is the separation of data storage and compute. HDInsight uses Azure Blob storage for data storage. A typical job involves three parts:

  1. Store data in Azure Blob storage. For example, weather data, sensor data, web logs, and in this case, flight delay data are saved into Azure Blob storage.
  2. Run jobs. When it is time to process the data, you run a Windows PowerShell script (or a client application) to create an HDInsight cluster, run jobs, and delete the cluster. The jobs save output data to Azure Blob storage. The output data is retained even after the cluster is deleted, so you pay for only what you have consumed.
  3. Retrieve the output from Azure Blob storage, or as in this tutorial, export the data to an Azure SQL database.

The following diagram illustrates the scenario and the structure of this tutorial:

(Diagram: HDI.FlightDelays.flow)

The numbers in the diagram correspond to the section titles. M stands for the main process. A stands for the content in the appendixes.

The main portion of the tutorial shows you how to use one Windows PowerShell script to perform the following tasks:

  • Create an HDInsight cluster.
  • Run a Hive job on the cluster to calculate average delays at airports. The flight delay data is stored in an Azure Blob storage account.
  • Run a Sqoop job to export the Hive job output to an Azure SQL database.
  • Delete the HDInsight cluster.

In the appendixes, you can find the instructions for uploading the flight delay data, creating and uploading the HiveQL script, and preparing the Azure SQL database for the Sqoop job.

Note

The steps in this document are specific to Windows-based HDInsight clusters. For steps that work with a Linux-based cluster, see Analyze flight delay data using Apache Hive in HDInsight (Linux).

Prerequisites

Before you begin this tutorial, you must have the following items:

  • An Azure subscription.
  • A workstation with Azure PowerShell.

Files used in this tutorial

This tutorial uses the on-time performance data of airline flights from the Research and Innovative Technology Administration, Bureau of Transportation Statistics (RITA). A copy of the data has been uploaded to an Azure Blob storage container with the Public Blob access permission. A part of your PowerShell script copies the data from the public blob container to the default blob container of your cluster. The HiveQL script is also copied to the same Blob container. If you want to learn how to get the data and upload it to your own Storage account, and how to create and upload the HiveQL script file, see Appendix A and Appendix B.

The following table lists the files used in this tutorial:

Files | Description
wasb://flightdelay@hditutorialdata.blob.core.windows.net/flightdelays.hql | The HiveQL script file used by the Hive job. This script has been uploaded to an Azure Blob storage account with public access. Appendix B has instructions on preparing this file and uploading it to your own Azure Blob storage account.
wasb://flightdelay@hditutorialdata.blob.core.windows.net/2013Data | Input data for the Hive job. The data has been uploaded to an Azure Blob storage account with public access. Appendix A has instructions on getting the data and uploading it to your own Azure Blob storage account.
/tutorials/flightdelays/output | The output path for the Hive job. The default container is used for storing the output data.
/tutorials/flightdelays/jobstatus | The Hive job status folder on the default container.
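
Before you run the main script, you can confirm that the public sample files are reachable. This short sketch lists the blobs in the public flightdelay container by using an anonymous storage context, the same technique the main script uses to read the sample data:

    # List the public sample files (anonymous access, no account key needed)
    $context = New-AzureStorageContext -StorageAccountName "hditutorialdata" -Anonymous
    Get-AzureStorageBlob -Container "flightdelay" -Context $context | Select-Object Name, Length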

Create the cluster and run Hive/Sqoop jobs

Hadoop MapReduce is batch processing. The most cost-effective way to run a Hive job is to create a cluster for the job and delete the cluster after the job is completed. The following script covers the whole process. For more information on creating an HDInsight cluster and running Hive jobs, see Create Apache Hadoop clusters in HDInsight and Use Apache Hive with HDInsight.

To run the Hive queries by using Azure PowerShell

  1. Create an Azure SQL database and the table for the Sqoop job output by using the instructions in Appendix C.

  2. Open Windows PowerShell ISE, and run the following script:

    $subscriptionID = "<Azure Subscription ID>"
    $nameToken = "<Enter an Alias>"
    
    ###########################################
    # You must configure the following variables
    # for an existing Azure SQL Database
    ###########################################
    $existingSqlDatabaseServerName = "<Azure SQL Database Server>"
    $existingSqlDatabaseLogin = "<Azure SQL Database Server Login>"
    $existingSqlDatabasePassword = "<Azure SQL Database Server login password>"
    $existingSqlDatabaseName = "<Azure SQL Database name>"
    
    $localFolder = "E:\Tutorials\Downloads\" # A temp location for copying files.
    $azcopyPath = "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy" # depends on the version, the folder can be different
    
    ###########################################
    # (Optional) configure the following variables
    ###########################################
    
    $namePrefix = $nameToken.ToLower() + (Get-Date -Format "MMdd")
    
    $resourceGroupName = $namePrefix + "rg"
    $location = "CHINA EAST"
    
    $HDInsightClusterName = $namePrefix + "hdi"
    $httpUserName = "admin"
    $httpPassword = "<Enter the Password>"
    
    $defaultStorageAccountName = $namePrefix + "store"
    $defaultBlobContainerName = $HDInsightClusterName # use the cluster name
    
    $existingSqlDatabaseTableName = "AvgDelays"
    $sqlDatabaseConnectionString = "jdbc:sqlserver://$existingSqlDatabaseServerName.database.chinacloudapi.cn;user=$existingSqlDatabaseLogin@$existingSqlDatabaseServerName;password=$existingSqlDatabasePassword;database=$existingSqlDatabaseName"
    
    $hqlScriptFile = "/tutorials/flightdelays/flightdelays.hql"
    
    $jobStatusFolder = "/tutorials/flightdelays/jobstatus"
    
    ###########################################
    # Login
    ###########################################
    try{
        $acct = Get-AzureRmSubscription
    }
    catch{
        Connect-AzureRmAccount -EnvironmentName AzureChinaCloud
    }
    Select-AzureRmSubscription -SubscriptionID $subscriptionID
    
    ###########################################
    # Create a new HDInsight cluster
    ###########################################
    
    # Create ARM group
    New-AzureRmResourceGroup -Name $resourceGroupName -Location $location
    
    # Create the default storage account
    New-AzureRmStorageAccount -ResourceGroupName $resourceGroupName -Name $defaultStorageAccountName -Location $location -Type Standard_LRS
    
    # Create the default Blob container
    $defaultStorageAccountKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -Name $defaultStorageAccountName)[0].Value
    $defaultStorageAccountContext = New-AzureStorageContext -StorageAccountName $defaultStorageAccountName -StorageAccountKey $defaultStorageAccountKey
    New-AzureStorageContainer -Name $defaultBlobContainerName -Context $defaultStorageAccountContext
    
    # Create the HDInsight cluster
    $pw = ConvertTo-SecureString -String $httpPassword -AsPlainText -Force
    $httpCredential = New-Object System.Management.Automation.PSCredential($httpUserName,$pw)
    
    New-AzureRmHDInsightCluster `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $HDInsightClusterName `
        -Location $location `
        -ClusterType Hadoop `
        -OSType Windows `
        -ClusterSizeInNodes 2 `
        -HttpCredential $httpCredential `
        -DefaultStorageAccountName "$defaultStorageAccountName.blob.core.chinacloudapi.cn" `
        -DefaultStorageAccountKey $defaultStorageAccountKey `
        -DefaultStorageContainer $defaultBlobContainerName
    
    ###########################################
    # Prepare the HiveQL script and source data
    ###########################################
    
    # Create the temp location
    New-Item -Path $localFolder -ItemType Directory -Force
    
    # Download the sample file from Azure Blob storage
    $context = New-AzureStorageContext -StorageAccountName "hditutorialdata" -Anonymous
    $blobs = Get-AzureStorageBlob -Container "flightdelay" -Context $context
    $blobs | Get-AzureStorageBlobContent -Context $context -Destination $localFolder
    
    # Upload data to default container
    
    $azcopycmd = "cmd.exe /C '$azcopyPath\azcopy.exe' /S /Source:'$localFolder' /Dest:'https://$defaultStorageAccountName.blob.core.chinacloudapi.cn/$defaultBlobContainerName/tutorials/flightdelays' /DestKey:$defaultStorageAccountKey"
    
    Invoke-Expression -Command:$azcopycmd
    
    ###########################################
    # Submit the Hive job
    ###########################################
    Use-AzureRmHDInsightCluster -ClusterName $HDInsightClusterName -HttpCredential $httpCredential
    $response = Invoke-AzureRmHDInsightHiveJob `
                    -Files $hqlScriptFile `
                    -DefaultContainer $defaultBlobContainerName `
                    -DefaultStorageAccountName $defaultStorageAccountName `
                    -DefaultStorageAccountKey $defaultStorageAccountKey `
                    -StatusFolder $jobStatusFolder
    
    write-Host $response
    
    ###########################################
    # Submit the Sqoop job
    ###########################################
    $exportDir = "wasb://$defaultBlobContainerName@$defaultStorageAccountName.blob.core.chinacloudapi.cn/tutorials/flightdelays/output"
    
    $sqoopDef = New-AzureRmHDInsightSqoopJobDefinition `
                    -Command "export --connect $sqlDatabaseConnectionString --table $sqlDatabaseTableName --export-dir $exportDir --fields-terminated-by \001 "
    $sqoopJob = Start-AzureRmHDInsightJob `
                    -ResourceGroupName $resourceGroupName `
                    -ClusterName $hdinsightClusterName `
                    -HttpCredential $httpCredential `
                    -JobDefinition $sqoopDef #-Debug -Verbose
    
    Wait-AzureRmHDInsightJob `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $HDInsightClusterName `
        -HttpCredential $httpCredential `
        -WaitTimeoutInSeconds 3600 `
        -Job $sqoopJob.JobId
    
    Get-AzureRmHDInsightJobOutput `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $hdinsightClusterName `
        -HttpCredential $httpCredential `
        -DefaultContainer $defaultBlobContainerName `
        -DefaultStorageAccountName $defaultStorageAccountName `
        -DefaultStorageAccountKey $defaultStorageAccountKey `
        -JobId $sqoopJob.JobId `
        -DisplayOutputType StandardError
    
    ###########################################
    # Delete the cluster
    ###########################################
    Remove-AzureRmHDInsightCluster -ResourceGroupName $resourceGroupName -ClusterName $hdinsightClusterName
    
  3. Connect to your SQL database and see average flight delays by city in the AvgDelays table:

    (Screenshot: HDI.FlightDelays.AvgDelays.Dataset)
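
    If you prefer to check the results from the same PowerShell session instead of a SQL client, the following sketch queries the AvgDelays table. It assumes the connection variables defined at the top of the script and the table schema created in Appendix C:

    # Show the ten cities with the highest average weather delay
    $conn = New-Object System.Data.SqlClient.SqlConnection
    $conn.ConnectionString = "Data Source=$existingSqlDatabaseServerName.database.chinacloudapi.cn;Initial Catalog=$existingSqlDatabaseName;User ID=$existingSqlDatabaseLogin;Password=$existingSqlDatabasePassword;Encrypt=true;Trusted_Connection=false;"
    $conn.Open()
    $cmd = New-Object System.Data.SqlClient.SqlCommand
    $cmd.Connection = $conn
    $cmd.CommandText = "SELECT TOP 10 origin_city_name, weather_delay FROM AvgDelays ORDER BY weather_delay DESC;"
    $reader = $cmd.ExecuteReader()
    while ($reader.Read()) {
        Write-Host ("{0}: {1:N2} minutes" -f $reader["origin_city_name"], $reader["weather_delay"])
    }
    $conn.Close()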


Appendix A - Upload flight delay data to Azure Blob storage

Uploading the data files and the HiveQL script file (see Appendix B) requires some planning. The idea is to store the data files and the HiveQL file before creating an HDInsight cluster and running the Hive job. You have two options:

  • Use the same Azure Storage account that will be used by the HDInsight cluster as the default file system. Because the HDInsight cluster will have the Storage account access key, you don't need to make any additional changes.
  • Use an Azure Storage account that is different from the HDInsight cluster's default file system. In this case, you must modify the creation part of the Windows PowerShell script in Create the cluster and run Hive/Sqoop jobs to link the Storage account as an additional Storage account, as shown in the sketch after this list. For instructions, see Create Apache Hadoop clusters in HDInsight. The HDInsight cluster then knows the access key for the Storage account.
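
If you choose the second option, one way to link the extra account is to build a cluster configuration object and pass it to New-AzureRmHDInsightCluster through its -Config parameter. A minimal sketch, assuming $extraStorageAccountName and $extraStorageAccountKey hold your account's name and key:

    # Build a configuration that links an additional Storage account,
    # then pass it to New-AzureRmHDInsightCluster via -Config.
    $config = New-AzureRmHDInsightClusterConfig
    $config = Add-AzureRmHDInsightStorage `
                -Config $config `
                -StorageAccountName "$extraStorageAccountName.blob.core.chinacloudapi.cn" `
                -StorageAccountKey $extraStorageAccountKey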

Note

The Blob storage path for the data files is hard-coded in the HiveQL script file. You must update it accordingly.
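
For example, the CREATE EXTERNAL TABLE statement built in Appendix B ends with a LOCATION clause. If you upload the data to your own account with the script below, the clause would change along these lines, with your account and container names substituted through the script's variables:

    # In the Appendix B script, repoint the LOCATION clause of $hqlCreateDelaysRaw
    # from the public sample account to your own account and container:
    "LOCATION 'wasb://$blobContainerName@$storageAccountName.blob.core.chinacloudapi.cn/tutorials/flightdelay/2013data';"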

To download the flight data

  1. Browse to Research and Innovative Technology Administration, Bureau of Transportation Statistics.

  2. On the page, select the following values:

    Name | Value
    Filter Year | 2013
    Filter Period | January
    Fields | Year, FlightDate, UniqueCarrier, Carrier, FlightNum, OriginAirportID, Origin, OriginCityName, OriginState, DestAirportID, Dest, DestCityName, DestState, DepDelayMinutes, ArrDelay, ArrDelayMinutes, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay (clear all other fields)
  3. Click Download.

  4. Unzip the file to the C:\Tutorials\FlightDelay\2013Data folder. Each file is a CSV file and is approximately 60 MB in size.

  5. Rename the file to the name of the month that it contains data for. For example, the file containing the January data would be named January.csv.

  6. Repeat steps 2 through 5 to download a file for each of the 12 months in 2013. You need a minimum of one file to run the tutorial. A scripted unzip-and-rename sketch follows this list.
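
If you want to script the unzip-and-rename steps, a sketch like the following handles one month. The archive name is a placeholder, because the Bureau of Transportation Statistics assigns the download file name when you request it; Expand-Archive requires PowerShell 5.0 or later:

    # Placeholder names: adjust them to match your actual download.
    $zipFile = "C:\Tutorials\FlightDelay\Downloads\January.zip"
    $tempFolder = "C:\Tutorials\FlightDelay\Downloads\temp"
    $dataFolder = "C:\Tutorials\FlightDelay\2013Data"

    Expand-Archive -Path $zipFile -DestinationPath $tempFolder -Force

    # The archive contains a single CSV; move it into place, named for its month.
    Get-ChildItem -Path $tempFolder -Filter "*.csv" |
        Move-Item -Destination (Join-Path $dataFolder "January.csv") -Force
    Remove-Item -Path $tempFolder -Recurse -Force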

To upload the flight delay data to Azure Blob storage

  1. Prepare the parameters:

    Variable Name | Notes
    $storageAccountName | The Azure Storage account where you want to upload the data to.
    $blobContainerName | The Blob container where you want to upload the data to.
  2. Open Azure PowerShell ISE.

  3. Paste the following script into the script pane:

    [CmdletBinding()]
    Param(
    
        [Parameter(Mandatory=$True,
                    HelpMessage="Enter the Azure storage account name for creating a new HDInsight cluster. If the account doesn't exist, the script will create one.")]
        [String]$storageAccountName,
    
        [Parameter(Mandatory=$True,
                    HelpMessage="Enter the Azure blob container name for creating a new HDInsight cluster. If not specified, the HDInsight cluster name will be used.")]
        [String]$blobContainerName
    )
    
    #Region - Variables
    $localFolder = "C:\Tutorials\FlightDelay\2013Data"  # The source folder
    $destFolder = "tutorials/flightdelay/2013data"     #The blob name prefix for the files to be uploaded
    #EndRegion
    
    #Region - Connect to Azure subscription
    Write-Host "`nConnecting to your Azure subscription ..." -ForegroundColor Green
    try{Get-AzureRmContext}
    catch{Connect-AzureRmAccount -EnvironmentName AzureChinaCloud}
    #EndRegion
    
    #Region - Validate user input
    Write-Host "`nValidating the Azure Storage account and the Blob container..." -ForegroundColor Green
    # Validate the Storage account
    if (-not (Get-AzureRmStorageAccount|Where-Object{$_.StorageAccountName -eq $storageAccountName}))
    {
        Write-Host "The storage account, $storageAccountName, doesn't exist." -ForegroundColor Red
        exit
    }
    else{
        $resourceGroupName = (Get-AzureRmStorageAccount|Where-Object{$_.StorageAccountName -eq $storageAccountName}).ResourceGroupName
    }
    
    # Validate the container
    $storageAccountKey = (Get-AzureRmStorageAccountKey -StorageAccountName $storageAccountName -ResourceGroupName $resourceGroupName)[0].Value
    $storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
    
    if (-not (Get-AzureStorageContainer -Context $storageContext |Where-Object{$_.Name -eq $blobContainerName}))
    {
        Write-Host "The Blob container, $blobContainerName, doesn't exist" -ForegroundColor Red
        Exit
    }
    #EndRegion
    
    #Region - Copy the file from local workstation to Azure Blob storage
    if (test-path -Path $localFolder)
    {
        foreach ($item in Get-ChildItem -Path $localFolder){
            $fileName = "$localFolder\$item"
            $blobName = "$destFolder/$item"
    
            Write-Host "Copying $fileName to $blobName" -ForegroundColor Green
    
            Set-AzureStorageBlobContent -File $fileName -Container $blobContainerName -Blob $blobName -Context $storageContext
        }
    }
    else
    {
        Write-Host "The source folder on the workstation doesn't exist" -ForegroundColor Red
    }
    
    # List the uploaded files on HDInsight
    Get-AzureStorageBlob -Container $blobContainerName  -Context $storageContext -Prefix $destFolder
    #EndRegion
    
  4. Press F5 to run the script.
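
Instead of answering the prompts in the ISE, you can also save the script to a file and pass the parameters on the command line. The file name below is only an example:

    .\Upload-FlightData.ps1 -storageAccountName "<your storage account>" -blobContainerName "<your container>"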

If you choose to use a different method for uploading the files, make sure the file path is tutorials/flightdelay/2013data, matching the $destFolder value in the script. The syntax for accessing the files is:

wasb://<ContainerName>@<StorageAccountName>.blob.core.chinacloudapi.cn/tutorials/flightdelay/2013data

The path tutorials/flightdelay/2013data is the virtual folder that was created when you uploaded the files. Verify that there are 12 files, one for each month.

Note

You must update the Hive query to read from the new location.

You must either configure the container access permission to be public or bind the Storage account to the HDInsight cluster. Otherwise, the Hive query string will not be able to access the data files.
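
To make the container public from PowerShell, a one-line sketch like the following works, assuming $storageContext was created as in the upload script. The Blob permission level allows anonymous reads of individual blobs without allowing the container contents to be listed:

    Set-AzureStorageContainerAcl -Name $blobContainerName -Permission Blob -Context $storageContext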


Appendix B - Create and upload a HiveQL script

Using Azure PowerShell, you can run multiple HiveQL statements one at a time, or package the HiveQL statements into a script file. This section shows you how to create a HiveQL script and upload it to Azure Blob storage by using Azure PowerShell. Hive requires HiveQL scripts to be stored in Azure Blob storage.

The HiveQL script performs the following operations:

  1. Drop the delays_raw table, in case the table already exists.
  2. Create the delays_raw external Hive table, pointing to the Blob storage location that holds the flight delay files. This query specifies that fields are delimited by "," and that lines are terminated by "\n". This poses a problem when field values contain commas, because Hive cannot differentiate between a comma that is a field delimiter and one that is part of a field value (which is the case in the field values for ORIGIN_CITY_NAME and DEST_CITY_NAME). To address this, the query creates TEMP columns to hold the data that is incorrectly split into columns.
  3. Drop the delays table, in case the table already exists.
  4. Create the delays table. It is helpful to clean up the data before further processing. This query creates a new table, delays, from the delays_raw table. Note that the TEMP columns (as mentioned previously) are not copied, and that the substring function is used to remove quotation marks from the data.
  5. Compute the average weather delay, group the results by city name, and output the results to Blob storage. Note that the query removes apostrophes from the data and excludes rows where the value for weather_delay is null. This is necessary because Sqoop, used later in this tutorial, doesn't handle those values gracefully by default.
For a full list of HiveQL commands, see Apache Hive Data Definition Language. Each HiveQL command must terminate with a semicolon.
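
As a quick test of the one-statement-at-a-time style, you can submit a single inline statement without packaging a script file. This sketch reuses the cluster name and HTTP credential variables from the main script:

    # Run one HiveQL statement inline rather than from a script file in Blob storage
    Use-AzureRmHDInsightCluster -ClusterName $HDInsightClusterName -HttpCredential $httpCredential
    Invoke-AzureRmHDInsightHiveJob -Query "SHOW TABLES;" `
        -DefaultContainer $defaultBlobContainerName `
        -DefaultStorageAccountName $defaultStorageAccountName `
        -DefaultStorageAccountKey $defaultStorageAccountKey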

To create a HiveQL script file

  1. Prepare the parameters:

    Variable Name | Notes
    $storageAccountName | The Azure Storage account where you want to upload the HiveQL script to.
    $blobContainerName | The Blob container where you want to upload the HiveQL script to.
  2. Open Azure PowerShell ISE.

  3. Copy and paste the following script into the script pane:

    [CmdletBinding()]
    Param(
    
        # Azure Blob storage variables
        [Parameter(Mandatory=$True,
                    HelpMessage="Enter the Azure storage account name for creating a new HDInsight cluster. If the account doesn't exist, the script will create one.")]
        [String]$storageAccountName,
    
        [Parameter(Mandatory=$True,
                    HelpMessage="Enter the Azure blob container name for creating a new HDInsight cluster. If not specified, the HDInsight cluster name will be used.")]
        [String]$blobContainerName
    )
    
    #region - Define variables
    # Treat all errors as terminating
    $ErrorActionPreference = "Stop"
    
    # The HiveQL script file is exported as this file before it's uploaded to Blob storage
    $hqlLocalFileName = "e:\tutorials\flightdelay\flightdelays.hql"
    
    # The HiveQL script file will be uploaded to Blob storage as this blob name
    $hqlBlobName = "tutorials/flightdelay/flightdelays.hql"
    
    # These two constants are used by the HiveQL script file
    #$srcDataFolder = "tutorials/flightdelay/data"
    $dstDataFolder = "/tutorials/flightdelay/output"
    #endregion
    
    #Region - Connect to Azure subscription
    Write-Host "`nConnecting to your Azure subscription ..." -ForegroundColor Green
    try{Get-AzureRmContext}
    catch{Connect-AzureRmAccount -EnvironmentName AzureChinaCloud}
    #EndRegion
    
    #Region - Validate user input
    Write-Host "`nValidating the Azure Storage account and the Blob container..." -ForegroundColor Green
    # Validate the Storage account
    if (-not (Get-AzureRmStorageAccount|Where-Object{$_.StorageAccountName -eq $storageAccountName}))
    {
        Write-Host "The storage account, $storageAccountName, doesn't exist." -ForegroundColor Red
        exit
    }
    else{
        $resourceGroupName = (Get-AzureRmStorageAccount|Where-Object{$_.StorageAccountName -eq $storageAccountName}).ResourceGroupName
    }
    
    # Validate the container
    $storageAccountKey = (Get-AzureRmStorageAccountKey -StorageAccountName $storageAccountName -ResourceGroupName $resourceGroupName)[0].Value
    $storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
    
    if (-not (Get-AzureStorageContainer -Context $storageContext |Where-Object{$_.Name -eq $blobContainerName}))
    {
        Write-Host "The Blob container, $blobContainerName, doesn't exist" -ForegroundColor Red
        Exit
    }
    #EndRegion
    
    #region - Validate the file and file path
    
    # Check if a file with the same file name already exists on the workstation
    Write-Host "`nvalidating the folder structure on the workstation for saving the HQL script file ..."  -ForegroundColor Green
    if (test-path $hqlLocalFileName){
    
        $isDelete = Read-Host 'The file, ' $hqlLocalFileName ', exists.  Do you want to overwrite it? (Y/N)'
    
        if ($isDelete.ToLower() -ne "y")
        {
            Exit
        }
    }
    
    # Create the folder if it doesn't exist
    $folder = split-path $hqlLocalFileName
    if (-not (test-path $folder))
    {
        Write-Host "`nCreating folder, $folder ..." -ForegroundColor Green
    
        new-item $folder -ItemType directory
    }
    #endregion
    
    #region - Write the Hive script into a local file
    Write-Host "`nWriting the Hive script into a file on your workstation ..." `
                -ForegroundColor Green
    
    $hqlDropDelaysRaw = "DROP TABLE delays_raw;"
    
    $hqlCreateDelaysRaw = "CREATE EXTERNAL TABLE delays_raw (" +
            "YEAR string, " +
            "FL_DATE string, " +
            "UNIQUE_CARRIER string, " +
            "CARRIER string, " +
            "FL_NUM string, " +
            "ORIGIN_AIRPORT_ID string, " +
            "ORIGIN string, " +
            "ORIGIN_CITY_NAME string, " +
            "ORIGIN_CITY_NAME_TEMP string, " +
            "ORIGIN_STATE_ABR string, " +
            "DEST_AIRPORT_ID string, " +
            "DEST string, " +
            "DEST_CITY_NAME string, " +
            "DEST_CITY_NAME_TEMP string, " +
            "DEST_STATE_ABR string, " +
            "DEP_DELAY_NEW float, " +
            "ARR_DELAY_NEW float, " +
            "CARRIER_DELAY float, " +
            "WEATHER_DELAY float, " +
            "NAS_DELAY float, " +
            "SECURITY_DELAY float, " +
            "LATE_AIRCRAFT_DELAY float) " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
        "LINES TERMINATED BY '\n' " +
        "STORED AS TEXTFILE " +
        "LOCATION 'wasb://flightdelay@hditutorialdata.blob.core.windows.net/2013Data';"
    
    $hqlDropDelays = "DROP TABLE delays;"
    
    $hqlCreateDelays = "CREATE TABLE delays AS " +
        "SELECT YEAR AS year, " +
            "FL_DATE AS flight_date, " +
            "substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS unique_carrier, " +
            "substring(CARRIER, 2, length(CARRIER) -1) AS carrier, " +
            "substring(FL_NUM, 2, length(FL_NUM) -1) AS flight_num, " +
            "ORIGIN_AIRPORT_ID AS origin_airport_id, " +
            "substring(ORIGIN, 2, length(ORIGIN) -1) AS origin_airport_code, " +
            "substring(ORIGIN_CITY_NAME, 2) AS origin_city_name, " +
            "substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1)  AS origin_state_abr, " +
            "DEST_AIRPORT_ID AS dest_airport_id, " +
            "substring(DEST, 2, length(DEST) -1) AS dest_airport_code, " +
            "substring(DEST_CITY_NAME,2) AS dest_city_name, " +
            "substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS dest_state_abr, " +
            "DEP_DELAY_NEW AS dep_delay_new, " +
            "ARR_DELAY_NEW AS arr_delay_new, " +
            "CARRIER_DELAY AS carrier_delay, " +
            "WEATHER_DELAY AS weather_delay, " +
            "NAS_DELAY AS nas_delay, " +
            "SECURITY_DELAY AS security_delay, " +
            "LATE_AIRCRAFT_DELAY AS late_aircraft_delay " +
        "FROM delays_raw;"
    
    $hqlInsertLocal = "INSERT OVERWRITE DIRECTORY '$dstDataFolder' " +
        "SELECT regexp_replace(origin_city_name, '''', ''), " +
            "avg(weather_delay) " +
        "FROM delays " +
        "WHERE weather_delay IS NOT NULL " +
        "GROUP BY origin_city_name;"
    
    $hqlScript = $hqlDropDelaysRaw + $hqlCreateDelaysRaw + $hqlDropDelays + $hqlCreateDelays + $hqlInsertLocal
    
    $hqlScript | Out-File $hqlLocalFileName -Encoding ascii -Force
    #endregion
    
    #region - Upload the Hive script to the default Blob container
    Write-Host "`nUploading the Hive script to the default Blob container ..." -ForegroundColor Green
    
    # Create a storage context object
    $storageAccountKey = (Get-AzureRmStorageAccountKey -StorageAccountName $storageAccountName -ResourceGroupName $resourceGroupName)[0].Value
    $destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
    
    # Upload the file from local workstation to Blob storage
    Set-AzureStorageBlobContent -File $hqlLocalFileName -Container $blobContainerName -Blob $hqlBlobName -Context $destContext
    #endregion
    
    Write-host "`nEnd of the PowerShell script" -ForegroundColor Green
    

    Here are the variables used in the script:

    • $hqlLocalFileName - The script saves the HiveQL script file locally before uploading it to Blob storage. This is the file name. The default value is e:\tutorials\flightdelay\flightdelays.hql.
    • $hqlBlobName - This is the blob name used for the HiveQL script file in Azure Blob storage. The default value is tutorials/flightdelay/flightdelays.hql. Because the file is written directly to Azure Blob storage, there is NOT a "/" at the beginning of the blob name. If you want to access the file from Blob storage, you need to add a "/" at the beginning of the file name. A sketch of the two forms follows this list.
    • $srcDataFolder and $dstDataFolder - "tutorials/flightdelay/data" and "/tutorials/flightdelay/output", respectively.
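
    The distinction between the two forms looks like this in practice (a sketch using the variables above):

    # Blob name used at upload time - no leading "/":
    $hqlBlobName = "tutorials/flightdelay/flightdelays.hql"
    # Path a Hive job uses to read the same file through the cluster's default container:
    $hqlScriptFile = "/tutorials/flightdelay/flightdelays.hql"
    # Equivalent fully qualified form:
    # wasb://$blobContainerName@$storageAccountName.blob.core.chinacloudapi.cn/tutorials/flightdelay/flightdelays.hql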

Appendix C - Prepare an Azure SQL database for the Sqoop job output

To prepare the SQL database

  1. Prepare the parameters:

    Variable Name | Notes
    $resourceGroupName | The Azure resource group name. It is created if it doesn't exist.
    $sqlDatabaseServer | The name of the Azure SQL database server. It is created if it doesn't exist.
    $sqlDatabaseLogin | The admin login name for the Azure SQL database server. If $sqlDatabaseServer is an existing server, the login and login password are used to authenticate with the server. Otherwise, they are used to create a new server.
    $sqlDatabasePassword | The admin login password for the Azure SQL database server.
    $sqlDatabaseLocation | This value is used only when you're creating a new Azure SQL database server (for example, China North).
    $sqlDatabaseName | The SQL database used to create the AvgDelays table for the Sqoop job. It is created if it doesn't exist. The table name for the Sqoop job output is AvgDelays.
  2. Open Azure PowerShell ISE.

  3. Copy and paste the following script into the script pane:

    [CmdletBinding()]
    Param(
    
        # Azure Resource group variables
        [Parameter(Mandatory=$True,
                HelpMessage="Enter the Azure resource group name. It will be created if it doesn't exist.")]
        [String]$resourceGroupName,
    
        # SQL database server variables
        [Parameter(Mandatory=$True,
                HelpMessage="Enter the Azure SQL Database Server Name. It will be created if it doesn't exist.")]
        [String]$sqlDatabaseServer,
    
        [Parameter(Mandatory=$True,
                HelpMessage="Enter the Azure SQL Database admin user.")]
        [String]$sqlDatabaseLogin,
    
        [Parameter(Mandatory=$True,
                HelpMessage="Enter the Azure SQL Database admin user password.")]
        [String]$sqlDatabasePassword,
    
        [Parameter(Mandatory=$True,
                HelpMessage="Enter the region to create the Database in.")]
        [String]$sqlDatabaseLocation,   #For example, China North.
    
        # SQL database variables
        [Parameter(Mandatory=$True,
                HelpMessage="Enter the database name. It will be created if it doesn't exist.")]
        [String]$sqlDatabaseName # specify the database name if you have one created. Otherwise use "" to have the script create one for you.
    )
    
    # Treat all errors as terminating
    $ErrorActionPreference = "Stop"
    
    #region - Constants and variables
    
    # IP address REST service used for retrieving external IP address and creating firewall rules
    [String]$ipAddressRestService = "https://bot.whatismyipaddress.com"
    [String]$fireWallRuleName = "FlightDelay"
    
    # SQL database variables
    [String]$sqlDatabaseMaxSizeGB = 10
    
    #SQL query string for creating AvgDelays table
    [String]$sqlDatabaseTableName = "AvgDelays"
    [String]$sqlCreateAvgDelaysTable = " CREATE TABLE [dbo].[$sqlDatabaseTableName](
                [origin_city_name] [nvarchar](50) NOT NULL,
                [weather_delay] float,
            CONSTRAINT [PK_$sqlDatabaseTableName] PRIMARY KEY CLUSTERED
            (
                [origin_city_name] ASC
            )
            )"
    #endregion
    
    #Region - Connect to Azure subscription
    Write-Host "`nConnecting to your Azure subscription ..." -ForegroundColor Green
    try{Get-AzureRmContext}
    catch{Connect-AzureRmAccount -EnvironmentName AzureChinaCloud}
    #EndRegion
    
    #region - Create and validate Azure resource group
    try{
        Get-AzureRmResourceGroup -Name $resourceGroupName
    }
    catch{
        New-AzureRmResourceGroup -Name $resourceGroupName -Location $sqlDatabaseLocation
    }
    
    #EndRegion
    
    #region - Create and validate Azure SQL database server
    try{
        Get-AzureRmSqlServer -ServerName $sqlDatabaseServer -ResourceGroupName $resourceGroupName}
    catch{
        Write-Host "`nCreating SQL Database server ..."  -ForegroundColor Green
    
        $sqlDatabasePW = ConvertTo-SecureString -String $sqlDatabasePassword -AsPlainText -Force
        $credential = New-Object System.Management.Automation.PSCredential($sqlDatabaseLogin,$sqlDatabasePW)
    
        $sqlDatabaseServer = (New-AzureRmSqlServer -ResourceGroupName $resourceGroupName -ServerName $sqlDatabaseServer -SqlAdministratorCredentials $credential -Location $sqlDatabaseLocation).ServerName
        Write-Host "`tThe new SQL database server name is $sqlDatabaseServer." -ForegroundColor Cyan
    
        Write-Host "`nCreating firewall rule, $fireWallRuleName ..." -ForegroundColor Green
        $workstationIPAddress = Invoke-RestMethod $ipAddressRestService
        New-AzureRmSqlServerFirewallRule -ResourceGroupName $resourceGroupName -ServerName $sqlDatabaseServer -FirewallRuleName "$fireWallRuleName-workstation" -StartIpAddress $workstationIPAddress -EndIpAddress $workstationIPAddress
    
        #To allow other Azure services to access the server add a firewall rule and set both the StartIpAddress and EndIpAddress to 0.0.0.0. Note that this allows Azure traffic from any Azure subscription to access the server.
        New-AzureRmSqlServerFirewallRule -ResourceGroupName $resourceGroupName -ServerName $sqlDatabaseServer -FirewallRuleName "$fireWallRuleName-Azureservices" -StartIpAddress "0.0.0.0" -EndIpAddress "0.0.0.0"
    }
    
    #endregion
    
    #region - Create and validate Azure SQL database
    
    try {
        Get-AzureRmSqlDatabase -ResourceGroupName $resourceGroupName -ServerName $sqlDatabaseServer -DatabaseName $sqlDatabaseName
    }
    catch {
        Write-Host "`nCreating SQL Database, $sqlDatabaseName ..."  -ForegroundColor Green
        New-AzureRMSqlDatabase -ResourceGroupName $resourceGroupName -ServerName $sqlDatabaseServer -DatabaseName $sqlDatabaseName -Edition "Standard" -RequestedServiceObjectiveName "S1"
    }
    
    #endregion
    
    #region -  Execute an SQL command to create the AvgDelays table
    
    Write-Host "`nCreating SQL Database table ..."  -ForegroundColor Green
    $conn = New-Object System.Data.SqlClient.SqlConnection
    $conn.ConnectionString = "Data Source=$sqlDatabaseServer.database.chinacloudapi.cn;Initial Catalog=$sqlDatabaseName;User ID=$sqlDatabaseLogin;Password=$sqlDatabasePassword;Encrypt=true;Trusted_Connection=false;"
    $conn.open()
    $cmd = New-Object System.Data.SqlClient.SqlCommand
    $cmd.connection = $conn
    $cmd.commandtext = $sqlCreateAvgDelaysTable
    $cmd.executenonquery()
    
    $conn.close()
    
    Write-host "`nEnd of the PowerShell script" -ForegroundColor Green
    

    Note

    The script uses a representational state transfer (REST) service, https://bot.whatismyipaddress.com, to retrieve your external IP address. The IP address is used to create a firewall rule for your SQL database server.

    Here are some variables used in the script:

    • $ipAddressRestService - The default value is https://bot.whatismyipaddress.com. It is a public REST service for getting your external IP address. You can use other services if you want. The external IP address retrieved through the service is used to create a firewall rule for your Azure SQL database server, so that you can access the database from your workstation (by using a Windows PowerShell script).
    • $fireWallRuleName - This is the name of the firewall rule for the Azure SQL database server. The default name is FlightDelay. You can rename it if you want.
    • $sqlDatabaseMaxSizeGB - This value is used only when you're creating a new Azure SQL database. The default value is 10 GB, which is sufficient for this tutorial.
    • $sqlDatabaseName - This value is used only when you're creating a new Azure SQL database. The default value is HDISqoop. If you rename it, you must update the Sqoop Windows PowerShell script accordingly.
  4. Press F5 to run the script.

  5. Validate the script output. Make sure the script ran successfully.

Next steps

Now you understand how to upload a file to Azure Blob storage, how to populate an Apache Hive table by using the data from Azure Blob storage, how to run Apache Hive queries, and how to use Sqoop to export data from Apache Hadoop HDFS to an Azure SQL database. To learn more, see the following articles: