Tutorial: Extract, transform, and load data using Interactive Query in Azure HDInsight

In this tutorial, you download a raw CSV data file of publicly available flight data. Import it into HDInsight cluster storage, and then transform the data using Interactive Query in Azure HDInsight. Once the data is transformed, you load that data into an Azure SQL database using Apache Sqoop.

This tutorial covers the following tasks:

  • Download the sample flight data
  • Upload data to an HDInsight cluster
  • Transform the data using Interactive Query
  • Create a table in an Azure SQL database
  • Use Sqoop to export data to an Azure SQL database

Prerequisites

Download the flight data

  1. Browse to Research and Innovative Technology Administration, Bureau of Transportation Statistics.

  2. On the page, clear all fields, and then select the following values:

     Name           Value
     Filter Year    2019
     Filter Period  January
     Fields         Year, FlightDate, Reporting_Airline, DOT_ID_Reporting_Airline, Flight_Number_Reporting_Airline, OriginAirportID, Origin, OriginCityName, OriginState, DestAirportID, Dest, DestCityName, DestState, DepDelayMinutes, ArrDelay, ArrDelayMinutes, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay

  3. Select Download. You get a .zip file with the data fields you selected.

Upload data to an HDInsight cluster

There are many ways to upload data to the storage associated with an HDInsight cluster. In this section, you use scp to upload data. To learn about other ways to upload data, see Upload data to HDInsight.

  1. Upload the .zip file to the HDInsight cluster head node. Edit the command below by replacing FILENAME with the name of the .zip file, and CLUSTERNAME with the name of the HDInsight cluster. Then open a command prompt, set your working directory to the file location, and enter the command.

    scp FILENAME.zip sshuser@CLUSTERNAME-ssh.azurehdinsight.cn:FILENAME.zip
    

    If prompted to confirm the host's authenticity, enter yes to continue. When prompted for the SSH user's password, note that the text isn't visible in the window as you type.

  2. After the upload has finished, connect to the cluster by using SSH. Edit the command below by replacing CLUSTERNAME with the name of the HDInsight cluster. Then enter the following command:

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  3. Once an SSH connection has been established, set up the environment variables used by the remaining commands. Replace FILE_NAME, SQL_SERVERNAME, SQL_DATABASE, SQL_USER, and SQL_PASWORD with the appropriate values, and then enter the command:

    export FILENAME=FILE_NAME
    export SQLSERVERNAME=SQL_SERVERNAME
    export DATABASE=SQL_DATABASE
    export SQLUSER=SQL_USER
    export SQLPASWORD='SQL_PASWORD'
    
  4. Unzip the .zip file by entering the command below:

    unzip $FILENAME.zip
    
  5. Create a directory on HDInsight storage, and then copy the .csv file to the directory by entering the command below:

    hdfs dfs -mkdir -p /tutorials/flightdelays/data
    hdfs dfs -put $FILENAME.csv /tutorials/flightdelays/data/
    

Transform data using a Hive query

There are many ways to run a Hive job on an HDInsight cluster. In this section, you use Beeline to run a Hive job. For information on other methods of running a Hive job, see Use Apache Hive on HDInsight.

As part of the Hive job, you import the data from the .csv file into a Hive table named delays.
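In the downloaded CSV, string fields are wrapped in double quotes, and city names such as "San Diego, CA" contain an embedded comma, so a plain comma split yields an extra column — which appears to be why the external table defines the *_TEMP columns and why the Hive script strips characters with substring. A minimal Python sketch of that cleanup, using made-up sample values:

```python
# Sketch of the quote cleanup the Hive script performs with substring().
# The BTS CSV quotes its string fields, and a naive split on ',' also
# splits a quoted city name like "San Diego, CA" into two pieces (which
# appears to be why the external table defines the *_TEMP columns).
# All sample values here are made up for illustration.

def strip_quotes(field: str) -> str:
    """Remove surrounding quote characters and stray spaces from a field."""
    return field.strip('"').strip()

line = '2019,"2019-01-01","AA","San Diego, CA","CA"'
fields = line.split(",")   # the quoted city name becomes two fields
print([strip_quotes(f) for f in fields])
# → ['2019', '2019-01-01', 'AA', 'San Diego', 'CA', 'CA']
```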

  1. From the SSH prompt that you already have for the HDInsight cluster, use the following command to create and edit a new file named flightdelays.hql:

    nano flightdelays.hql
    
  2. Use the following text as the contents of this file:

    DROP TABLE delays_raw;
    -- Creates an external table over the csv file
    CREATE EXTERNAL TABLE delays_raw (
        YEAR string,
        FL_DATE string,
        UNIQUE_CARRIER string,
        CARRIER string,
        FL_NUM string,
        ORIGIN_AIRPORT_ID string,
        ORIGIN string,
        ORIGIN_CITY_NAME string,
        ORIGIN_CITY_NAME_TEMP string,
        ORIGIN_STATE_ABR string,
        DEST_AIRPORT_ID string,
        DEST string,
        DEST_CITY_NAME string,
        DEST_CITY_NAME_TEMP string,
        DEST_STATE_ABR string,
        DEP_DELAY_NEW float,
        ARR_DELAY_NEW float,
        CARRIER_DELAY float,
        WEATHER_DELAY float,
        NAS_DELAY float,
        SECURITY_DELAY float,
        LATE_AIRCRAFT_DELAY float)
    -- The following lines describe the format and location of the file
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/tutorials/flightdelays/data';
    
    -- Drop the delays table if it exists
    DROP TABLE delays;
    -- Create the delays table and populate it with data
    -- pulled in from the CSV file (via the external table defined previously)
    CREATE TABLE delays AS
    SELECT YEAR AS year,
        FL_DATE AS flight_date,
        substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS unique_carrier,
        substring(CARRIER, 2, length(CARRIER) -1) AS carrier,
        substring(FL_NUM, 2, length(FL_NUM) -1) AS flight_num,
        ORIGIN_AIRPORT_ID AS origin_airport_id,
        substring(ORIGIN, 2, length(ORIGIN) -1) AS origin_airport_code,
        substring(ORIGIN_CITY_NAME, 2) AS origin_city_name,
        substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1)  AS origin_state_abr,
        DEST_AIRPORT_ID AS dest_airport_id,
        substring(DEST, 2, length(DEST) -1) AS dest_airport_code,
        substring(DEST_CITY_NAME,2) AS dest_city_name,
        substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS dest_state_abr,
        DEP_DELAY_NEW AS dep_delay_new,
        ARR_DELAY_NEW AS arr_delay_new,
        CARRIER_DELAY AS carrier_delay,
        WEATHER_DELAY AS weather_delay,
        NAS_DELAY AS nas_delay,
        SECURITY_DELAY AS security_delay,
        LATE_AIRCRAFT_DELAY AS late_aircraft_delay
    FROM delays_raw;
    
  3. To save the file, press Ctrl+X, press Y, and then press Enter.

  4. To start Hive and run the flightdelays.hql file, use the following command:

    beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -f flightdelays.hql
    
  5. After the flightdelays.hql script finishes running, use the following command to open an interactive Beeline session:

    beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'
    
  6. When you receive the jdbc:hive2://localhost:10001/> prompt, use the following query to summarize the imported flight delay data and write the results to storage:

    INSERT OVERWRITE DIRECTORY '/tutorials/flightdelays/output'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    SELECT regexp_replace(origin_city_name, '''', ''),
        avg(weather_delay)
    FROM delays
    WHERE weather_delay IS NOT NULL
    GROUP BY origin_city_name;
    

    This query retrieves a list of cities that experienced weather delays, along with the average delay time, and saves it to /tutorials/flightdelays/output. Later, Sqoop reads the data from this location and exports it to Azure SQL Database.

  7. To exit Beeline, enter !quit at the prompt.
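The INSERT OVERWRITE query in step 6 can be mirrored in plain Python to make the aggregation concrete: group by origin city, average the weather delay while skipping NULLs, and emit tab-separated lines. This is only an illustrative sketch with made-up rows, not part of the tutorial's pipeline:

```python
# Mirrors the Hive query: AVG(weather_delay) GROUP BY origin_city_name,
# skipping NULL delays (WHERE weather_delay IS NOT NULL), printed as
# tab-separated lines like those in /tutorials/flightdelays/output.
# Sample rows are made up for illustration.
from collections import defaultdict

def avg_weather_delay(rows):
    """rows: iterable of (origin_city_name, weather_delay or None)."""
    by_city = defaultdict(list)
    for city, delay in rows:
        if delay is not None:          # WHERE weather_delay IS NOT NULL
            by_city[city].append(delay)
    return {city: sum(d) / len(d) for city, d in by_city.items()}

sample = [("San Diego", 10.0), ("San Diego", 20.0),
          ("Chicago", 5.0), ("Boston", None)]
for city, avg in sorted(avg_weather_delay(sample).items()):
    print(f"{city}\t{avg}")            # tab-separated, like the Hive output
```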

Create a SQL database table

There are many ways to connect to SQL Database and create a table. The following steps use FreeTDS from the HDInsight cluster.

  1. To install FreeTDS, use the following command from the open SSH connection to the cluster:

    sudo apt-get --assume-yes install freetds-dev freetds-bin
    
  2. After the installation finishes, use the following command to connect to SQL Database:

    TDSVER=8.0 tsql -H $SQLSERVERNAME.database.chinacloudapi.cn -U $SQLUSER -p 1433 -D $DATABASE -P $SQLPASWORD
    

    You receive output similar to the following text:

    locale is "en_US.UTF-8"
    locale charset is "UTF-8"
    using default charset "UTF-8"
    Default database being set to <yourdatabase>
    1>
    
  3. 1> 提示符下,输入以下行:At the 1> prompt, enter the following lines:

    CREATE TABLE [dbo].[delays](
    [origin_city_name] [nvarchar](50) NOT NULL,
    [weather_delay] float,
    CONSTRAINT [PK_delays] PRIMARY KEY CLUSTERED
    ([origin_city_name] ASC))
    GO
    

    When the GO statement is entered, the previous statements are evaluated. These statements create a table named delays with a clustered index.

    Use the following query to verify that the table has been created:

    SELECT * FROM information_schema.tables
    GO
    

    The output is similar to the following text:

    TABLE_CATALOG   TABLE_SCHEMA    TABLE_NAME      TABLE_TYPE
    databaseName       dbo             delays        BASE TABLE
    
  4. 1> 提示符下输入 exit 以退出 tsql 实用工具。Enter exit at the 1> prompt to exit the tsql utility.

Export data to SQL database using Apache Sqoop

In the previous sections, you copied the transformed data to /tutorials/flightdelays/output. In this section, you use Sqoop to export the data from /tutorials/flightdelays/output to the table you created in the Azure SQL database.
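Conceptually, sqoop export turns each tab-delimited line in the export directory into one row of the target table. A hedged Python sketch of that mapping follows; the sample lines and literal SQL text are illustrative only (Sqoop actually performs batched JDBC inserts):

```python
# Sketch of the mapping `sqoop export` performs: one tab-delimited line
# from /tutorials/flightdelays/output becomes one row in dbo.delays.
# Sample lines are made up; real Sqoop uses batched JDBC inserts rather
# than literal SQL strings.

def parse_export_line(line: str) -> tuple[str, float]:
    """Split a '<city>\t<avg_delay>' export line into typed values."""
    city, delay = line.rstrip("\n").split("\t")
    return city, float(delay)

for line in ["San Diego\t15.0\n", "Chicago\t5.0\n"]:
    city, delay = parse_export_line(line)
    print(f"INSERT INTO dbo.delays (origin_city_name, weather_delay) "
          f"VALUES ('{city}', {delay});")
```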

  1. Verify that Sqoop can see your SQL database by entering the command below:

    sqoop list-databases --connect jdbc:sqlserver://$SQLSERVERNAME.database.chinacloudapi.cn:1433 --username $SQLUSER --password $SQLPASWORD
    

    This command returns a list of databases, including the database in which you created the delays table earlier.

  2. Export data from /tutorials/flightdelays/output to the delays table by entering the command below:

    sqoop export --connect "jdbc:sqlserver://$SQLSERVERNAME.database.chinacloudapi.cn:1433;database=$DATABASE" --username $SQLUSER --password $SQLPASWORD --table 'delays' --export-dir '/tutorials/flightdelays/output' --fields-terminated-by '\t' -m 1
    

    Sqoop connects to the database that contains the delays table, and exports data from the /tutorials/flightdelays/output directory to the delays table.

  3. After the sqoop command finishes, use the tsql utility to connect to the database by entering the command below:

    TDSVER=8.0 tsql -H $SQLSERVERNAME.database.chinacloudapi.cn -U $SQLUSER -p 1433 -D $DATABASE -P $SQLPASWORD
    

    Use the following statements to verify that the data was exported to the delays table:

    SELECT * FROM delays
    GO
    

    You should see a listing of data in the table. The table includes the city name and the average flight delay time for that city.

    Type exit to exit the tsql utility.

Clean up resources

After you complete the tutorial, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster, even when it isn't in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use.

To delete a cluster, see Delete an HDInsight cluster using your browser, PowerShell, or the Azure CLI.

Next steps

In this tutorial, you took a raw CSV data file, imported it into HDInsight cluster storage, and then transformed the data using Interactive Query in Azure HDInsight. Advance to the next tutorial to learn about the Apache Hive Warehouse Connector.