Create Hive tables and load data from Azure Blob Storage

This article presents generic Hive queries that create Hive tables and load data from Azure Blob Storage. It also provides some guidance on partitioning Hive tables and on using the Optimized Row Columnar (ORC) format to improve query performance.

Prerequisites

This article assumes that you have:

Upload data to Azure Blob Storage

If you created an Azure virtual machine by following the instructions provided in Set up an Azure virtual machine for advanced analytics, this script file should have been downloaded to the C:\Users\<user name>\Documents\Data Science Scripts directory on the virtual machine. These Hive queries only require you to plug in a data schema and an Azure Blob Storage configuration in the appropriate fields before they are ready for submission.

We assume that the data for the Hive tables is in an uncompressed tabular format, and that the data has been uploaded to the default (or to an additional) container of the storage account used by the Hadoop cluster.

If you want to practice on the NYC Taxi Trip Data, you need to:

  • download the 24 NYC Taxi Trip Data files (12 Trip files and 12 Fare files),
  • unzip all the files into .csv files, and then
  • upload them to the default (or another appropriate) container of the Azure storage account; options for such an account are described in the Use Azure storage with Azure HDInsight clusters topic. The process for uploading the .csv files to the default container of the storage account can be found on this page; a sketch of one way to do the upload appears after this list.
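
One hedged way to do the upload is with the AzCopy command-line tool. In the following sketch, the local path, storage account, container, and SAS token are placeholders that you must supply; the endpoint shown is the Azure China blob endpoint used elsewhere in this article:

    azcopy copy "C:\<local directory with the .csv files>\*.csv" "https://<storage account name>.blob.core.chinacloudapi.cn/<container name>?<SAS token>"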

How to submit Hive queries

Hive queries can be submitted by using:

  1. the Hadoop Command Line on the head node of the Hadoop cluster
  2. the Hive Editor
  3. Azure PowerShell commands

Hive queries are SQL-like. If you are familiar with SQL, you may find the Hive for SQL Users Cheat Sheet useful.

When you submit a Hive query, you can also control the destination of its output: the screen, a local file on the head node, or an Azure blob.

1. Submit Hive queries through the Hadoop Command Line on the head node of the Hadoop cluster

If a Hive query is complex, submitting it directly on the head node of the Hadoop cluster typically leads to faster turnaround than submitting it with the Hive Editor or Azure PowerShell scripts.

Log in to the head node of the Hadoop cluster, open the Hadoop Command Line on the desktop of the head node, and enter the command cd %hive_home%\bin.

There are three ways to submit Hive queries from the Hadoop Command Line:

  • directly
  • using '.hql' files
  • with the Hive command console

Submit Hive queries directly in the Hadoop Command Line

You can run a command like hive -e "<your hive query>;" to submit simple Hive queries directly in the Hadoop Command Line. Here is an example, where the red box outlines the command that submits the Hive query and the green box outlines the output from the Hive query.

(Figure: the command that submits a Hive query and the output from that query)
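
For instance, a minimal sketch (hivesampletable is the sample table that HDInsight clusters typically include; substitute any table that exists in your cluster):

    hive -e "select * from hivesampletable limit 10;"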

Submit Hive queries in '.hql' files

When a Hive query is more complicated and has multiple lines, editing queries in the command line or the Hive command console is not practical. An alternative is to use a text editor on the head node of the Hadoop cluster to save the Hive queries in a '.hql' file in a local directory of the head node. Then the Hive query in the '.hql' file can be submitted by using the -f argument, as follows:

hive -f "<path to the '.hql' file>"

(Figure: Hive queries in a '.hql' file)
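
As a minimal sketch, a '.hql' file can hold several statements that run in order; the database name below is illustrative:

    -- sample.hql: an illustrative multi-statement Hive script
    create database if not exists sampledb;
    use sampledb;
    show tables;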

Suppress the progress-status screen print of Hive queries

By default, after a Hive query is submitted in the Hadoop Command Line, the progress of the Map/Reduce job is printed on the screen. To suppress the screen print of the Map/Reduce job progress, you can use the argument -S ("S" in upper case) in the command line, as follows:

hive -S -f "<path to the '.hql' file>"
hive -S -e "<Hive queries>"

Submit Hive queries in the Hive command console

You can also first enter the Hive command console by running the command hive in the Hadoop Command Line, and then submit Hive queries in the Hive command console. Here is an example, in which the two red boxes highlight the command used to enter the Hive command console and the Hive query submitted in the Hive command console, respectively. The green box highlights the output from the Hive query.

(Figure: opening the Hive command console, entering a query, and viewing the Hive query output)

The previous examples output the Hive query results directly on the screen. You can also write the output to a local file on the head node or to an Azure blob, and then use other tools to further analyze the output of the Hive queries.

Output Hive query results to a local file. To output Hive query results to a local directory on the head node, submit the Hive query in the Hadoop Command Line as follows:

hive -e "<hive query>" > <local path in the head node>

In the following example, the output of the Hive query is written into the file hivequeryoutput.txt in the directory C:\apps\temp.

(Figure: output of the Hive query written to a local file)
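
For instance, a minimal sketch of such a command (the query itself is illustrative):

    hive -e "select * from hivesampletable limit 10;" > C:\apps\temp\hivequeryoutput.txt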

Output Hive query results to an Azure blob

You can also output the Hive query results to an Azure blob within the default container of the Hadoop cluster. The Hive query for this is as follows:

insert overwrite directory 'wasb:///<directory within the default container>' <select clause from ...>

In the following example, the output of the Hive query is written to the blob directory queryoutputdir within the default container of the Hadoop cluster. Here, you only need to provide the directory name, without the blob name. An error is thrown if you provide both a directory name and a blob name, such as wasb:///queryoutputdir/queryoutput.txt.

(Figure: output of the Hive query written to a blob directory)
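
As a minimal sketch of such a query (the field names are illustrative):

    insert overwrite directory 'wasb:///queryoutputdir'
    select field1, field2 from <database name>.<table name>;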

If you open the default container of the Hadoop cluster by using Azure Storage Explorer, you can see the output of the Hive query, as shown in the following figure. You can apply the filter (highlighted by the red box) to retrieve only the blobs with the specified letters in their names.

(Figure: Azure Storage Explorer showing the output of the Hive query)

2. Submit Hive queries with the Hive Editor

You can also use the Query Console (Hive Editor) by entering a URL of the form https://<Hadoop cluster name>.azurehdinsight.net/Home/HiveEditor into a web browser. You must be logged in to see this console, so you need your Hadoop cluster credentials here.

3. Submit Hive queries with Azure PowerShell commands

You can also use PowerShell to submit Hive queries. For instructions, see Submit Hive jobs using PowerShell.
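
As a hedged sketch of what this can look like with the Az.HDInsight PowerShell module (the module, cluster name, and query here are assumptions; the linked article is authoritative):

    # Assumes the Az.HDInsight module is installed and you are signed in to Azure.
    $clusterName = "<Hadoop cluster name>"   # placeholder
    $creds = Get-Credential                  # HTTP (admin) credentials of the cluster
    $jobDef = New-AzHDInsightHiveJobDefinition -Query "show tables;"
    $job = Start-AzHDInsightJob -ClusterName $clusterName -JobDefinition $jobDef -HttpCredential $creds
    # Block until the job completes.
    Wait-AzHDInsightJob -ClusterName $clusterName -JobId $job.JobId -HttpCredential $creds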

Create Hive database and tables

The Hive queries are shared in the GitHub repository and can be downloaded from there.

Here is the Hive query that creates a Hive table:

create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
    field1 string,
    field2 int,
    field3 float,
    field4 double,
    ...,
    fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' lines terminated by '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>' TBLPROPERTIES("skip.header.line.count"="1");

Here are descriptions of the fields that you need to plug in, and of the other configurations:

  • <database name>: the name of the database that you want to create. If you just want to use the default database, the create database... query can be omitted.
  • <table name>: the name of the table that you want to create within the specified database. If you want to use the default database, the table can be referred to directly by <table name>, without <database name>.
  • <field separator>: the separator that delimits fields in the data file to be uploaded to the Hive table.
  • <line separator>: the separator that delimits lines in the data file.
  • <storage location>: the Azure Storage location in which to save the data of the Hive tables. If you do not specify LOCATION <storage location>, the database and the tables are stored in the hive/warehouse/ directory in the default container of the Hive cluster by default. If you want to specify a storage location, it has to be within the default container for the database and tables. This location has to be referred to as a location relative to the default container of the cluster, in the format 'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', and so on. After the query is executed, the relative directories are created within the default container.
  • TBLPROPERTIES("skip.header.line.count"="1"): if the data file has a header line, you have to add this property at the end of the create table query. Otherwise, the header line is loaded as a record into the table. If the data file does not have a header line, this configuration can be omitted from the query. For a filled-in instance of the template, see the sketch after this list.
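
As a minimal sketch of the template filled in, for comma-separated data with a header line (the database, table, and field names are illustrative, not prescribed by this article):

    create database if not exists sampledb;
    CREATE EXTERNAL TABLE if not exists sampledb.sampletable
    (
        id string,
        value int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    STORED AS TEXTFILE LOCATION 'wasb:///sampledir/' TBLPROPERTIES("skip.header.line.count"="1");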

Load data to Hive tables

Here is the Hive query that loads data into a Hive table:

LOAD DATA INPATH '<path to blob data>' INTO TABLE <database name>.<table name>;

  • <path to blob data>: if the blob file to be uploaded to the Hive table is in the default container of the HDInsight Hadoop cluster, <path to blob data> should be in the format 'wasb:///<directory in this container>/<blob file name>'. The blob file can also be in an additional container of the HDInsight Hadoop cluster, in which case <path to blob data> should be in the format 'wasb://<container name>@<storage account name>.blob.core.chinacloudapi.cn/<blob file name>'.

    Note

    The blob data to be uploaded to the Hive table has to be in the default or an additional container of the storage account for the Hadoop cluster. Otherwise, the LOAD DATA query fails, complaining that it cannot access the data.
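
For instance, a minimal sketch of a load from the default container (the directory, file, database, and table names are illustrative):

    LOAD DATA INPATH 'wasb:///sampledir/sampledata.csv' INTO TABLE sampledb.sampletable;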

Advanced topics: partitioned tables and storing Hive data in ORC format

If the data is large, partitioning the table is beneficial for queries that only need to scan a few partitions of the table. For instance, it is reasonable to partition the log data of a web site by dates.

In addition to partitioning Hive tables, it is also beneficial to store the Hive data in the Optimized Row Columnar (ORC) format. For more information on the ORC format, see Using ORC files improves performance when Hive is reading, writing, and processing data.

Partitioned table

Here is the Hive query that creates a partitioned table and loads data into it:

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<table name>
(field1 string,
...
fieldN string
)
PARTITIONED BY (<partitionfieldname> vartype) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
    lines terminated by '<line separator>' TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.<partitioned table name>
    PARTITION (<partitionfieldname>=<partitionfieldvalue>);

When querying partitioned tables, it is recommended to add the partition condition at the beginning of the where clause, which improves the search efficiency:

select
    field1, field2, ..., fieldN
from <database name>.<partitioned table name>
where <partitionfieldname>=<partitionfieldvalue> and ...;
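
As a minimal end-to-end sketch with one illustrative date partition (all names and values here are illustrative, not prescribed by this article):

    CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.samplelog
    (logline string)
    PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n' TBLPROPERTIES("skip.header.line.count"="1");
    LOAD DATA INPATH 'wasb:///logs/2013-01.csv' INTO TABLE sampledb.samplelog
        PARTITION (logdate='2013-01');
    select logline from sampledb.samplelog where logdate='2013-01' limit 10;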

Store Hive data in ORC format

You cannot directly load data from blob storage into Hive tables stored in the ORC format. Here are the steps that you need to take to load data from Azure blobs into Hive tables stored in the ORC format:

1. Create an external table STORED AS TEXTFILE and load data from blob storage into the table:

CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table name>
(
    field1 string,
    field2 int,
    ...
    fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
    lines terminated by '<line separator>' STORED AS TEXTFILE
    LOCATION 'wasb:///<directory in Azure blob>' TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH '<path to the source file>' INTO TABLE <database name>.<table name>;

2. Create an internal table with the same schema as the external table in step 1, using the same field delimiter, and store the Hive data in the ORC format:

CREATE TABLE IF NOT EXISTS <database name>.<ORC table name>
(
    field1 string,
    field2 int,
    ...
    fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' STORED AS ORC;

3. Select data from the external table in step 1 and insert the results into the ORC table:

INSERT OVERWRITE TABLE <database name>.<ORC table name>
    SELECT * FROM <database name>.<external textfile table name>;

Note

If the TEXTFILE table <database name>.<external textfile table name> has partitions, in step 3 the SELECT * FROM <database name>.<external textfile table name> command selects the partition variable as a field in the returned data set. Inserting it into <database name>.<ORC table name> fails, because <database name>.<ORC table name> does not have the partition variable as a field in its table schema. In this case, you need to specifically select the fields to be inserted into <database name>.<ORC table name>, as follows:

INSERT OVERWRITE TABLE <database name>.<ORC table name> PARTITION (<partition variable>=<partition value>)
    SELECT field1, field2, ..., fieldN
    FROM <database name>.<external textfile table name>
    WHERE <partition variable>=<partition value>;

After all the data has been inserted into <database name>.<ORC table name>, it is safe to drop the <database name>.<external textfile table name> table by using the following query:

    DROP TABLE IF EXISTS <database name>.<external textfile table name>;

After following this procedure, you should have a table with data in the ORC format ready to use.