使用 Hive 查询在 Hive 表中浏览数据Explore data in Hive tables with Hive queries

本文提供了用于在 HDInsight Hadoop群集的 Hive 表中浏览数据的示例 Hive 脚本。This article provides sample Hive scripts that are used to explore data in Hive tables in an HDInsight Hadoop cluster.

此任务是团队数据科学过程中的一个步骤。This task is a step in the Team Data Science Process.

必备条件Prerequisites

本文假设用户具备以下条件:This article assumes that you have:

数据浏览的示例 Hive 查询脚本Example Hive query scripts for data exploration

  1. 获取每个分区的观测值计数 SELECT <partitionfieldname>, count(*) from <databasename>.<tablename> group by <partitionfieldname>;Get the count of observations per partition SELECT <partitionfieldname>, count(*) from <databasename>.<tablename> group by <partitionfieldname>;

  2. 获取每一天的观测值计数 SELECT to_date(<date_columnname>), count(*) from <databasename>.<tablename> group by to_date(<date_columnname>);Get the count of observations per day SELECT to_date(<date_columnname>), count(*) from <databasename>.<tablename> group by to_date(<date_columnname>);

  3. 获取某分类列中的级别Get the levels in a categorical column
    SELECT distinct <column_name> from <databasename>.<tablename>

  4. 获取两个分类列组合中的级别数 SELECT <column_a>, <column_b>, count(*) from <databasename>.<tablename> group by <column_a>, <column_b>Get the number of levels in combination of two categorical columns SELECT <column_a>, <column_b>, count(*) from <databasename>.<tablename> group by <column_a>, <column_b>

  5. 获取数字列的分布Get the distribution for numerical columns
    SELECT <column_name>, count(*) from <databasename>.<tablename> group by <column_name>

  6. 从连接的两个表中提取记录Extract records from joining two tables

    SELECT
        a.<common_columnname1> as <new_name1>,
        a.<common_columnname2> as <new_name2>,
        a.<a_column_name1> as <new_name3>,
        a.<a_column_name2> as <new_name4>,
        b.<b_column_name1> as <new_name5>,
        b.<b_column_name2> as <new_name6>
    FROM
        (
        SELECT <common_columnname1>,
            <common_columnname2>,
            <a_column_name1>,
            <a_column_name2>,
        FROM <databasename>.<tablename1>
        ) a
        join
        (
        SELECT <common_columnname1>,
            <common_columnname2>,
            <b_column_name1>,
            <b_column_name2>,
        FROM <databasename>.<tablename2>
        ) b
        ON a.<common_columnname1>=b.<common_columnname1> and a.<common_columnname2>=b.<common_columnname2>
    

用于出租车行程数据方案的其他查询脚本Additional query scripts for taxi trip data scenarios

GitHub 存储库中也提供了特定于 NYC Taxi Trip Data(纽约出租车行程数据)方案的查询示例。Examples of queries that are specific to NYC Taxi Trip Data scenarios are also provided in GitHub repository. 这些查询已具有指定的数据架构,并准备好提交以运行。These queries already have data schema specified and are ready to be submitted to run.