对 Azure HDInsight Hive 表中的数据进行采样Sample data in Azure HDInsight Hive tables

本文介绍如何使用 Hive 查询向下采样存储在 Azure HDInsight Hive 表中的数据,以将其减至对于分析更易于管理的大小。This article describes how to down-sample data stored in Azure HDInsight Hive tables using Hive queries to reduce it to a size more manageable for analysis. 将介绍三种常用的采样方法:It cover three popularly used sampling methods:

  • 统一随机采样Uniform random sampling
  • 按组随机采样Random sampling by groups
  • 分层采样Stratified sampling

为什么对数据进行采样?Why sample your data? 如果计划要分析的数据集很大,通常最好是对数据进行向下采样,以将数据减至较小但具备代表性且更易于管理的规模。If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. 向下采样有利于数据理解、探索和功能设计。Down-sampling facilitates data understanding, exploration, and feature engineering. 它在团队数据科学过程中的作用是启用数据处理功能和机器学习模型的快速原型设计。Its role in the Team Data Science Process is to enable fast prototyping of the data processing functions and machine learning models.

此采样任务是团队数据科学流程 (TDSP) 中的一个步骤。This sampling task is a step in the Team Data Science Process (TDSP).

如何提交 Hive 查询How to submit Hive queries

可从 Hadoop 群集头节点上的 Hadoop 命令行控制台中提交 Hive 查询。Hive queries can be submitted from the Hadoop Command-Line console on the head node of the Hadoop cluster. 登录到 Hadoop 群集的头节点,打开 Hadoop 命令行控制台,并从那里提交 Hive 查询。Log into the head node of the Hadoop cluster, open the Hadoop Command-Line console, and submit the Hive queries from there. 有关在 Hadoop 命令行控制台中提交 Hive 查询的说明,请参阅如何提交 Hive 查询For instructions on submitting Hive queries in the Hadoop Command-Line console, see How to Submit Hive Queries.

统一随机采样Uniform random sampling

统一随机采样意味着数据集中的每一行都具有相同的采样机会。Uniform random sampling means that each row in the data set has an equal chance of being sampled. 通过将额外字段 rand() 添加到内部“select”查询中以及以该随机字段为条件的外部“select”查询中的数据集,可实现它。It can be implemented by adding an extra field rand() to the data set in the inner "select" query, and in the outer "select" query that condition on that random field.

下面是一个示例查询:Here is an example query:

SET sampleRate=<sample rate, 0-1>;
select
    field1, field2, …, fieldN
from
    (
    select
        field1, field2, …, fieldN, rand() as samplekey
    from <hive table name>
    )a
where samplekey<='${hiveconf:sampleRate}'

此处,<sample rate, 0-1> 指定用户要采样的记录部分。Here, <sample rate, 0-1> specifies the proportion of records that the users want to sample.

按组随机采样Random sampling by groups

采样分类数据时,建议包括或排除分类变量的某些值的所有实例。When sampling categorical data, you may want to either include or exclude all of the instances for some value of the categorical variable. 这一类采样称为“按组采样”。This sort of sampling is called "sampling by group". 例如,如果有一个分类变量“State”,其包含值例如 NY、MA、CA、NJ 和 PA,无论是否进行采样,都希望相同州的记录在一起 。For example, if you have a categorical variable "State", which has values such as NY, MA, CA, NJ, and PA, you want records from each state to be together, whether they are sampled or not.

下面是按组采样的示例查询:Here is an example query that samples by group:

SET sampleRate=<sample rate, 0-1>;
select
    b.field1, b.field2, …, b.catfield, …, b.fieldN
from
    (
    select
        field1, field2, …, catfield, …, fieldN
    from <table name>
    )b
join
    (
    select
        catfield
    from
        (
        select
            catfield, rand() as samplekey
        from <table name>
        group by catfield
        )a
    where samplekey<='${hiveconf:sampleRate}'
    )c
on b.catfield=c.catfield

分层采样Stratified sampling

如果获取的样本具有的分类值与父填充中该分类值所呈现的比率相同,则针对分类变量对随机采样进行分层。Random sampling is stratified with respect to a categorical variable when the samples obtained have categorical values that are present in the same ratio as they were in the parent population. 同样使用与上述示例,假设数据具有按州分组的以下观察值:NJ 具有 100 个观察值,NY 具有 60 个观察值,WA 具有 300 个观察值。Using the same example as above, suppose your data has the following observations by states: NJ has 100 observations, NY has 60 observations, and WA has 300 observations. 如果指定分层采样率为 0.5,那么获取的 NJ、NY 和 WA 的样本应该分别具有大约 50、30 和 150 个观察值。If you specify the rate of stratified sampling to be 0.5, then the sample obtained should have approximately 50, 30, and 150 observations of NJ, NY, and WA respectively.

下面是一个示例查询:Here is an example query:

SET sampleRate=<sample rate, 0-1>;
select
    field1, field2, field3, ..., fieldN, state
from
    (
    select
        field1, field2, field3, ..., fieldN, state,
        count(*) over (partition by state) as state_cnt,
          rank() over (partition by state order by rand()) as state_rank
      from <table name>
    ) a
where state_rank <= state_cnt*'${hiveconf:sampleRate}'

有关 Hive 中可用的更多高级采样方法的信息,请参阅 LanguageManual Sampling(LanguageManual 采样)。For information on more advanced sampling methods that are available in Hive, see LanguageManual Sampling.