How to use Apache Hive replication in Azure HDInsight clusters

In the context of databases and warehouses, replication is the process of duplicating entities from one warehouse to another. Replication can apply to an entire database or to a smaller unit, such as a table or partition. The objective is to have a replica that changes whenever the base entity changes. Replication on Apache Hive focuses on disaster recovery and offers unidirectional primary-copy replication. In HDInsight clusters, Hive replication can be used to unidirectionally replicate the Hive metastore and the associated underlying data lake on Azure Data Lake Storage Gen2.

Hive replication has evolved over the years, with newer versions providing better functionality while being faster and less resource intensive. In this article, we discuss Hive Replication (Replv2), which is supported in both the HDInsight 3.6 and HDInsight 4.0 cluster types.

Advantages of Replv2

Hive ReplicationV2 (Replv2) has the following advantages over the first version of Hive replication, which used Hive IMPORT-EXPORT:

  • Event-based incremental replication
  • Point-in-time replication
  • Reduced bandwidth requirements
  • Fewer intermediate copies
  • Replication state is maintained
  • Constrained replication
  • Support for a hub-and-spoke model
  • Support for ACID tables (in HDInsight 4.0)

Replication phases

Hive event-based replication is configured between the primary and secondary clusters. This replication consists of two distinct phases: bootstrapping and incremental runs.

Bootstrapping

Bootstrapping runs once to replicate the base state of the databases from the primary to the secondary. If needed, you can configure bootstrapping to include only a subset of the tables in the target database for which replication is enabled.

Incremental runs

After bootstrapping, incremental runs are automated on the primary cluster, and the events generated during these runs are played back on the secondary cluster. When the secondary cluster catches up with the primary, it becomes consistent with the primary's events.

Replication commands

Hive offers a set of REPL commands (DUMP, LOAD, and STATUS) to orchestrate the flow of events. The DUMP command generates a local log of all DDL/DML events on the primary cluster. The LOAD command lazily copies the metadata and data recorded in the extracted replication dump output, and is executed on the target cluster. The STATUS command runs on the target cluster and reports the latest event ID that the most recent replication load successfully replicated.
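
A typical cycle chains these three commands across the two clusters. The following is a minimal sketch of that flow; the database name salesdb and the dump path are illustrative placeholders, not output from a real run.

-- On the primary cluster: dump the database's events.
-- (Assumes salesdb has repl.source.for set; see the next section.)
repl dump salesdb;
-- The command returns a dump_dir and a last_repl_id.

-- On the target cluster: apply the dump from the returned location.
repl load salesdb from '/tmp/hive/repl/<dump_dir_id>';

-- On the target cluster: confirm the last event ID that was applied.
repl status salesdb;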

Set the replication source

Before you start replication, ensure that the database to be replicated is set as the replication source. You can use the DESC DATABASE EXTENDED <db_name> command to determine whether the repl.source.for parameter is set with the policy name.

If the policy is scheduled but the repl.source.for parameter isn't set, you first need to set the parameter using ALTER DATABASE <db_name> SET DBPROPERTIES ('repl.source.for'='<policy_name>').

ALTER DATABASE tpcds_orc SET DBPROPERTIES ('repl.source.for'='replpolicy1') 
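
To verify that the property took effect, you can inspect the database with the DESC DATABASE EXTENDED command mentioned above; the output shown in the comment is illustrative.

DESC DATABASE EXTENDED tpcds_orc;
-- The properties field should now include {repl.source.for=replpolicy1}.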

Dump metadata to the data lake

The REPL DUMP [database name] command is used in the bootstrap phase to dump the relevant metadata to Azure Data Lake Storage Gen2, and it returns a dump location and an event_id. The event_id identifies the latest event up to which the relevant metadata has been put in Azure Data Lake Storage Gen2.

repl dump tpcds_orc; 

Example output:

dump_dir                                              last_repl_id
/tmp/hive/repl/38896729-67d5-41b2-90dc-46eeed4c5dd0   2925

Load data to the target cluster

The REPL LOAD [database name] FROM [location] { WITH ('key1'='value1' {, 'key2'='value2'}) } command is used to load data into the target cluster for both the bootstrap and incremental phases of replication. The [database name] can be the same as the source or a different name on the target cluster. The [location] comes from the output of the earlier REPL DUMP command, which means the target cluster must be able to talk to the source cluster. The WITH clause was added primarily so that configuration can be passed in with the command, allowing replication without a restart of the target cluster.

repl load tpcds_orc from '/tmp/hive/repl/38896729-67d5-41b2-90dc-46eeed4c5dd0'; 
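
If you need to pass configuration at load time, the WITH clause lets you supply it with the command instead of changing cluster configuration and restarting. Here's a sketch that passes hive.exec.parallel, a standard Hive setting, purely as an illustration; whether a given override is appropriate depends on your workload.

-- Illustrative only: pass a Hive config with the load instead of restarting.
repl load tpcds_orc from '/tmp/hive/repl/38896729-67d5-41b2-90dc-46eeed4c5dd0' with ('hive.exec.parallel'='true');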

Output the last replicated event ID

The REPL STATUS [database name] command is executed on the target cluster and outputs the last replicated event_id. The command also lets users know the state to which their target cluster has been replicated. You can use the output of this command to construct the next REPL DUMP command for incremental replication.

repl status tpcds_orc;

Example output:

last_repl_id
2925

Dump relevant data and metadata to the data lake

The REPL DUMP [database name] FROM [event-id] { TO [event-id] } { LIMIT [number of events] } command is used to dump the relevant metadata and data to Azure Data Lake Storage. This command is used in the incremental phase and is run on the source warehouse. The FROM [event-id] is required for the incremental phase, and the value of event-id can be derived by running the REPL STATUS [database name] command on the target warehouse.

repl dump tpcds_orc from 2925;

Example output:

dump_dir                                              last_repl_id
/tmp/hive/repl/38896729-67d5-41b2-90dc-466466agadd0   2960
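
The optional TO and LIMIT clauses bound an incremental dump, which can be useful for replaying a large event backlog in controlled batches. A sketch using the syntax above; the event IDs and batch size are illustrative:

-- Dump at most 100 events, starting after event 2925 and not going past event 3100.
repl dump tpcds_orc from 2925 to 3100 limit 100;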

Hive replication process

The following steps are the sequential events that take place during the Hive replication process.

  1. Ensure that the tables to be replicated are set as the replication source for a certain policy.

  2. The REPL_DUMP command is issued to the primary cluster with associated constraints like the database name, event ID range, and Azure Data Lake Storage Gen2 storage URL.

  3. The system serializes a dump of all tracked events from the metastore, up to the latest one. This dump is stored in the Azure Data Lake Storage Gen2 storage account of the primary cluster, at the URL specified by REPL_DUMP.

  4. The primary cluster persists the replication metadata to the primary cluster's Azure Data Lake Storage Gen2 storage. The path is configurable in the Hive Config UI in Ambari. The process provides the path where the metadata is stored and the ID of the latest tracked DML/DDL event.

  5. The REPL_LOAD command is issued from the secondary cluster. The command points to the path configured in Step 3.

  6. The secondary cluster reads the metadata file with tracked events that was created in Step 3. Ensure that the secondary cluster has network connectivity to the Azure Data Lake Storage Gen2 storage of the primary cluster where the tracked events from REPL_DUMP are stored.

  7. The secondary cluster spawns distributed copy (DistCp) compute.

  8. The secondary cluster copies data from the primary cluster's storage.

  9. The metastore on the secondary cluster is updated.

  10. The last tracked event ID is stored in the primary metastore.

Incremental replication follows the same process and requires the last replicated event ID as input, which results in an incremental copy of the changes since the last replication event. Incremental replications are normally automated with a pre-determined frequency to achieve the required recovery point objective (RPO).

Hive replication diagram

Replication patterns

Replication is normally configured unidirectionally between the primary and the secondary, where the primary caters to read and write requests. The secondary cluster caters to read requests only. Writes are allowed on the secondary if there's a disaster, but reverse replication back to the primary then needs to be configured.

Hive replication patterns diagram

There are many patterns suitable for Hive replication, including Primary – Secondary, Hub and Spoke, and Relay.

In HDInsight, Active Primary – Standby Secondary is a common business continuity and disaster recovery (BCDR) pattern, and HiveReplicationV2 can use this pattern with regionally separated HDInsight Hadoop clusters connected through VNet peering. A common virtual machine peered to both clusters can be used to host the replication automation scripts. For more information about possible HDInsight BCDR patterns, refer to the HDInsight business continuity documentation.

Hive replication with Enterprise Security Package

In cases where Hive replication is planned on HDInsight Hadoop clusters with the Enterprise Security Package, you have to factor in the replication mechanisms for the Ranger metastore and Azure Active Directory Domain Services (Azure AD DS).

Use the Azure AD DS replica sets feature to create more than one Azure AD DS replica set per Azure AD tenant across multiple regions. Each individual replica set needs to be peered with the HDInsight VNets in its respective region. In this configuration, changes to Azure AD DS, including configuration, user identity and credentials, groups, group policy objects, computer objects, and other changes, are applied to all replica sets in the managed domain through Azure AD DS replication.

Ranger policies can be periodically backed up and replicated from the primary to the secondary using the Ranger import-export functionality. You can choose to replicate all or a subset of the Ranger policies, depending on the level of authorization you're seeking to implement on the secondary cluster.

Sample code

The following code sequence provides an example of how bootstrapping and incremental replication can be implemented on a sample database called tpcds_orc.

  1. Set the database as the source for a replication policy.

    ALTER DATABASE tpcds_orc SET DBPROPERTIES ('repl.source.for'='replpolicy1');
    
  2. Bootstrap dump at the primary cluster.

    repl dump tpcds_orc with ('hive.repl.rootdir'='/tmpag/hiveag/replag'); 
    

    Example output:

    dump_dir                                                    last_repl_id
    /tmpag/hiveag/replag/675d1bea-2361-4cad-bcbf-8680d305a27a   2925
  3. Bootstrap load at the secondary cluster.

    repl load tpcds_orc from '/tmpag/hiveag/replag/675d1bea-2361-4cad-bcbf-8680d305a27a'; 
    
  4. Check the REPL status at the secondary cluster.

    repl status tpcds_orc; 
    
    last_repl_id
    2925
  5. Incremental dump at the primary cluster.

    repl dump tpcds_orc from 2925 with ('hive.repl.rootdir'='/tmpag/hiveag/replag');
    

    Example output:

    dump_dir                                                    last_repl_id
    /tmpag/hiveag/replag/31177ff7-a40f-4f67-a613-3b64ebe3bb31   2960
  6. Incremental load at the secondary cluster.

    repl load tpcds_orc from '/tmpag/hiveag/replag/31177ff7-a40f-4f67-a613-3b64ebe3bb31';
    
  7. Check the REPL status at the secondary cluster.

    repl status tpcds_orc;
    
    last_repl_id
    2960

Next steps

To learn more about the items discussed in this article, see: