Set up backup and replication for Apache HBase and Apache Phoenix on HDInsight

Apache HBase supports several approaches for guarding against data loss:

  • Copy the hbase folder
  • Export then Import
  • Copy tables
  • Snapshots
  • Replication


Apache Phoenix stores its metadata in HBase tables, so that metadata is backed up when you back up the HBase system catalog tables.

The following sections describe the usage scenario for each of these approaches.

Copy the hbase folder

With this approach, you copy all HBase data and cannot select a subset of tables or column families. The approaches described later provide finer control.

HBase in HDInsight uses the default storage selected when creating the cluster, either Azure Storage blobs or Azure Data Lake Storage. In either case, HBase stores its data and metadata files under the following path:

  • In an Azure Storage account, the hbase folder resides at the root of the blob container:

  • In Azure Data Lake Storage, the hbase folder resides under the root path you specified when provisioning the cluster. This root path typically has a clusters folder, with a subfolder named after your HDInsight cluster:


In either case, the hbase folder contains all the data that HBase has flushed to disk, but it may not contain the in-memory data. You must shut down the cluster before you can rely on this folder as an accurate representation of the HBase data.

After you delete the cluster, you can either leave the data in place or copy the data to a new location:

  • Create a new HDInsight instance pointing to the current storage location. The new instance is created with all the existing data.

  • Copy the hbase folder to a different Azure Storage blob container or Data Lake Storage location, and then start a new cluster with that data. For Azure Storage, use AzCopy.

Export then Import

On the source HDInsight cluster, use the Export utility (included with HBase) to export data from a source table to the default attached storage. You can then copy the exported folder to the destination storage location and run the Import utility on the destination HDInsight cluster.

To export table data, first SSH into the head node of your source HDInsight cluster, and then run the following hbase command:

hbase org.apache.hadoop.hbase.mapreduce.Export "<tableName>" "/<path>/<to>/<export>"

The export directory must not already exist. The table name is case-sensitive.

To import table data, SSH into the head node of your destination HDInsight cluster, and then run the following hbase command:

hbase org.apache.hadoop.hbase.mapreduce.Import "<tableName>" "/<path>/<to>/<export>"

The table must already exist.

Specify the full export path to the default storage or to any of the attached storage options. For example, in Azure Storage:


In Azure Data Lake Storage Gen2, the syntax is:


This approach offers table-level granularity. You can also specify a date range for the rows to include, which allows you to perform the process incrementally. Each date is expressed in milliseconds since the Unix epoch.

hbase org.apache.hadoop.hbase.mapreduce.Export "<tableName>" "/<path>/<to>/<export>" <numberOfVersions> <startTimeInMS> <endTimeInMS>

Note that you have to specify the number of versions of each row to export. To include all versions in the date range, set <numberOfVersions> to a value greater than your maximum possible row versions, such as 100000.
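
As a rough sketch of an incremental export, the millisecond bounds can be derived from calendar dates before calling Export. The helper names `to_epoch_ms` and `export_window_cmd`, the table name, and the export path are illustrative assumptions, and the conversion relies on GNU `date` (the `-d` option) being available on the head node. The second function only prints the command so that it can be reviewed before it is run.

```shell
# Convert a UTC date string to milliseconds since the Unix epoch.
# Assumes GNU date (the -d option); to_epoch_ms is a hypothetical helper name.
to_epoch_ms() {
  secs=$(date -u -d "$1" +%s) || return 1
  echo "$((secs * 1000))"
}

# Print (rather than run) an Export command for one time window.
# Arguments: table, export path, number of versions, start date, end date.
export_window_cmd() {
  start_ms=$(to_epoch_ms "$4") || return 1
  end_ms=$(to_epoch_ms "$5") || return 1
  echo "hbase org.apache.hadoop.hbase.mapreduce.Export \"$1\" \"$2\" $3 $start_ms $end_ms"
}
```

For example, `export_window_cmd Contacts /export/contacts-day1 100000 '2024-01-01' '2024-01-02'` prints the Export command for a one-day window. Remember that each export directory must not already exist, so every window needs its own path.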

Copy tables

The CopyTable utility copies data from a source table, row by row, to an existing destination table with the same schema as the source. The destination table can be on the same cluster or on a different HBase cluster. The table names are case-sensitive.

To use CopyTable within a cluster, SSH into the head node of your source HDInsight cluster, and then run this hbase command:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=<destTableName> <srcTableName>

To use CopyTable to copy to a table on a different cluster, add the peer switch with the destination cluster's address:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=<destTableName> --peer.adr=<destinationAddress> <srcTableName>

The destination address is composed of the following three parts:

<destinationAddress> = <ZooKeeperQuorum>:<Port>:<ZnodeParent>

  • <ZooKeeperQuorum> is a comma-separated list of Apache ZooKeeper nodes, for example:


  • <Port> on HDInsight defaults to 2181, and <ZnodeParent> is /hbase-unsecure, so the complete <destinationAddress> would be:


See Manually collect the Apache ZooKeeper quorum list in this article for details on how to retrieve these values for your HDInsight cluster.
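
The three-part address can be assembled with a small helper, shown here as a minimal sketch. The function name `build_peer_address` and the example host names are hypothetical. The default port and znode parent are the HDInsight values given in this section.

```shell
# Join a ZooKeeper quorum, port, and znode parent into a CopyTable peer address.
# build_peer_address is a hypothetical helper; the defaults are the HDInsight
# values described in this section (port 2181, znode parent /hbase-unsecure).
build_peer_address() {
  quorum="$1"
  port="${2:-2181}"
  znode="${3:-/hbase-unsecure}"
  echo "${quorum}:${port}:${znode}"
}
```

For example, `build_peer_address "zk0.example.internal,zk1.example.internal,zk2.example.internal"` prints a value that can be passed to `--peer.adr`. Replace the placeholder hosts with the quorum of your destination cluster.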

The CopyTable utility also supports parameters to specify the time range of rows to copy, and to specify the subset of column families in a table to copy. To see the complete list of parameters supported by CopyTable, run CopyTable without any parameters:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable

CopyTable scans the entire source table content that will be copied to the destination table. This scan may reduce your HBase cluster's performance while CopyTable executes.
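
One way to limit the scan is to restrict the copy to a time range and to selected column families. The sketch below prints such a command rather than running it. The function name `copytable_range_cmd` and the example values are hypothetical, and the `--starttime`, `--endtime`, and `--families` switches should be checked against the usage output of your HBase version, which you get by running CopyTable without parameters.

```shell
# Print a CopyTable command limited to a time range and a set of column families.
# Arguments: source table, destination table, start time (ms), end time (ms),
#            comma-separated column families.
# copytable_range_cmd is a hypothetical helper; it only builds the command text.
copytable_range_cmd() {
  echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=$3 --endtime=$4 --families=$5 --new.name=$2 $1"
}
```

For example, `copytable_range_cmd Contacts ContactsCopy 1704067200000 1704153600000 Personal,Office` prints a command that copies one day of changes for two column families.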


To automate the copying of data between tables, see the hdi_copy_table.sh script in the Azure HBase Utils repository on GitHub.

Manually collect the Apache ZooKeeper quorum list

When both HDInsight clusters are in the same virtual network, as described previously, internal host name resolution is automatic. To use CopyTable for HDInsight clusters in two separate virtual networks connected by a VPN gateway, you need to provide the host IP addresses of the ZooKeeper nodes in the quorum.

To acquire the quorum host names, run the following curl command:

    curl -u admin:<password> -X GET -H "X-Requested-By: ambari" "https://<clusterName>.azurehdinsight.cn/api/v1/clusters/<clusterName>/configurations?type=hbase-site&tag=TOPOLOGY_RESOLVED" | grep "hbase.zookeeper.quorum"

The curl command retrieves a JSON document with HBase configuration information, and the grep command returns only the "hbase.zookeeper.quorum" entry, for example:

    "hbase.zookeeper.quorum" : "zk0-hdizc2.54o2oqawzlwevlfxgay2500xtg.dx.internal.chinacloudapp.cn,zk4-hdizc2.54o2oqawzlwevlfxgay2500xtg.dx.internal.chinacloudapp.cn,zk3-hdizc2.54o2oqawzlwevlfxgay2500xtg.dx.internal.chinacloudapp.cn"

The quorum host names value is the entire quoted string to the right of the colon.
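
If you want to script this step, the value can be pulled out of the grep output with sed and then split into individual hosts. This is a minimal sketch: `extract_quorum` and `split_quorum` are hypothetical helper names, and the pattern assumes the entry appears on a single line in the form shown above.

```shell
# Read lines on standard input and print the quoted value of the
# "hbase.zookeeper.quorum" entry, for example from:
#   "hbase.zookeeper.quorum" : "host1,host2,host3"
extract_quorum() {
  sed -n 's/.*"hbase\.zookeeper\.quorum"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Print one host name per line from a comma-separated quorum string.
split_quorum() {
  echo "$1" | tr ',' '\n'
}
```

You could pipe the output of the curl command directly into `extract_quorum` in place of grep, and then loop over the result of `split_quorum` to look up each host.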

To retrieve the IP addresses for these hosts, use the following curl command for each host in the previous list:

    curl -u admin:<password> -X GET -H "X-Requested-By: ambari" "https://<clusterName>.azurehdinsight.cn/api/v1/clusters/<clusterName>/hosts/<zookeeperHostFullName>" | grep "ip"

In this curl command, <zookeeperHostFullName> is the full DNS name of a ZooKeeper host, such as zk0-hdizc2.54o2oqawzlwevlfxgay2500xtg.dx.internal.chinacloudapp.cn. The output of the command contains the IP address for the specified host, for example:

100 "ip" : "",

After you collect the IP addresses for all ZooKeeper nodes in your quorum, rebuild the destination address:

<destinationAddress> = <Host_1_IP>,<Host_2_IP>,<Host_3_IP>:<Port>:<ZnodeParent>

In our example:

<destinationAddress> =,,


Snapshots

Snapshots enable you to take a point-in-time backup of data in your HBase datastore. Snapshots have minimal overhead and complete within seconds, because a snapshot operation is effectively a metadata operation that captures the names of all files in storage at that instant. At the time of a snapshot, no actual data is copied. Snapshots rely on the immutable nature of the data stored in HDFS, where updates, deletes, and inserts are all represented as new data. You can restore (clone) a snapshot on the same cluster, or export a snapshot to another cluster.

To create a snapshot, SSH into the head node of your HDInsight HBase cluster and start the hbase shell:

hbase shell

Within the hbase shell, use the snapshot command with the names of the table and of this snapshot:

snapshot '<tableName>', '<snapshotName>'

To restore a snapshot by name within the hbase shell, first disable the table, then restore the snapshot and re-enable the table:

disable '<tableName>'
restore_snapshot '<snapshotName>'
enable '<tableName>'

To restore a snapshot to a new table, use clone_snapshot:

clone_snapshot '<snapshotName>', '<newTableName>'
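
A scheduled backup could generate timestamped snapshot names instead of typing them by hand. The following is a minimal sketch and not part of the procedure above: `snapshot_cmd` is a hypothetical helper, the `<table>-<timestamp>` naming scheme is an assumption, and piping the result to `hbase shell -n` assumes that your HBase version supports the shell's non-interactive mode.

```shell
# Print an hbase shell snapshot command with a timestamped snapshot name.
# Arguments: table name, timestamp string (for example from `date -u +%Y%m%d%H%M%S`).
# Assumes the table is in the default namespace; a "namespace:table" name would
# need the colon replaced, because a snapshot name containing ":" may be rejected.
snapshot_cmd() {
  printf "snapshot '%s', '%s-%s'\n" "$1" "$1" "$2"
}
```

For example, from the SSH session (not from within the hbase shell): `snapshot_cmd Contacts "$(date -u +%Y%m%d%H%M%S)" | hbase shell -n`.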

To export a snapshot to HDFS for use by another cluster, first create the snapshot as described previously, and then use the ExportSnapshot utility. Run this utility from within the SSH session to the head node, not within the hbase shell:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshotName> -copy-to <hdfsHBaseLocation>

The <hdfsHBaseLocation> can be any of the storage locations accessible to your source cluster, and should point to the hbase folder used by your destination cluster. For example, if you have a secondary Azure Storage account attached to your source cluster, and that account provides access to the container used by the default storage of the destination cluster, you could use this command:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot 'Snapshot1' -copy-to 'wasbs://secondcluster@myaccount.blob.core.chinacloudapi.cn/hbase'

If you don't have a secondary Azure Storage account attached to your source cluster, or if your source cluster is an on-premises cluster (or a non-HDI cluster), you might experience authorization issues when you try to access the storage account of your HDI cluster. To resolve this issue, specify the key to your storage account as a command-line parameter, as shown in the following example. You can get the key to your storage account in the Azure portal.

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -Dfs.azure.account.key.myaccount.blob.core.chinacloudapi.cn=mykey -snapshot 'Snapshot1' -copy-to 'wasbs://secondcluster@myaccount.blob.core.chinacloudapi.cn/hbase'

After the snapshot is exported, SSH into the head node of the destination cluster and restore the snapshot by using the restore_snapshot command as described earlier.

Snapshots provide a complete backup of a table at the time of the snapshot command. Snapshots do not provide the ability to perform incremental snapshots by windows of time, nor to specify subsets of column families to include in the snapshot.


Replication

HBase replication automatically pushes transactions from a source cluster to a destination cluster, using an asynchronous mechanism with minimal overhead on the source cluster. In HDInsight, you can set up replication between clusters where:

  • The source and destination clusters are in the same virtual network.
  • The source and destination clusters are in different virtual networks connected by a VPN gateway, but both clusters exist in the same geographic location.
  • The source and destination clusters are in different virtual networks connected by a VPN gateway, and each cluster exists in a different geographic location.

The general steps to set up replication are:

  1. On the source cluster, create the tables and populate data.
  2. On the destination cluster, create empty destination tables with the source table's schema.
  3. Register the destination cluster as a peer to the source cluster.
  4. Enable replication on the desired source tables.
  5. Copy existing data from the source tables to the destination tables.
  6. Replication then automatically copies new modifications made to the source tables into the destination tables.

To enable replication on HDInsight, apply a Script Action to your running source HDInsight cluster. For a walkthrough of enabling replication in your cluster, or to experiment with replication on sample clusters created in virtual networks using Azure Resource Manager templates, see Configure Apache HBase replication. That article also includes instructions for enabling replication of Phoenix metadata.

Next steps