Migrate default Hive metastore DB to external metastore DB

This article shows how to migrate metadata from the default Hive metastore DB to an external SQL database on HDInsight.

Why migrate to external metastore DB

  • The default metastore DB is limited to a Basic SKU and can't handle production-scale workloads.

  • An external metastore DB lets you scale Hive compute resources horizontally by adding new HDInsight clusters that share the same metastore DB.

  • For HDInsight 3.6 to 4.0 migration, you must migrate metadata to an external metastore DB before upgrading the Hive schema version. See Migrate workloads from HDInsight 3.6 to HDInsight 4.0.

Because the default metastore DB has limited compute capacity, we recommend keeping utilization from other jobs on the cluster low while you migrate metadata.

The source and target DBs must use the same HDInsight version and the same storage accounts. If you're upgrading HDInsight from version 3.6 to 4.0, complete the steps in this article first, and then follow the official upgrade steps here.

Prerequisites

To apply the script action described in the following table, see Script action to a running cluster.

The action is similar to replacing symlinks with their full paths.
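For example (illustrative values only; the account, cluster, and table names below are assumptions), a table location stored relative to the cluster's default filesystem would be expanded to a fully qualified form that remains valid from other clusters:

# Before: location relative to the cluster's default filesystem
adl://home/hive/warehouse/salesdb.db/orders
# After: fully qualified location, portable across clusters
adl://myaccount.azuredatalakestore.net/clusters/mycluster/hive/warehouse/salesdb.db/orders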

Property         Value
Bash script URI  https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-adl-expand-location-v01.sh
Node type(s)     Head
Parameters       ""
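If you prefer the command line over the Azure portal, the script action can also be applied with the Azure CLI, roughly as sketched below; CLUSTERNAME and RESOURCEGROUP are placeholders for your own values, and the script takes no parameters:

# CLUSTERNAME and RESOURCEGROUP are placeholders; substitute your own values.
az hdinsight script-action execute \
    --cluster-name CLUSTERNAME \
    --resource-group RESOURCEGROUP \
    --name hive-adl-expand-location \
    --script-uri "https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-adl-expand-location-v01.sh" \
    --roles headnode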

Migrate with Export/Import using sqlpackage

Only HDInsight clusters created after 2020-10-15 support SQL Export/Import of the Hive default metastore DB by using sqlpackage.

  1. Install sqlpackage on the cluster.
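One possible way to install it is sketched below; the download URL (https://aka.ms/sqlpackage-linux) and the install location are assumptions, so verify them against the current sqlpackage documentation before running:

    # Download the Linux build of sqlpackage and unpack it into the SSH
    # user's home directory. The URL and paths are assumptions; check the
    # current sqlpackage documentation. Assumes unzip is available.
    wget -O /tmp/sqlpackage.zip "https://aka.ms/sqlpackage-linux"
    mkdir -p ~/sqlpackage
    unzip -o /tmp/sqlpackage.zip -d ~/sqlpackage
    chmod +x ~/sqlpackage/sqlpackage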

  2. Export the default metastore DB to a BACPAC file by running the following commands.

    wget "https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive_metastore_tool.py"
    SQLPACKAGE_FILE='/home/sshuser/sqlpackage/sqlpackage'  # replace with sqlpackage location
    TARGET_FILE='hive.bacpac'
    sudo python hive_metastore_tool.py --sqlpackagefile $SQLPACKAGE_FILE --targetfile $TARGET_FILE
    
  3. Save the BACPAC file. One option is shown below.

    # Store the BACPAC in the cluster's default storage.
    hdfs dfs -mkdir -p /bacpacs
    hdfs dfs -put "$TARGET_FILE" /bacpacs/
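If you also want a copy off the cluster, another option (a sketch; CLUSTERNAME is a placeholder, and the file is assumed to be in the SSH user's home directory) is to pull it to your workstation over SSH:

    # Run from your workstation; CLUSTERNAME is a placeholder for your cluster.
    scp sshuser@CLUSTERNAME-ssh.azurehdinsight.net:hive.bacpac .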
    
  4. Import the BACPAC file into a new database by following the steps listed here.
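If you'd rather import from the command line, sqlpackage can also perform the import, roughly as sketched below; the server name, database name, and credentials are placeholders (assumptions), not values from this article:

    # All target values below are placeholders; substitute your own logical
    # SQL server, database name, and credentials.
    ~/sqlpackage/sqlpackage /Action:Import \
        /SourceFile:hive.bacpac \
        /TargetServerName:yourserver.database.windows.net \
        /TargetDatabaseName:hivemetastoredb \
        /TargetUser:sqladmin \
        /TargetPassword:'<password>'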

  5. The new database is ready to be configured as an external metastore DB on a new HDInsight cluster.

Migrate using a Hive script

Clusters created before 2020-10-15 don't support Export/Import of the default metastore DB.

For such clusters, follow the guide Copy Hive tables across storage accounts, using a second cluster with an external Hive metastore DB. The second cluster can use the same storage account, but it must use a new default filesystem.

Option to "shallow" copy

Storage consumption doubles when tables are "deep" copied using the preceding guide, and you need to manually clean the data in the source storage container. Instead, you can "shallow" copy tables that are non-transactional. All Hive tables in HDInsight 3.6 are non-transactional by default, but only external tables are non-transactional in HDInsight 4.0; transactional tables must be deep copied. Follow these steps to shallow copy non-transactional tables:

  1. Execute the script hive-ddls.sh on the source cluster's primary headnode to generate the DDL for every Hive table.
  2. The DDL is written to a local Hive script named /tmp/hdi_hive_ddls.hql. Execute this script on the target cluster, which uses an external Hive metastore DB (for example, with beeline, as sketched below).
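A sketch of step 2 using beeline; the JDBC URL below is the usual HTTP-mode Hive endpoint on a non-ESP HDInsight cluster and is an assumption here, so adjust it for your cluster:

    # Replay the generated DDL on the target cluster through beeline.
    # The connection string is the typical HDInsight HTTP-mode endpoint.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -f /tmp/hdi_hive_ddls.hql

To confirm that a table is safe to shallow copy, running DESCRIBE FORMATTED on it in Hive shows whether the transactional table property is set to true.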

Verify that all Hive tables are imported

The following commands use a SQL query against the metastore DB to print all Hive tables and their data locations. Compare the outputs between the new and old clusters to verify that no tables are missing from the new metastore DB.

SCRIPT_FNAME='hive_metastore_tool.py'
SCRIPT="/tmp/$SCRIPT_FNAME"
# Download the helper script that runs SQL queries against the metastore DB.
wget -O "$SCRIPT" "https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/$SCRIPT_FNAME"
OUTPUT_FILE='/tmp/hivetables.csv'
# List every Hive table and its data location, sorted for easy comparison.
QUERY="SELECT DBS.NAME, TBLS.TBL_NAME, SDS.LOCATION FROM SDS, TBLS, DBS WHERE TBLS.SD_ID = SDS.SD_ID AND TBLS.DB_ID = DBS.DB_ID ORDER BY DBS.NAME, TBLS.TBL_NAME ASC;"
sudo python "$SCRIPT" --query "$QUERY" > "$OUTPUT_FILE"
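One way to compare the two outputs, assuming the commands above were run on both clusters, the resulting CSVs were gathered on one machine (the file names below are hypothetical), and the script emits comma-separated values as the .csv name suggests:

# Compare only database and table names, since data locations will differ
# between clusters. Lines prefixed with '<' exist only in the old cluster's
# listing, i.e. tables missing from the new metastore DB.
diff <(cut -d',' -f1,2 /tmp/hivetables-old.csv) <(cut -d',' -f1,2 /tmp/hivetables-new.csv)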

Further reading