Hive workload migration to new account in Azure Storage

Learn how to use script actions to copy Hive tables across storage accounts in HDInsight. This may be useful when migrating to Azure Data Lake Storage Gen2.

To manually copy an individual Hive table on HDInsight 4.0, see Hive export/import.
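For reference, a minimal sketch of that manual flow using Hive's EXPORT and IMPORT statements from an SSH session on each cluster's headnode; mydb, mytable, and the export path are placeholder names:

    # On the original cluster: write the table definition and data to a shared path.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e "USE mydb; EXPORT TABLE mytable TO '/hive-export/mydb/mytable';"

    # On the new cluster: re-create the table from that path.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e "USE mydb; IMPORT TABLE mytable FROM '/hive-export/mydb/mytable';"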

Prerequisites

  • A new HDInsight cluster with the following configuration:

    • A storage account that both the original and new clusters can access.

      Here are some options (a quick access check is sketched after this list):

      • Add the target storage account to the original cluster.
      • Add the original storage account to the new cluster.
      • Add an intermediary storage account to both the original and new clusters.
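      Whichever option you choose, you can confirm from an SSH session on a headnode that each cluster can reach the shared account. A minimal sketch, where contosostore and hive-migration are placeholder account and container names:

          # Blob storage account (WASB driver):
          hdfs dfs -ls wasbs://hive-migration@contosostore.blob.core.windows.net/

          # Azure Data Lake Storage Gen2 account (ABFS driver):
          hdfs dfs -ls abfs://hive-migration@contosostore.dfs.core.windows.net/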

How it works

We'll run a script action to export Hive tables from the original cluster to a specified HDFS directory. See Script action to a running cluster.

Then, we'll run another script action on the new cluster to import the Hive tables from the HDFS directory.

The script re-creates the tables on the new cluster's default filesystem. For native tables, the data is also copied in storage. Non-native tables are copied by definition only. See Hive Storage Handlers for details on non-native tables.

The paths of external tables that aren't in the Hive warehouse directory are preserved. Other tables are copied to the target cluster's default Hive paths. See the Hive properties hive.metastore.warehouse.external.dir and hive.metastore.warehouse.dir.
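To see where the target cluster will place copied tables, you can print both properties from a headnode. A small check, assuming the HTTP-mode beeline endpoint that HDInsight exposes on headnodes:

    # Show the managed and external warehouse directories on the target cluster.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e 'SET hive.metastore.warehouse.dir; SET hive.metastore.warehouse.external.dir;'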

The scripts don't preserve custom file permissions in the target cluster.

Note

This guide supports copying metadata objects related to Hive databases, tables, and partitions. Other metadata objects must be re-created manually.

  • For views, Hive supports the SHOW VIEWS command as of Hive 2.2.0 on HDInsight 4.0. Use SHOW CREATE TABLE for the view definition. For earlier versions of Hive, query the metastore SQL DB to list views.
  • For materialized views, use the commands SHOW MATERIALIZED VIEWS, DESCRIBE FORMATTED, and CREATE MATERIALIZED VIEW. See Materialized views for details.
  • For constraints, supported as of Hive 2.1.0 on HDInsight 4.0, use DESCRIBE EXTENDED to list the constraints for a table, and use ALTER TABLE to add constraints. See Alter Table Constraints for details. Example statements for all three object types are sketched after this list.
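As an illustration, you might enumerate these objects from a headnode before re-creating them on the new cluster; mydb, myview, mymatview, and mytable are placeholder names:

    # Views: list them, then capture each definition.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e 'USE mydb; SHOW VIEWS; SHOW CREATE TABLE myview;'

    # Materialized views: list and describe, then re-create with CREATE MATERIALIZED VIEW.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e 'USE mydb; SHOW MATERIALIZED VIEWS; DESCRIBE FORMATTED mymatview;'

    # Constraints: list with DESCRIBE EXTENDED, then re-add with ALTER TABLE ... ADD CONSTRAINT.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e 'USE mydb; DESCRIBE EXTENDED mytable;'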

Copy Hive tables

  1. Apply the "export" script action on the original cluster with the following fields.

    This generates and executes intermediary Hive scripts, which are saved to the specified <hdfs-export-path>.

    Optionally, use --run-script=false to customize the scripts before executing them manually.

    Property         Value
    Bash script URI  https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/export-hive-data-v01.sh
    Node type(s)     Head
    Parameters       <hdfs-export-path> --run-script
    usage: generate Hive export and import scripts and export Hive data to specified HDFS path
           [--run-script={true,false}]
           hdfs-export-path
    
    positional arguments:
    
        hdfs-export-path      remote HDFS directory to write export data to
    
    optional arguments:
        --run-script={true,false}
                            whether to execute the generated Hive export script
                            (default: true)
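
    The script action can also be submitted from the Azure CLI instead of the portal. A sketch, where the resource group, cluster name, script action name, and export path are placeholders:

        # Run the export script on the original cluster's head nodes.
        az hdinsight script-action execute \
          --resource-group my-resource-group \
          --cluster-name original-cluster \
          --name export-hive-data \
          --script-uri https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/export-hive-data-v01.sh \
          --roles headnode \
          --script-action-parameters "/hive-export"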
    
  2. After the export completes successfully, apply the "import" script action on the new cluster with the following fields.

    Property         Value
    Bash script URI  https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/import-hive-data-v01.sh
    Node type(s)     Head
    Parameters       <hdfs-export-path>
    usage: download Hive import script from specified HDFS path and execute it
           hdfs-export-path
    
    positional arguments:
    
      hdfs-export-path      remote HDFS directory to download Hive import script from
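
    The import step can be scripted the same way, with the same placeholder names:

        # Run the import script on the new cluster's head nodes.
        az hdinsight script-action execute \
          --resource-group my-resource-group \
          --cluster-name new-cluster \
          --name import-hive-data \
          --script-uri https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/import-hive-data-v01.sh \
          --roles headnode \
          --script-action-parameters "/hive-export"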
    
    

Verification

Download the script hive_contents.sh, run it as the root user on the primary headnode of each cluster, and compare the contents of the output file /tmp/hive_contents.out. See Connect to HDInsight (Apache Hadoop) using SSH.
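The script itself isn't reproduced here, but a minimal sketch of the same idea (listing every database and table into /tmp/hive_contents.out so the two files can be diffed) might look like this:

    #!/bin/bash
    # Enumerate all Hive databases and their tables into /tmp/hive_contents.out.
    URL='jdbc:hive2://headnodehost:10001/;transportMode=http'
    > /tmp/hive_contents.out
    for db in $(beeline -u "$URL" --silent=true --showHeader=false --outputformat=tsv2 -e 'SHOW DATABASES;'); do
        beeline -u "$URL" --silent=true --showHeader=false --outputformat=tsv2 \
            -e "SHOW TABLES IN $db;" | sed "s|^|$db.|" >> /tmp/hive_contents.out
    done
    sort -o /tmp/hive_contents.out /tmp/hive_contents.out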

Cleanup additional storage usage

After the storage migration is complete and verified, you can delete the data in the specified HDFS export path.

One option is to use the HDFS command hdfs dfs -rm -R.
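A sketch, assuming the placeholder export path used in the earlier examples:

    # Remove the export directory and everything under it.
    hdfs dfs -rm -R /hive-export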

Option: reduce additional storage usage

The export script action likely doubles the storage usage, because Hive copies the data of native tables. However, you can limit the additional storage usage by migrating manually, one database or table at a time.

  1. Specify --run-script=false to skip execution of the generated Hive script. The Hive export and import scripts are still saved to the export path.

  2. Execute snippets of the Hive export and import scripts database by database or table by table, manually cleaning up the export path after each migrated database or table (a sketch follows this list).
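
A sketch of that per-database loop; the generated script file names under <hdfs-export-path> are illustrative, so list the export path first to see what the export script action actually wrote:

    # Inspect the generated scripts and exported data.
    hdfs dfs -ls /hive-export

    # Copy the generated export script locally, trim it down to one database,
    # and run just that piece (repeat on the new cluster with the import script).
    hdfs dfs -cat /hive-export/<generated-export-script> > export-one-db.hql
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -f export-one-db.hql

    # Free the space for that database before migrating the next one.
    hdfs dfs -rm -R /hive-export/<migrated-database-directory>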

Next steps