Migrate Azure HDInsight 3.6 Hive workloads to HDInsight 4.0

This document shows how to migrate Apache Hive and LLAP workloads on HDInsight 3.6 to HDInsight 4.0. HDInsight 4.0 provides newer Hive and LLAP features such as materialized views and query result caching. When you migrate your workloads to HDInsight 4.0, you can use many newer features of Hive 3 that aren't available on HDInsight 3.6.

This article covers the following subjects:

  • Migration of Hive metadata to HDInsight 4.0
  • Safe migration of ACID and non-ACID tables
  • Preservation of Hive security policies across HDInsight versions
  • Query execution and debugging from HDInsight 3.6 to HDInsight 4.0

One advantage of Hive is the ability to export metadata to an external database (referred to as the Hive metastore). The Hive metastore is responsible for storing table statistics, including the table storage location, column names, and table index information. HDInsight 3.6 and HDInsight 4.0 require different metastore schemas and can't share a single metastore. The recommended way to upgrade the Hive metastore safely is to upgrade a copy in your current production environment rather than the original. This document requires the original and new clusters to have access to the same storage account. Therefore, it doesn't cover data migration to another region.

Migrate from external metastore

1. Run major compaction on ACID tables in HDInsight 3.6

HDInsight 3.6 and HDInsight 4.0 ACID tables understand ACID deltas differently. The only action required before migration is to run a MAJOR compaction against each ACID table on the 3.6 cluster. See the Hive Language Manual for details on compaction.
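As an illustrative sketch only, a MAJOR compaction can be triggered with Beeline from an SSH session on a 3.6 headnode without ESP. The database and table name (sales_db.orders) and the connection string are placeholders that may differ on your cluster.

    # Request a MAJOR compaction for one ACID table, then check compaction progress.
    # Partitioned tables need a separate statement per partition (PARTITION (...) clause).
    beeline -u "jdbc:hive2://localhost:10001/;transportMode=http" -e "
      ALTER TABLE sales_db.orders COMPACT 'major';
      SHOW COMPACTIONS;"

Compactions run asynchronously; wait until SHOW COMPACTIONS reports the request as completed before copying the metastore.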

2. Copy the SQL database

Create a new copy of your external metastore. If you're using an external metastore, one safe and easy way to make a copy of the metastore is to restore the database under a different name by using the RESTORE function. See Use external metadata stores in Azure HDInsight to learn more about attaching an external metastore to an HDInsight cluster.
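As a hedged sketch of one way to do this for an Azure SQL Database metastore with the Azure CLI (the resource group, server, and database names below are placeholders, and a point-in-time restore with az sql db restore is an equally valid option):

    # Create a copy of the 3.6 Hive metastore database under a new name on the same server.
    az sql db copy \
      --resource-group MyResourceGroup \
      --server myhivemetastoreserver \
      --name hivemetastore36 \
      --dest-server myhivemetastoreserver \
      --dest-name hivemetastore40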

3. Upgrade the metastore schema

Once the metastore copy is complete, run a schema upgrade script as a script action on the existing HDInsight 3.6 cluster to upgrade the new metastore to the Hive 3 schema. (This step doesn't require the new metastore to be connected to a cluster.) This allows the database to be attached as the HDInsight 4.0 metastore.

Use the values in the table below. Replace SQLSERVERNAME DATABASENAME USERNAME PASSWORD with the appropriate values for the Hive metastore copy, separated by spaces. Don't include ".database.chinacloudapi.cn" when specifying the SQL server name.

Property          Value
Script type       Custom
Name              Hive upgrade
Bash script URI   https://hdiconfigactions.blob.core.chinacloudapi.cn/hivemetastoreschemaupgrade/launch-schema-upgrade.sh
Node type(s)      Head
Parameters        SQLSERVERNAME DATABASENAME USERNAME PASSWORD

Warning

The upgrade that converts the HDInsight 3.6 metadata schema to the HDInsight 4.0 schema cannot be reversed.

You can verify the upgrade by running the following SQL query against the database:

select * from dbo.version

4. Deploy a new HDInsight 4.0 cluster

  1. Specify the upgraded metastore as the new cluster's Hive metastore.

  2. The actual data from the tables, however, isn't accessible until the cluster has access to the necessary storage accounts. Make sure that the storage accounts holding the Hive tables in the HDInsight 3.6 cluster are specified as either the primary or secondary storage accounts of the new HDInsight 4.0 cluster. For more information about adding storage accounts to HDInsight clusters, see Add additional storage accounts to HDInsight.

5. Complete migration with the post-upgrade tool in HDInsight 4.0

Managed tables must be ACID-compliant on HDInsight 4.0 by default. Once you've completed the metastore migration, run a post-upgrade tool to make previously non-ACID managed tables compatible with the HDInsight 4.0 cluster. The tool applies the following conversions:

3.6                        4.0
External tables            External tables
Non-ACID managed tables    External tables with property 'external.table.purge'='true'
ACID managed tables        ACID managed tables

Execute the Hive post-upgrade tool from the HDInsight 4.0 cluster using an SSH shell:

  1. Connect to your cluster headnode using SSH. For instructions, see Connect to HDInsight using SSH.

  2. Open a login shell as the Hive user by running sudo su - hive.

  3. Execute the following commands from the shell.

    STACK_VERSION=$(hdp-select status hive-server2 | awk '{ print $3; }')
    /usr/hdp/$STACK_VERSION/hive/bin/hive --config /etc/hive/conf --service  strictmanagedmigration --hiveconf hive.strict.managed.tables=true -m automatic --modifyManagedTables
    

After the tool completes, your Hive warehouse is ready for HDInsight 4.0.
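To spot-check the conversion, you can inspect a table's properties with Beeline. This is a minimal sketch for a cluster without ESP, and tablename is a placeholder.

    # Table Type and the table parameters (transactional, external.table.purge)
    # show how the post-upgrade tool converted the table.
    beeline -u "jdbc:hive2://localhost:10001/;transportMode=http" -e "DESCRIBE FORMATTED tablename;"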

Migrate from internal metastore

If your HDInsight 3.6 cluster uses an internal Hive metastore, follow the steps below to run a script that generates Hive queries to export object definitions from the metastore.

The HDInsight 3.6 and 4.0 clusters must use the same storage account.

Note

  • In the case of ACID tables, a new copy of the data underneath the table is created.

  • This script supports migration of Hive databases, tables, and partitions only. Other metadata objects, like views, UDFs, and table constraints, must be copied manually.

  • Once this script completes, it is assumed that the old cluster will no longer be used to access any of the tables or databases referred to in the script.

  • All managed tables become transactional in HDInsight 4.0. Optionally, keep a table non-transactional by exporting its data to an external table with the property 'external.table.purge'='true'. For example:

    create table tablename_backup like tablename;
    insert overwrite table tablename_backup select * from tablename;
    create external table tablename_tmp like tablename;
    insert overwrite table tablename_tmp select * from tablename;
    alter table tablename_tmp set tblproperties('external.table.purge'='true');
    drop table tablename;
    alter table tablename_tmp rename to tablename;
    
  1. Connect to the HDInsight 3.6 cluster by using a Secure Shell (SSH) client.

  2. From the open SSH session, download the following script file, which generates a file named alltables.hql when executed.

    wget https://hdiconfigactions.blob.core.chinacloudapi.cn/hivemetastoreschemaupgrade/exporthive_hdi_3_6.sh
    chmod 755 exporthive_hdi_3_6.sh
    
    • For a regular HDInsight cluster without ESP, simply execute exporthive_hdi_3_6.sh.

    • For a cluster with ESP, run kinit and modify the arguments to beeline: run the following, defining USER and DOMAIN for an Azure AD user with full Hive permissions.

      USER="USER"  # replace USER
      DOMAIN="DOMAIN"  # replace DOMAIN
      DOMAIN_UPPER=$(printf "%s" "$DOMAIN" | awk '{ print toupper($0) }')
      kinit "$USER@$DOMAIN_UPPER"
      
      hn0=$(grep hn0- /etc/hosts | xargs | cut -d' ' -f4)
      BEE_CMD="beeline -u 'jdbc:hive2://$hn0:10001/default;principal=hive/_HOST@$DOMAIN_UPPER;auth-kerberos;transportMode=http' -n "$USER@$DOMAIN" --showHeader=false --silent=true --outputformat=tsv2 -e"
      ./exporthive_hdi_3_6.sh "$BEE_CMD"
      
  3. Exit your SSH session. Then enter an scp command to download alltables.hql locally.

    scp sshuser@CLUSTERNAME-ssh.azurehdinsight.cn:alltables.hql c:/hdi
    
  4. Upload alltables.hql to the new HDInsight cluster.

    scp c:/hdi/alltables.hql sshuser@CLUSTERNAME-ssh.azurehdinsight.cn:/home/sshuser/
    
  5. Then use SSH to connect to the new HDInsight 4.0 cluster. Run the following code from an SSH session to this cluster:

    Without ESP:

    beeline -u "jdbc:hive2://localhost:10001/;transportMode=http" -f alltables.hql
    

    With ESP:

    USER="USER"  # replace USER
    DOMAIN="DOMAIN"  # replace DOMAIN
    DOMAIN_UPPER=$(printf "%s" "$DOMAIN" | awk '{ print toupper($0) }')
    kinit "$USER@$DOMAIN_UPPER"
    
    hn0=$(grep hn0- /etc/hosts | xargs | cut -d' ' -f4)
    beeline -u "jdbc:hive2://$hn0:10001/default;principal=hive/_HOST@$DOMAIN_UPPER;auth-kerberos;transportMode=http" -n "$USER@$DOMAIN" -f alltables.hql
    

The post-upgrade tool for external metastore migration does not apply here, since non-ACID managed tables from HDInsight 3.6 are converted to ACID managed tables in HDInsight 4.0.

Important

Managed tables in HDInsight 4.0 (including tables migrated from 3.6) should not be accessed by other services or applications, including HDInsight 3.6 clusters.

Secure Hive across HDInsight versions

Since HDInsight 3.6, HDInsight integrates with Azure Active Directory using the HDInsight Enterprise Security Package (ESP). ESP uses Kerberos and Apache Ranger to manage the permissions of specific resources within the cluster. Ranger policies deployed against Hive in HDInsight 3.6 can be migrated to HDInsight 4.0 with the following steps:

  1. Navigate to the Ranger Service Manager panel in your HDInsight 3.6 cluster.
  2. Navigate to the policy named HIVE and export the policy to a JSON file.
  3. Make sure that all users referred to in the exported policy JSON exist in the new cluster. If a user is referred to in the policy JSON but doesn't exist in the new cluster, either add the user to the new cluster or remove the reference from the policy.
  4. Navigate to the Ranger Service Manager panel in your HDInsight 4.0 cluster.
  5. Navigate to the policy named HIVE and import the Ranger policy JSON from step 2.
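If you prefer scripting the export over using the UI, recent Ranger versions also expose a policy export REST endpoint. The sketch below is an assumption-laden illustration: the endpoint path, the admin credentials, and the Hive service name (typically <clustername>_hive on HDInsight) may differ on your cluster.

    # Export the Hive Ranger policies to a JSON file over the Ranger admin REST API.
    curl -u admin -o hive-ranger-policies.json \
      "https://CLUSTERNAME.azurehdinsight.cn/ranger/service/plugins/policies/exportJson?serviceName=CLUSTERNAME_hive"

The exported JSON can then be imported through the Ranger Service Manager UI on the HDInsight 4.0 cluster, as described in step 5.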

Check compatibility and modify code as needed in a test app

When migrating workloads such as existing programs and queries, check the release notes and documentation for changes and apply the changes as necessary. If your HDInsight 3.6 cluster uses a shared Spark and Hive metastore, additional configuration using the Hive Warehouse Connector is required.

Deploy the new app for production

To switch to the new cluster, you can, for example, install a new client application and use it as the new production environment, or you can upgrade your existing client application and switch it to HDInsight 4.0.

Switch HDInsight 4.0 to production

If differences were created in the metastore while testing, you'll need to apply those changes just before switching. In that case, you can export and import the metastore and then upgrade it again.

Remove the old production environment

Once you've confirmed that the release is complete and fully operational, you can remove version 3.6 and the previous metastore. Make sure that everything is migrated before deleting the environment.

Query execution across HDInsight versions

There are two ways to execute and debug Hive/LLAP queries within an HDInsight 3.6 cluster: HiveCLI provides a command-line experience, and the Tez view/Hive view provides a GUI-based workflow.

In HDInsight 4.0, HiveCLI has been replaced by Beeline, and the Tez view/Hive view provides a GUI-based workflow. HiveCLI is a Thrift client for HiveServer1, and Beeline is a JDBC client that provides access to HiveServer2. Beeline can also be used to connect to any other JDBC-compatible database endpoint. Beeline is available out of the box on HDInsight 4.0, with no installation needed.
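For example, assuming the same HiveServer2 HTTP endpoint used later in this article, an ad-hoc query can be run from a headnode SSH session like this (a minimal sketch for a cluster without ESP):

    # Run a one-off query with Beeline against the local HiveServer2 over HTTP transport.
    beeline -u "jdbc:hive2://localhost:10001/;transportMode=http" -e "SHOW DATABASES;"

The steps that follow install Data Analytics Studio (DAS), which provides a browser-based query experience on the HDInsight 4.0 cluster.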

  1. Download the script here, replace "azurehdinsight.net" with "azurehdinsight.cn" in the script, and save it locally with the name install-data-analytics-studio.sh.

    Upload the script file to an Azure storage account. After the upload, the URL for the file in the Azure portal will be https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/install-data-analytics-studio.sh. (An Azure CLI upload example is sketched after the note below.)

    Note

    These instructions assume you have already created a storage account in the Azure portal and set the storage access level to "Container". For details on how to create a storage account and upload files, see Create a storage account and Upload, download, and list blobs with the Azure portal.
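    As an alternative to uploading through the portal, here is a hedged Azure CLI sketch. The storage account and container names are the placeholders from step 1, and authentication (for example --auth-mode login or an account key) depends on your setup.

    # Upload the edited installer script to the blob container created in step 1.
    az storage blob upload \
      --account-name <storage_account_name> \
      --container-name <container_name> \
      --name install-data-analytics-studio.sh \
      --file ./install-data-analytics-studio.sh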

  2. Save the following script locally as LaunchDASInstaller.sh, and then upload it to the Azure storage account. Note that <storage_account_name> and <container_name> in the script should be replaced with the actual names created in step 1.

    #!/bin/sh
    set -e
    set -x
    
    echo "Launching DAS installation setup script."
    
    sudo wget https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/install-data-analytics-studio.sh -O /tmp/install-das.sh
    sudo chmod +x /tmp/install-das.sh
    
    # Fire-and-forget the das installer to circumvent Ambari operation deadlock
    # We also make sure the script cleans up after itself
    
    nohup sh -c '/tmp/install-das.sh && rm -f /tmp/install-das.sh' &
    
    echo "Das installation launch succeeded."
    exit 0
    

    Make a note of the file URL, which will be https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/LaunchDASInstaller.sh in the Azure portal, for use in the next step.

  3. Launch a script action against your cluster, with "Head nodes" as the node type for execution.

    Go to the Azure portal and navigate to the HDInsight cluster you created.

    From the default view, under Settings, select Script actions.

    From the top of the Script actions page, select + Submit new.

    Select Custom for Script type. Then provide the Name and paste the URI https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/LaunchDASInstaller.sh from the previous step into the text box marked Bash script URI. Finally, select the Create button to apply the script to the cluster.

Wait 5 to 10 minutes, then launch Data Analytics Studio by using this URL: https://<clustername>.azurehdinsight.cn/das/

Once DAS is installed, if you don't see the queries you've run in the queries viewer, do the following steps:

  1. Set the configurations for Hive, Tez, and DAS as described in this guide for troubleshooting DAS installation.
  2. Make sure that the following Azure storage directories are configured as page blobs and that they're listed under fs.azure.page.blob.dirs:
    • hive.hook.proto.base-directory
    • tez.history.logging.proto-base-dir
  3. Restart HDFS, Hive, Tez, and DAS on both headnodes.


Next steps