Migrate Azure HDInsight 3.6 Hive workloads to HDInsight 4.0

HDInsight 4.0 has several advantages over HDInsight 3.6. Here is an overview of what's new in HDInsight 4.0.

This article covers the steps to migrate Hive workloads from HDInsight 3.6 to 4.0, including:

  • Hive metastore copy and schema upgrade
  • Safe migration for ACID compatibility
  • Preservation of Hive security policies

The new and old HDInsight clusters must have access to the same Storage Accounts.

Migration of Hive tables to a new Storage Account needs to be done as a separate step. See Hive Migration across Storage Accounts.

Steps to upgrade

1. Prepare the data

  • HDInsight 3.6 by default doesn't support ACID tables. If ACID tables are present, however, run 'MAJOR' compaction on them, as in the sketch below. See the Hive Language Manual for details on compaction.
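
    A minimal sketch of triggering a MAJOR compaction from a headnode; the table name and the JDBC URL are placeholders, and the URL shown is the typical in-cluster HiveServer2 HTTP endpoint, which may differ on your cluster:

    # Replace mydb.acid_table with your own database and table names.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e "ALTER TABLE mydb.acid_table COMPACT 'major';"
    # Monitor progress until the compaction is reported as succeeded.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -e "SHOW COMPACTIONS;"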

2. Copy the SQL database

  • If the cluster uses a default Hive metastore, follow this guide to export metadata to an external metastore. Then, create a copy of the external metastore for upgrade.

  • If the cluster uses an external Hive metastore, create a copy of it. Options include export/import and point-in-time restore; a sketch of the latter follows.
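
    For example, a point-in-time copy of an Azure SQL Database metastore can be created from the Azure CLI. This is a sketch only; the resource group, server, database names, and restore time are placeholders:

    # Create a point-in-time copy of the external metastore database.
    az sql db restore \
      --resource-group myResourceGroup \
      --server myserver \
      --name hivemetastore36 \
      --dest-name hivemetastore36-copy \
      --time "2024-01-01T00:00:00Z"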

3. Upgrade the metastore schema

This step uses the Hive Schema Tool from HDInsight 4.0 to upgrade the metastore schema.

Warning

This step is not reversible. Run it only on a copy of the metastore.

  1. Create a temporary HDInsight 4.0 cluster to access the 4.0 Hive schematool. You can use the default Hive metastore for this step.

  2. From the HDInsight 4.0 cluster, execute schematool to upgrade the target HDInsight 3.6 metastore:

    SERVER='servername.database.windows.net'  # replace with your SQL Server
    DATABASE='database'  # replace with your 3.6 metastore SQL Database
    USERNAME='username'  # replace with your 3.6 metastore username
    PASSWORD='password'  # replace with your 3.6 metastore password
    STACK_VERSION=$(hdp-select status hive-server2 | awk '{ print $3; }')
    /usr/hdp/$STACK_VERSION/hive/bin/schematool -upgradeSchema -url "jdbc:sqlserver://$SERVER;databaseName=$DATABASE;trustServerCertificate=false;encrypt=true;hostNameInCertificate=*.database.windows.net;" -userName "$USERNAME" -passWord "$PASSWORD" -dbType "mssql" --verbose
    

    Note

    This utility uses the beeline client to execute the SQL scripts in /usr/hdp/$STACK_VERSION/hive/scripts/metastore/upgrade/mssql/upgrade-*.mssql.sql.

    The SQL syntax in these scripts is not necessarily compatible with other client tools. For example, SSMS and the Query editor in the Azure portal require the keyword GO after each command.

    If any script fails because of resource capacity or transaction timeouts, scale up the SQL database.

  3. Verify the final version with the query select schema_version from dbo.version.

    The output should match that of the following bash command run on the HDInsight 4.0 cluster.

    grep . /usr/hdp/$(hdp-select --version)/hive/scripts/metastore/upgrade/mssql/upgrade.order.mssql | tail -n1 | rev | cut -d'-' -f1 | rev
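
    If you prefer to run the verification query from a client machine instead of a SQL tool, here is a sketch using sqlcmd, assuming sqlcmd is installed and the variables from the schematool step are still set:

    # Query the upgraded metastore's schema version directly from the SQL database.
    sqlcmd -S "$SERVER" -d "$DATABASE" -U "$USERNAME" -P "$PASSWORD" \
      -Q "SELECT schema_version FROM dbo.version;"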
    
  4. Delete the temporary HDInsight 4.0 cluster.

4. Deploy a new HDInsight 4.0 cluster

Create a new HDInsight 4.0 cluster, selecting the upgraded Hive metastore and the same Storage Accounts.

  • The new cluster doesn't require having the same default filesystem.

  • If the metastore contains tables residing in multiple Storage Accounts, you need to add those Storage Accounts to the new cluster to access those tables. See Add additional storage accounts to HDInsight.

  • If Hive jobs fail due to storage inaccessibility, verify that the table location is in a Storage Account added to the cluster.

    Use the following Hive command to identify the table location:

    SHOW CREATE TABLE ([db_name.]table_name|view_name);
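
    For example, a sketch that prints only the storage location of a hypothetical table mydb.example_table from a headnode; adjust the JDBC URL if your cluster doesn't use the default in-cluster HiveServer2 HTTP endpoint:

    # Extract the LOCATION line from the table definition.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
      -e "SHOW CREATE TABLE mydb.example_table;" | grep -i 'location\|abfs\|wasb'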
    

5. Convert tables for ACID compliance

Managed tables must be ACID-compliant on HDInsight 4.0. Run strictmanagedmigration on HDInsight 4.0 to convert all non-ACID managed tables to external tables with the property 'external.table.purge'='true'. Execute from the headnode:

sudo su - hive
STACK_VERSION=$(hdp-select status hive-server2 | awk '{ print $3; }')
/usr/hdp/$STACK_VERSION/hive/bin/hive --config /etc/hive/conf --service strictmanagedmigration --hiveconf hive.strict.managed.tables=true -m automatic --modifyManagedTables
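
After the tool finishes, you can spot-check a formerly managed, non-ACID table. This is a sketch only; the table name and JDBC URL are placeholders:

# Confirm the table is now EXTERNAL and carries 'external.table.purge'='true'.
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
  -e "DESCRIBE FORMATTED mydb.example_table;" | grep -iE 'Table Type|external.table.purge'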

Secure Hive across HDInsight versions

HDInsight optionally integrates with Azure Active Directory using the HDInsight Enterprise Security Package (ESP). ESP uses Kerberos and Apache Ranger to manage the permissions of specific resources within the cluster. Ranger policies deployed against Hive in HDInsight 3.6 can be migrated to HDInsight 4.0 with the following steps; a scripted alternative is sketched after the list:

  1. Navigate to the Ranger Service Manager panel in your HDInsight 3.6 cluster.
  2. Navigate to the policy named HIVE and export the policy to a JSON file.
  3. Make sure that all users referred to in the exported policy JSON exist in the new cluster. If a user is referred to in the policy JSON but doesn't exist in the new cluster, either add the user to the new cluster or remove the reference from the policy.
  4. Navigate to the Ranger Service Manager panel in your HDInsight 4.0 cluster.
  5. Navigate to the policy named HIVE and import the Ranger policy JSON from step 2.
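
If you prefer to script the export and import instead of using the Ranger UI, the Ranger admin REST API can be used. The following is a sketch only: the cluster names and the Hive service name are placeholders, the admin credentials are your cluster HTTP credentials, and endpoint parameters can vary by Ranger version.

# Export the HIVE service policies from the HDInsight 3.6 cluster's Ranger admin.
curl -u admin -o hive-policies.json \
  "https://OLDCLUSTERNAME.azurehdinsight.cn/ranger/service/plugins/policies/exportJson?serviceName=OLDCLUSTERNAME_hive"

# Import the exported JSON into the HDInsight 4.0 cluster's Ranger admin.
curl -u admin -X POST -F "file=@hive-policies.json" \
  "https://NEWCLUSTERNAME.azurehdinsight.cn/ranger/service/plugins/policies/importPoliciesFromFile?isOverride=true"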

Hive changes in HDInsight 4.0 that may require application changes

Refer to the HDInsight 4.0 announcement for additional changes.

The Hive View isn't available in HDInsight 4.0; you can install Data Analytics Studio (DAS) as a replacement query interface by following these steps:

  1. Download the script here, replace "azurehdinsight.net" with "azurehdinsight.cn" in the script, and save it locally with the name install-data-analytics-studio.sh.

    Upload the script file to an Azure Storage account. When done, the URL for the file in the Azure portal will be https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/install-data-analytics-studio.sh. You can upload it through the portal or from the command line, as in the sketch after the note below.

    Note

    The instructions assume you have already created a storage account in the Azure portal and set the storage access level to "Container". For details on how to create a storage account and upload files, see Create a storage account and Upload, download, and list blobs with the Azure portal.
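
    As an alternative to the portal, a sketch of the upload from the Azure CLI, with placeholder account and container names (authentication via az login or an account key is assumed):

    # Upload the edited installer script to the storage container.
    az storage blob upload \
      --account-name <storage_account_name> \
      --container-name <container_name> \
      --name install-data-analytics-studio.sh \
      --file ./install-data-analytics-studio.sh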

  2. Save the following script locally as LaunchDASInstaller.sh, and then upload it to the Azure Storage account. Note that <storage_account_name> and <container_name> in the script should be replaced with the actual names from step 1.

    #!/bin/sh
    set -e
    set -x
    
    echo "Launching DAS installation setup script."
    
    sudo wget https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/install-data-analytics-studio.sh -O /tmp/install-das.sh
    sudo chmod +x /tmp/install-das.sh
    
    # Fire-and-forget the das installer to circumvent Ambari operation deadlock
    # We also make sure the script cleans up after itself
    
    nohup sh -c '/tmp/install-das.sh && rm -f /tmp/install-das.sh' &
    
    echo "Das installation launch succeeded."
    exit 0
    

    Make a note of the file URL, which will be https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/LaunchDASInstaller.sh in the Azure portal, to use in the next step.

  3. Launch a script action against your cluster, with "Head nodes" as the node type for execution.

    Go to the Azure portal and navigate to the HDInsight cluster you created.

    From the default view, under Settings, select Script actions.

    From the top of the Script actions page, select + Submit new.

    Select Custom for Script type. Then provide the Name, and paste the URI https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/LaunchDASInstaller.sh saved in the previous step into the textbox marked Bash Script URI. Finally, select the Create button to apply the script to the cluster. Alternatively, you can submit the script action from the Azure CLI, as in the sketch below.
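
    A sketch of the same script action from the Azure CLI, with placeholder resource group, cluster, and script-action names:

    # Run the launcher script on the head nodes of the cluster.
    az hdinsight script-action execute \
      --resource-group myResourceGroup \
      --cluster-name myhdi40cluster \
      --name InstallDAS \
      --script-uri "https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>/LaunchDASInstaller.sh" \
      --roles headnode \
      --persist-on-success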

Wait 5 to 10 minutes, then launch Data Analytics Studio by using this URL: https://<clustername>.azurehdinsight.cn/das/

Once DAS is installed, if you don't see the queries you've run in the queries viewer, do the following steps:

  1. Set the configurations for Hive, Tez, and DAS as described in this guide for troubleshooting DAS installation.
  2. Make sure that the following Azure Storage directory configs are page blobs, and that they're listed under fs.azure.page.blob.dirs (see the sketch after this list):
    • hive.hook.proto.base-directory
    • tez.history.logging.proto-base-dir
  3. Restart HDFS, Hive, Tez, and DAS on both headnodes.
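
A quick way to check the relevant settings from a headnode, as a sketch; the config file paths are typical HDP locations and may differ on your cluster:

# List the directories currently configured as page blob directories.
hdfs getconf -confKey fs.azure.page.blob.dirs

# Read the two directory settings that must be covered by the list above.
grep -A1 'hive.hook.proto.base-directory' /etc/hive/conf/hive-site.xml
grep -A1 'tez.history.logging.proto-base-dir' /etc/tez/conf/tez-site.xml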


Further reading