Migrate on-premises Apache Hadoop clusters to Azure HDInsight - data migration best practices

This article gives recommendations for data migration to Azure HDInsight. It's part of a series that provides best practices to assist with migrating on-premises Apache Hadoop systems to Azure HDInsight.

Migrate on-premises data to Azure

There are two main options to migrate data from on-premises to an Azure environment:

  1. Transfer data over the network with TLS
    1. Over the internet - You can transfer data to Azure Storage over a regular internet connection using any one of several tools, such as Azure Storage Explorer, AzCopy, Azure PowerShell, and Azure CLI. For more information, see Moving data to and from Azure Storage. A sketch of an AzCopy upload appears after this list.
    2. ExpressRoute - ExpressRoute is an Azure service that lets you create private connections between Microsoft datacenters and infrastructure that's on your premises or in a colocation facility. ExpressRoute connections don't go over the public internet, and they offer higher security, reliability, and speeds with lower latencies than typical connections over the internet. For more information, see Create and modify an ExpressRoute circuit.
    3. Data Box online data transfer - Data Box Edge and Data Box Gateway are online data transfer products that act as network storage gateways to manage data between your site and Azure. Data Box Edge, an on-premises network device, transfers data to and from Azure and uses artificial intelligence (AI)-enabled edge compute to process data. Data Box Gateway is a virtual appliance with storage gateway capabilities. For more information, see Azure Data Box Documentation - Online Transfer.
  2. Ship data offline
    1. Data Box offline data transfer - Data Box, Data Box Disk, and Data Box Heavy devices help you transfer large amounts of data to Azure when the network isn't an option. These offline data transfer devices are shipped between your organization and the Azure datacenter. They use AES encryption to help protect your data in transit, and they undergo a thorough post-upload sanitization process to delete your data from the device. For more information, see Azure Data Box Documentation - Offline Transfer.
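As a minimal sketch of the over-the-internet option, the following AzCopy v10 command uploads a local directory to Blob storage. The local path, storage account, container, and SAS token are placeholders, and the Azure China blob endpoint mirrors the examples later in this article:

azcopy copy "/data/source" "https://<storage_account_name>.blob.core.chinacloudapi.cn/<container_name>?<SAS_token>" --recursive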

The following table shows approximate data transfer durations based on data volume and network bandwidth. Use a Data Box if the data migration is expected to take more than three weeks.

| Data quantity | Network bandwidth of 45 Mbps (T3) | Network bandwidth of 100 Mbps | Network bandwidth of 1 Gbps | Network bandwidth of 10 Gbps |
|---|---|---|---|---|
| 1 TB | 2 days | 1 day | 2 hours | 14 minutes |
| 10 TB | 22 days | 10 days | 1 day | 2 hours |
| 35 TB | 76 days | 34 days | 3 days | 8 hours |
| 80 TB | 173 days | 78 days | 8 days | 19 hours |
| 100 TB | 216 days | 97 days | 10 days | 1 day |
| 200 TB | 1 year | 194 days | 19 days | 2 days |
| 500 TB | 3 years | 1 year | 49 days | 5 days |
| 1 PB | 6 years | 3 years | 97 days | 10 days |
| 2 PB | 12 years | 5 years | 194 days | 19 days |
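These figures follow from dividing the data volume by the raw line rate. For example, 1 TB is roughly 8 × 10^12 bits, so at 45 Mbps the transfer takes about 8 × 10^12 / (45 × 10^6) ≈ 178,000 seconds, or roughly 2 days, which matches the first row of the table. Real throughput is usually lower because of protocol overhead and shared links, so treat these values as optimistic estimates.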

Tools such as Apache Hadoop DistCp, Azure Data Factory, and AzureCp can be used to transfer data over the network. The third-party tool WANDisco can also be used for the same purpose. Apache Kafka MirrorMaker and Apache Sqoop can be used for ongoing data transfer from on-premises to Azure storage systems.

Performance considerations when using Apache Hadoop DistCp

DistCp is an Apache project that uses a MapReduce map job to transfer data, handle errors, and recover from those errors. It assigns a list of source files to each map task. The map task then copies all of its assigned files to the destination. There are several techniques that can improve the performance of DistCp.

Increase the number of mappers

DistCp tries to create map tasks so that each one copies roughly the same number of bytes. By default, DistCp jobs use 20 mappers. Using more mappers for DistCp (with the -m parameter at the command line) increases parallelism during the data transfer process and decreases the length of the data transfer. However, there are two things to consider when increasing the number of mappers:

  1. DistCp's lowest granularity is a single file. Specifying more mappers than the number of source files doesn't help and wastes the available cluster resources.
  2. Consider the available YARN memory on the cluster when determining the number of mappers. Each map task is launched as a YARN container. Assuming that no other heavy workloads are running on the cluster, the number of mappers can be determined by the formula: m = (number of worker nodes * YARN memory per worker node) / YARN container size. However, if other applications are using memory, use only a portion of the YARN memory for DistCp jobs.
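As a rough illustration with assumed numbers, a cluster with 8 worker nodes, 96 GB of YARN memory per node, and 8-GB YARN containers gives m = (8 * 96 GB) / 8 GB = 96 mappers, which would be passed to DistCp as shown below. The node count, memory sizes, and paths here are hypothetical placeholders:

hadoop distcp -m 96 hdfs://nn1:8020/foo/bar wasb://<container_name>@<storage_account_name>.blob.core.chinacloudapi.cn/foo/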

Use more than one DistCp job

When the size of the dataset to be moved is larger than 1 TB, use more than one DistCp job. Using more than one job limits the impact of failures. If any job fails, only that specific job needs to be restarted rather than all of them.
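For example, a large dataset can be split by top-level directory into independent DistCp jobs that are run and, if needed, restarted separately. The /data/2021 and /data/2022 source paths below are hypothetical:

hadoop distcp hdfs://nn1:8020/data/2021 wasb://<container_name>@<storage_account_name>.blob.core.chinacloudapi.cn/data/2021
hadoop distcp hdfs://nn1:8020/data/2022 wasb://<container_name>@<storage_account_name>.blob.core.chinacloudapi.cn/data/2022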

Consider splitting files

If there are a small number of large files, consider splitting them into 256-MB chunks to get more potential concurrency with more mappers.

Use the 'strategy' command-line parameter

Consider using the -strategy dynamic parameter on the command line. The default value of the strategy parameter is uniform size, in which case each map copies roughly the same number of bytes. When this parameter is changed to dynamic, the listing file is split into several "chunk-files". The number of chunk-files is a multiple of the number of maps. Each map task is assigned one of the chunk-files. After all the paths in a chunk are processed, the current chunk is deleted and a new chunk is acquired. The process continues until no more chunks are available. This "dynamic" approach allows faster map tasks to consume more paths than slower ones, which speeds up the DistCp job overall.

Increase the number of threads

See if increasing the -numListstatusThreads parameter improves performance. This parameter controls the number of threads used to build the file listing; 40 is the maximum value.

Use the output committer algorithm

See if passing the parameter -Dmapreduce.fileoutputcommitter.algorithm.version=2 improves DistCp performance. This output committer algorithm has optimizations for writing output files to the destination. The following command is an example that shows the usage of the different parameters:

hadoop distcp -Dmapreduce.fileoutputcommitter.algorithm.version=2 -numListstatusThreads 30 -m 100 -strategy dynamic hdfs://nn1:8020/foo/bar wasb://<container_name>@<storage_account_name>.blob.core.chinacloudapi.cn/foo/

Metadata migration

Apache Hive

The Hive metastore can be migrated either by using scripts or by using database replication.

Hive metastore migration using scripts

  1. Generate the Hive DDLs from the on-premises Hive metastore. This step can be done using a wrapper bash script; a sketch of such a script follows this list.
  2. Edit the generated DDL to replace the HDFS URLs with WASB/ADLS/ABFS URLs.
  3. Run the updated DDL on the metastore from the HDInsight cluster.
  4. Make sure that the Hive metastore version is compatible between on-premises and cloud.
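The following is a minimal sketch of steps 1 and 2, not a production script. It assumes a hypothetical Beeline connection to the on-premises HiveServer2 and a single database; the JDBC URL, database name, and target WASB URL are placeholders:

#!/bin/bash
# Step 1: dump a CREATE TABLE statement for every table in the database.
JDBC_URL="jdbc:hive2://onprem-hs2:10000"
DB="mydb"
for TABLE in $(beeline -u "$JDBC_URL" --silent=true --showHeader=false --outputformat=tsv2 -e "USE $DB; SHOW TABLES;"); do
  beeline -u "$JDBC_URL" --silent=true --showHeader=false --outputformat=tsv2 \
    -e "USE $DB; SHOW CREATE TABLE $TABLE;" >> "${DB}_ddl.hql"
  echo ";" >> "${DB}_ddl.hql"
done
# Step 2: rewrite HDFS locations to the target WASB container.
sed -i "s|hdfs://nn1:8020|wasb://<container_name>@<storage_account_name>.blob.core.chinacloudapi.cn|g" "${DB}_ddl.hql"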

Hive metastore migration using DB replication

  • Set up database replication between the on-premises Hive metastore DB and the HDInsight metastore DB.
  • Use the Hive MetaTool to replace the HDFS URLs with WASB/ADLS/ABFS URLs, for example:
./hive --service metatool -updateLocation hdfs://nn1:8020/ wasb://<container_name>@<storage_account_name>.blob.core.chinacloudapi.cn/

Apache Ranger

  • Export the on-premises Ranger policies to XML files.
  • Transform the on-premises-specific HDFS-based paths to WASB/ADLS paths using a tool like XSLT; a simpler search-and-replace sketch follows this list.
  • Import the policies into Ranger running on HDInsight.
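As a simple alternative to a full XSLT transform, a search-and-replace over the exported policy files covers the common case of rewriting the filesystem scheme and namenode address. The file name ranger_policies.xml and the hdfs://nn1:8020 prefix below are placeholders reused from the earlier examples:

sed -i "s|hdfs://nn1:8020|wasb://<container_name>@<storage_account_name>.blob.core.chinacloudapi.cn|g" ranger_policies.xml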

Next steps

Read the next article in this series: