Use Azure Data Factory to migrate data from an on-premises Hadoop cluster to Azure Storage

Applies to: Azure Data Factory, Azure Synapse Analytics (Preview)

Azure Data Factory provides a performant, robust, and cost-effective mechanism for migrating data at scale from on-premises HDFS to Azure Blob storage or Azure Data Lake Storage Gen2.

Data Factory offers two basic approaches for migrating data from on-premises HDFS to Azure. You can select the approach that fits your scenario.

  • Data Factory DistCp mode (recommended): In Data Factory, you can use DistCp (distributed copy) to copy files as-is to Azure Blob storage (including staged copy) or Azure Data Lake Storage Gen2. Use Data Factory integrated with DistCp to take advantage of an existing powerful cluster to achieve the best copy throughput. You also get the benefit of flexible scheduling and a unified monitoring experience from Data Factory. Depending on your Data Factory configuration, the copy activity automatically constructs a DistCp command, submits it to your Hadoop cluster, and then monitors the copy status. We recommend Data Factory DistCp mode for migrating data from an on-premises Hadoop cluster to Azure (a sketch of this configuration follows this list).
  • Data Factory native integration runtime mode: DistCp isn't an option in all scenarios. For example, in an Azure Virtual Network environment, the DistCp tool doesn't support Azure ExpressRoute private peering with an Azure Storage virtual network endpoint. In addition, in some cases you might not want to use your existing Hadoop cluster as the engine for migrating data, so that you don't put a heavy load on the cluster, which might affect the performance of existing ETL jobs. Instead, you can use the native capability of the Data Factory integration runtime as the engine that copies data from on-premises HDFS to Azure.
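
The snippet below is a minimal sketch of a DistCp-mode copy activity, written as a Python dict so it can be serialized to the JSON payload that Data Factory expects. The property names (HdfsSource, distcpSettings, resourceManagerEndpoint, tempScriptPath, distcpOptions) reflect our reading of the HDFS connector and should be verified against the current connector reference; the endpoint, paths, and flag values are placeholders.

```python
# Minimal sketch of a DistCp-mode copy activity, expressed as a Python dict that can be
# serialized to JSON. Property names are assumptions based on the HDFS connector
# (distcpSettings and friends); verify them against the current connector reference.
import json

distcp_copy_activity = {
    "name": "CopyFromHdfsWithDistCp",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "HdfsSource",
            "distcpSettings": {
                # YARN Resource Manager endpoint of the on-premises cluster (placeholder).
                "resourceManagerEndpoint": "http://<resource-manager-host>:8088",
                # HDFS folder where the generated DistCp script is staged (placeholder).
                "tempScriptPath": "/tmp/adf-distcp",
                # Extra DistCp flags passed through as-is, such as mapper count and strategy.
                "distcpOptions": "-m 100 -strategy dynamic"
            }
        },
        "sink": {"type": "BlobSink"}
    }
}

print(json.dumps(distcp_copy_activity, indent=2))
```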

This article provides the following information about both approaches:

  • Performance
  • Copy resilience
  • Network security
  • High-level solution architecture
  • Implementation best practices

Performance

In Data Factory DistCp mode, throughput is the same as if you use the DistCp tool independently. Data Factory DistCp mode maximizes the capacity of your existing Hadoop cluster. You can use DistCp for large inter-cluster or intra-cluster copying.

DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input for task mapping. Each task copies a file partition that's specified in the source list. You can use Data Factory integrated with DistCp to build pipelines that fully utilize your network bandwidth, storage IOPS, and bandwidth to maximize data movement throughput for your environment.

Data Factory native integration runtime mode also allows parallelism at different levels. You can use parallelism to fully utilize your network bandwidth, storage IOPS, and bandwidth to maximize data movement throughput:

  • A single copy activity can take advantage of scalable compute resources. With a self-hosted integration runtime, you can manually scale up the machine or scale out to multiple machines (up to four nodes). A single copy activity partitions its file set across all nodes.
  • A single copy activity reads from and writes to the data store by using multiple threads.
  • Data Factory control flow can start multiple copy activities in parallel, for example by using a ForEach loop (sketched after this list).
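
As an illustration of the control-flow level of parallelism, the following sketch shows a ForEach activity that fans out over a list of HDFS folder partitions and runs an inner copy activity for each one. It's a hedged outline: the ForEach properties (isSequential, batchCount, items) and the copy activity's parallelCopies setting follow our understanding of Data Factory, and the partitionList parameter and activity names are made up for the example.

```python
# Hedged sketch of control-flow parallelism: a ForEach activity that runs one copy
# activity per HDFS folder partition, up to four at a time. Property names follow our
# understanding of Data Factory control flow; partition names, the partitionList
# parameter, and activity names are illustrative only.
foreach_activity = {
    "name": "CopyPartitionsInParallel",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,   # run iterations concurrently
        "batchCount": 4,         # at most four copy activities at once
        # The partition list is assumed to arrive as a pipeline parameter, for example
        # ["/data/2021/01", "/data/2021/02", ...].
        "items": {"value": "@pipeline().parameters.partitionList", "type": "Expression"},
        "activities": [
            {
                "name": "CopyOnePartition",
                "type": "Copy",
                "typeProperties": {
                    "source": {"type": "HdfsSource", "recursive": True},
                    "sink": {"type": "BlobSink"},
                    "parallelCopies": 32   # per-activity parallelism (multiple threads)
                }
            }
        ]
    }
}
```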

For more information, see the copy activity performance guide.

Resilience

In Data Factory DistCp mode, you can use different DistCp command-line parameters (for example, -i to ignore failures, or -update to write data only when the source file and destination file differ in size) for different levels of resilience.
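
As a hedged example of how those flags might be combined, the snippet below builds an options string that ignores per-file failures and rewrites only files whose sizes differ. The flag semantics are standard DistCp; how the string is passed to Data Factory (for example, through the connector's distcpOptions property) is an assumption to verify.

```python
# Hedged example: a DistCp options string that combines resilience-related flags.
#   -i       ignore failures on individual files instead of aborting the whole job
#   -update  copy only files whose size differs between source and destination
# How the string reaches Data Factory (for example, via distcpOptions) is an assumption.
resilience_flags = ["-i", "-update"]
distcp_options = " ".join(resilience_flags)
print(distcp_options)  # -i -update
```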

In Data Factory native integration runtime mode, Data Factory has a built-in retry mechanism within a single copy activity run. It can handle a certain level of transient failures in the data stores or in the underlying network.

When doing binary copying from on-premises HDFS to Blob storage or from on-premises HDFS to Data Lake Storage Gen2, Data Factory automatically performs checkpointing to a large extent. If a copy activity run fails or times out, on a subsequent retry (make sure that the retry count is > 1), the copy resumes from the last failure point instead of starting at the beginning.
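
The following is a hedged sketch of wiring a retry count onto the copy activity so that the resume-from-last-failure behavior can take effect. The policy property names (retry, retryIntervalInSeconds, timeout) reflect our understanding of Data Factory activity policies; the values are illustrative and should be tuned to your environment.

```python
# Hedged sketch: an activity policy with retries enabled so a failed or timed-out copy
# run is retried and can resume from the last failure point. Property names (retry,
# retryIntervalInSeconds, timeout) follow our understanding of activity policies; the
# values are illustrative.
copy_activity_with_retry = {
    "name": "CopyHdfsFolderToBlob",
    "type": "Copy",
    "policy": {
        "retry": 3,                    # > 1, per the guidance above
        "retryIntervalInSeconds": 60,  # wait between retries
        "timeout": "1.00:00:00"        # allow large partitions up to one day
    },
    "typeProperties": {
        "source": {"type": "HdfsSource", "recursive": True},
        "sink": {"type": "BlobSink"}
    }
}
```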

Network security

By default, Data Factory transfers data from on-premises HDFS to Blob storage or Azure Data Lake Storage Gen2 by using an encrypted connection over the HTTPS protocol. HTTPS provides data encryption in transit and prevents eavesdropping and man-in-the-middle attacks.

Alternatively, if you don't want data to be transferred over the public internet, you can get higher security by transferring data over a private peering link via ExpressRoute.

Solution architecture

This image depicts migrating data over the public internet:

Diagram that shows the solution architecture for migrating data over a public network

  • In this architecture, data is transferred securely by using HTTPS over the public internet.
  • We recommend using Data Factory DistCp mode in a public network environment. You can take advantage of a powerful existing cluster to achieve the best copy throughput, and you also get the benefit of flexible scheduling and a unified monitoring experience from Data Factory.
  • For this architecture, you must install the Data Factory self-hosted integration runtime on a Windows machine behind your corporate firewall, which submits the DistCp command to your Hadoop cluster and monitors the copy status. Because this machine isn't the engine that moves the data (it's used for control purposes only), its capacity doesn't affect the throughput of data movement.
  • Existing parameters from the DistCp command are supported (an example of the kind of command that gets constructed follows this list).
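
The snippet below is an illustrative sketch of the kind of DistCp command that ends up being submitted to the Hadoop cluster. Data Factory constructs the real command automatically from the copy activity configuration (including the credential settings for the destination); the paths, storage account name, and flags here are made up for the example.

```python
# Illustrative sketch of the kind of DistCp command submitted to the Hadoop cluster.
# Data Factory constructs the real command (including destination credential settings)
# from the copy activity configuration; paths, account name, and flags are placeholders.
source_path = "hdfs://<namenode-host>:8020/warehouse/sales"
dest_path = "wasbs://landing@<storage-account>.blob.core.windows.net/warehouse/sales"
extra_options = "-m 100 -strategy dynamic -i -update"  # pass-through DistCp flags

distcp_command = f"hadoop distcp {extra_options} {source_path} {dest_path}"
print(distcp_command)
```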

This image depicts migrating data over a private link:

Diagram that shows the solution architecture for migrating data over a private network

  • In this architecture, data is migrated over a private peering link via Azure ExpressRoute. Data never traverses the public internet.
  • The DistCp tool doesn't support ExpressRoute private peering with an Azure Storage virtual network endpoint. We recommend that you use Data Factory's native capability via the integration runtime to migrate the data.
  • For this architecture, you must install the Data Factory self-hosted integration runtime on a Windows VM in your Azure virtual network. You can manually scale up your VM or scale out to multiple VMs to fully utilize your network and storage IOPS or bandwidth.
  • The recommended configuration to start with for each Azure VM (with the Data Factory self-hosted integration runtime installed) is Standard_D32s_v3 with 32 vCPUs and 128 GB of memory. You can monitor the VM's CPU and memory usage during data migration to see whether you need to scale up the VM for better performance or scale it down to reduce cost.
  • You can also scale out by associating up to four VM nodes with a single self-hosted integration runtime. A single copy job running against a self-hosted integration runtime automatically partitions the file set and uses all VM nodes to copy the files in parallel. For high availability, we recommend that you start with two VM nodes to avoid a single point of failure during data migration.
  • When you use this architecture, both initial snapshot data migration and delta data migration are available to you.

Implementation best practices

We recommend that you follow these best practices when you implement your data migration.

Authentication and credential management

Initial snapshot data migration

In Data Factory DistCp mode, you can create one copy activity to submit the DistCp command and use different parameters to control the initial data migration behavior.

In Data Factory native integration runtime mode, we recommend data partitioning, especially when you migrate more than 10 TB of data. To partition the data, use the folder names in HDFS. Each Data Factory copy job can then copy one folder partition at a time. You can run multiple Data Factory copy jobs concurrently for better throughput.

If any of the copy jobs fail because of transient network or data store issues, you can rerun the failed copy job to reload that specific partition from HDFS. Other copy jobs that are loading other partitions aren't affected.
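
The sketch below simulates this partition-and-rerun pattern in plain Python: HDFS folder names act as partitions, each partition is handed to its own copy job, and only partitions whose jobs failed are rerun. The run_copy_job helper is hypothetical and stands in for whatever mechanism triggers a Data Factory copy job in your environment (REST call, SDK, or a scheduled trigger).

```python
# Plain-Python simulation of the partition-and-rerun pattern: one copy job per HDFS
# folder partition, and only failed partitions are retried. run_copy_job is a
# hypothetical stand-in for whatever triggers the actual Data Factory copy job.
from typing import Iterable, List

def run_copy_job(partition_folder: str) -> bool:
    """Hypothetical trigger for one Data Factory copy job; returns True on success."""
    print(f"Copying HDFS folder {partition_folder} ...")
    return True  # placeholder result

def migrate(partitions: Iterable[str]) -> List[str]:
    """Run one copy job per partition and return the partitions that failed."""
    return [p for p in partitions if not run_copy_job(p)]

partitions = [f"/warehouse/part={i:03d}" for i in range(10)]  # illustrative partition folders
failed = migrate(partitions)
if failed:
    # Rerun only the failed partitions; partitions that already succeeded are untouched.
    failed = migrate(failed)
```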

Delta data migration

In Data Factory DistCp mode, you can use the DistCp command-line parameter -update (write data when the source file and destination file differ in size) for delta data migration.

In Data Factory native integration runtime mode, the most performant way to identify new or changed files in HDFS is by using a time-partitioned naming convention. When your data in HDFS has been time-partitioned with time-slice information in the file or folder name (for example, /yyyy/mm/dd/file.csv), your pipeline can easily identify which files and folders to copy incrementally.
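
For the time-partitioned convention, a pipeline can compute the folder to copy for each time slice from a window-start value. The sketch below shows a plain-Python version of that path computation; the commented Data Factory expression is an assumption about how the same thing might look in a dataset parameter and should be verified against your dataset definition.

```python
# Sketch of deriving the incremental copy path from a time slice when HDFS data is laid
# out as /yyyy/mm/dd/file.csv. An equivalent Data Factory dataset expression (assumed,
# verify against your dataset definition) might look like:
#   @{formatDateTime(pipeline().parameters.windowStart, 'yyyy/MM/dd')}
from datetime import datetime

def folder_for_slice(window_start: datetime, root: str = "/data") -> str:
    """Return the HDFS folder that holds the files for one daily time slice."""
    return f"{root}/{window_start:%Y/%m/%d}"

print(folder_for_slice(datetime(2021, 3, 15)))  # /data/2021/03/15
```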

Alternatively, if your data in HDFS isn't time-partitioned, Data Factory can identify new or changed files by their LastModifiedDate value. Data Factory scans all the files in HDFS and copies only new and updated files whose last-modified timestamp is greater than a set value.
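
When you rely on LastModifiedDate instead, the copy source can be constrained to a modification-time window so that only new or updated files are copied. The sketch below expresses that as a Python dict; the modifiedDatetimeStart and modifiedDatetimeEnd property names are assumptions based on Data Factory's file-based copy sources and should be checked against the HDFS connector reference.

```python
# Hedged sketch: constrain the copy source to files modified inside a time window so only
# new or updated files are copied. The modifiedDatetimeStart / modifiedDatetimeEnd property
# names are assumptions based on file-based copy sources; timestamps are illustrative.
incremental_source = {
    "type": "HdfsSource",
    "recursive": True,
    # Copy only files whose last-modified timestamp falls in this window (UTC).
    "modifiedDatetimeStart": "2021-03-15T00:00:00Z",
    "modifiedDatetimeEnd": "2021-03-16T00:00:00Z"
}
```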

If you have a large number of files in HDFS, the initial file scanning might take a long time, regardless of how many files match the filter condition. In this scenario, we recommend that you first partition the data by using the same partitioning you used for the initial snapshot migration. File scanning can then occur in parallel.

Estimate price

Consider the following pipeline for migrating data from HDFS to Azure Blob storage:

Diagram that shows the pricing pipeline

Let's assume the following:

  • The total data volume is 1 PB.
  • You migrate data by using Data Factory native integration runtime mode.
  • The 1 PB is divided into 1,000 partitions, and each copy operation moves one partition.
  • Each copy activity is configured with one self-hosted integration runtime that's associated with four machines and achieves 500-MBps throughput.
  • ForEach concurrency is set to 4, and the aggregate throughput is 2 GBps.
  • In total, it takes 146 hours to complete the migration (a quick arithmetic check follows this list).
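
Here's a quick sanity check of the 146-hour figure, assuming 1 PB is counted in binary units (1,024 TB) and the 2-GBps aggregate throughput is sustained for the whole run:

```python
# Sanity check of the duration estimate: 1 PB (binary units) at a sustained aggregate
# throughput of 2 GB per second.
total_gb = 1024 * 1024       # 1 PB = 1,024 TB = 1,048,576 GB
throughput_gbps = 2          # aggregate GB per second across the four concurrent copies
hours = total_gb / throughput_gbps / 3600
print(f"{hours:.0f} hours")  # ~146 hours
```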

Here's the estimated price based on these assumptions:

Table that shows the pricing calculation

Note

This is a hypothetical pricing example. Your actual price depends on the actual throughput in your environment. The price of the Azure Windows VM (with the self-hosted integration runtime installed) isn't included.

Additional references

Next steps