Use Azure Data Factory to migrate data from Amazon S3 to Azure Storage

APPLIES TO: ✔️ Azure Data Factory ✖️ Azure Synapse Analytics (Preview)

Azure Data Factory provides a performant, robust, and cost-effective mechanism to migrate data at scale from Amazon S3 to Azure Blob Storage or Azure Data Lake Storage Gen2. This article provides the following information for data engineers and developers:

  • Performance
  • Copy resilience
  • Network security
  • High-level solution architecture
  • Implementation best practices

Performance

ADF offers a serverless architecture that allows parallelism at different levels, so developers can build pipelines that fully utilize your network bandwidth as well as storage IOPS and bandwidth to maximize data movement throughput for your environment.

Customers have successfully migrated petabytes of data consisting of hundreds of millions of files from Amazon S3 to Azure Blob Storage, with a sustained throughput of 2 GBps and higher.

(Diagram: ADF data movement parallelism)

The picture above illustrates how you can achieve great data movement speeds through different levels of parallelism:

  • A single copy activity can take advantage of scalable compute resources: when using the Azure integration runtime, you can specify up to 256 DIUs for each copy activity in a serverless manner; when using the self-hosted integration runtime, you can manually scale up the machine or scale out to multiple machines (up to 4 nodes), and a single copy activity will partition its file set across all nodes.
  • A single copy activity reads from and writes to the data store using multiple threads.
  • ADF control flow can start multiple copy activities in parallel, for example by using a ForEach loop (see the sketch after this list).
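
A minimal sketch of how these knobs surface in a copy activity definition, shown here as a Python dict mirroring the ADF JSON authoring view; the dataset names and the specific values are placeholders to replace with your own:

```python
# Copy activity definition (ADF JSON authoring view expressed as a Python dict).
# "S3BinaryDataset" and "BlobBinaryDataset" are hypothetical dataset names.
copy_activity = {
    "name": "CopyPartitionFromS3",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            "storeSettings": {"type": "AmazonS3ReadSettings", "recursive": True},
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": {"type": "AzureBlobStorageWriteSettings"},
        },
        # Scale the serverless compute for this copy activity (up to 256 DIUs).
        "dataIntegrationUnits": 256,
        # Number of parallel read/write threads within this single copy activity.
        "parallelCopies": 32,
    },
    "inputs": [{"referenceName": "S3BinaryDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "BlobBinaryDataset", "type": "DatasetReference"}],
}
```

A ForEach activity that iterates over a list of partitions can launch several such copy activities at once; its batchCount property (shown in the rate-limiting section later in this article) caps how many run in parallel.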

Resilience

Within a single copy activity run, ADF has a built-in retry mechanism, so it can handle a certain level of transient failures in the data stores or in the underlying network.

When doing binary copying from S3 to Blob and from S3 to ADLS Gen2, ADF automatically performs checkpointing. If a copy activity run has failed or timed out, on a subsequent retry the copy resumes from the last failure point instead of starting from the beginning.
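
As a sketch, the retry behavior can be tuned through the activity policy attached to the copy activity; the values below are illustrative, not prescriptions:

```python
# Activity policy on a copy activity (ADF JSON authoring view as a Python dict).
# Tune the retry count, interval, and timeout for your environment.
copy_activity_policy = {
    "policy": {
        "timeout": "0.12:00:00",       # fail the run if it exceeds 12 hours
        "retry": 3,                    # automatic retries on transient failures
        "retryIntervalInSeconds": 60,  # wait between retries
    }
}
```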

Network security

By default, ADF transfers data from Amazon S3 to Azure Blob Storage or Azure Data Lake Storage Gen2 using an encrypted connection over HTTPS. HTTPS provides data encryption in transit and prevents eavesdropping and man-in-the-middle attacks.

Alternatively, if you do not want data to be transferred over the public Internet, you can achieve higher security by transferring data over a private peering link between AWS Direct Connect and Azure ExpressRoute. Refer to the solution architecture below on how this can be achieved.

Solution architecture

Migrate data over public Internet:

(Diagram: solution architecture for migration over the public Internet)

  • In this architecture, data is transferred securely using HTTPS over the public Internet.
  • Both the source Amazon S3 and the destination Azure Blob Storage or Azure Data Lake Storage Gen2 are configured to allow traffic from all network IP addresses. Refer to the second architecture below on how you can restrict network access to a specific IP range.
  • You can easily scale up the amount of horsepower in a serverless manner to fully utilize your network and storage bandwidth so that you get the best throughput for your environment.
  • Both initial snapshot migration and delta data migration can be achieved using this architecture.

Migrate data over private link:

(Diagram: solution architecture for migration over a private peering link)

  • In this architecture, data migration is done over a private peering link between AWS Direct Connect and Azure ExpressRoute, so that data never traverses the public Internet. It requires use of an AWS VPC and an Azure virtual network.
  • You need to install the ADF self-hosted integration runtime on a Windows VM within your Azure virtual network to achieve this architecture. You can manually scale up your self-hosted IR VMs or scale out to multiple VMs (up to 4 nodes) to fully utilize your network and storage IOPS/bandwidth.
  • If it is acceptable to transfer data over HTTPS but you want to lock down network access to the source S3 to a specific IP range, you can adopt a variation of this architecture by removing the AWS VPC and replacing the private link with HTTPS. Keep the Azure virtual network and the self-hosted IR on an Azure VM so that you have a static, publicly routable IP for allow-listing purposes.
  • Both initial snapshot data migration and delta data migration can be achieved using this architecture.

Implementation best practices

Authentication and credential management
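
ADF typically authenticates to Amazon S3 with an IAM access key ID and secret access key; keeping the secret in Azure Key Vault rather than inline in the linked service definition is the safer pattern. Below is a minimal sketch of such an Amazon S3 linked service (ADF JSON authoring view as a Python dict); the Key Vault linked service name and the secret name are hypothetical placeholders:

```python
# Amazon S3 linked service with the secret access key stored in Azure Key Vault
# (ADF JSON authoring view as a Python dict). "MyKeyVaultLS" and the secret
# name are placeholders.
s3_linked_service = {
    "name": "AmazonS3LinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<IAM access key ID>",
            "secretAccessKey": {
                "type": "AzureKeyVaultSecret",
                "store": {"referenceName": "MyKeyVaultLS", "type": "LinkedServiceReference"},
                "secretName": "s3-secret-access-key",
            },
        },
    },
}
```

On the Azure side, granting the data factory's managed identity the Storage Blob Data Contributor role on the destination account avoids storing any Azure Storage key at all.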

Initial snapshot data migration

Data partitioning is recommended, especially when migrating more than 100 TB of data. To partition the data, leverage the 'prefix' setting to filter the folders and files in Amazon S3 by name, and then each ADF copy job can copy one partition at a time. You can run multiple ADF copy jobs concurrently for better throughput.

If any of the copy jobs fail due to a transient network or data store issue, you can rerun the failed copy job to reload that specific partition from AWS S3. All other copy jobs loading other partitions are not impacted.
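
As a sketch, the partition prefix can be passed in as a pipeline parameter and plugged into the copy source's store settings, so that each run (or each ForEach iteration) handles one prefix; the parameter name below is a placeholder:

```python
# Copy source that reads only one S3 prefix per run (ADF JSON view as a Python dict).
# "partitionPrefix" is a hypothetical pipeline parameter, e.g. "2019/07/" or "a/".
partitioned_source = {
    "type": "BinarySource",
    "storeSettings": {
        "type": "AmazonS3ReadSettings",
        "recursive": True,
        # Filter S3 objects by key name; only objects under this prefix are copied.
        "prefix": {"value": "@pipeline().parameters.partitionPrefix", "type": "Expression"},
    },
}
```

If a partition fails, rerunning the pipeline with the same prefix value reloads just that partition.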

Delta data migration

The most performant way to identify new or changed files from AWS S3 is by using a time-partitioned naming convention: when your data in AWS S3 has been time partitioned with time slice information in the file or folder name (for example, /yyyy/mm/dd/file.csv), your pipeline can easily identify which files/folders to copy incrementally.

Alternatively, if your data in AWS S3 is not time partitioned, ADF can identify new or changed files by their LastModifiedDate. ADF scans all the files from AWS S3 and copies only the new and updated files whose last modified timestamp is greater than a certain value. Be aware that if you have a large number of files in S3, the initial file scanning can take a long time regardless of how many files match the filter condition. In this case, we suggest that you partition the data first, using the same 'prefix' setting as for the initial snapshot migration, so that the file scanning can happen in parallel.
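
A minimal sketch of a LastModifiedDate-based delta copy: the copy source filters on a modified-time window passed in as pipeline parameters (the parameter names are placeholders), which a scheduled or tumbling window trigger can supply on each run:

```python
# Copy source that picks up only files changed within a time window
# (ADF JSON authoring view as a Python dict). "windowStart" / "windowEnd"
# are hypothetical pipeline parameters, typically fed by a tumbling window trigger.
delta_source = {
    "type": "BinarySource",
    "storeSettings": {
        "type": "AmazonS3ReadSettings",
        "recursive": True,
        "modifiedDatetimeStart": {"value": "@pipeline().parameters.windowStart", "type": "Expression"},
        "modifiedDatetimeEnd": {"value": "@pipeline().parameters.windowEnd", "type": "Expression"},
    },
}
```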

For scenarios that require self-hosted integration runtime on Azure VM

Whether you are migrating data over a private link or you want to allow a specific IP range on the Amazon S3 firewall, you need to install the self-hosted integration runtime on an Azure Windows VM.

  • The recommended configuration to start with for each Azure VM is Standard_D32s_v3 with 32 vCPUs and 128 GB of memory. You can keep monitoring CPU and memory utilization of the IR VM during the data migration to see whether you need to further scale up the VM for better performance or scale down the VM to save cost.
  • You can also scale out by associating up to 4 VM nodes with a single self-hosted IR. A single copy job running against a self-hosted IR automatically partitions the file set and leverages all VM nodes to copy the files in parallel. For high availability, we recommend starting with 2 VM nodes to avoid a single point of failure during the data migration.

Rate limiting

As a best practice, conduct a performance POC with a representative sample dataset, so that you can determine an appropriate partition size.

Start with a single partition and a single copy activity with the default DIU setting. Gradually increase the DIU setting until you reach the bandwidth limit of your network or the IOPS/bandwidth limit of the data stores, or until you reach the maximum of 256 DIUs allowed on a single copy activity.

Next, gradually increase the number of concurrent copy activities until you reach the limits of your environment.
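
Concurrency is typically controlled with the batchCount property of the ForEach activity that launches the copies; a sketch follows (the partition list parameter and inner activity are placeholders, and the inner copy activity is abbreviated):

```python
# ForEach activity fanning out copy runs across partitions
# (ADF JSON authoring view as a Python dict).
foreach_activity = {
    "name": "CopyAllPartitions",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.partitionList", "type": "Expression"},
        "isSequential": False,
        # Number of copy activities allowed to run at the same time;
        # raise it gradually while watching for throttling errors.
        "batchCount": 2,
        # Abbreviated; see the copy activity sketch earlier in this article.
        "activities": [{"name": "CopyPartitionFromS3", "type": "Copy"}],
    },
}
```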

When you encounter throttling errors reported by ADF copy activity, either reduce the concurrency or DIU setting in ADF, or consider increasing the bandwidth/IOPS limits of the network and data stores.

Estimating price

Note

This is a hypothetical pricing example. Your actual pricing depends on the actual throughput in your environment.

Consider the following pipeline constructed for migrating data from S3 to Azure Blob Storage:

(Diagram: pipeline used for the pricing example)

Let us assume the following:

  • Total data volume is 2 PB
  • Data is migrated over HTTPS using the first solution architecture
  • The 2 PB is divided into 1,000 partitions and each copy moves one partition
  • Each copy is configured with DIU=256 and achieves 1 GBps throughput
  • ForEach concurrency is set to 2 and the aggregate throughput is 2 GBps
  • In total, it takes 292 hours to complete the migration (see the quick check after this list)
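
A quick back-of-the-envelope check that the 292-hour figure follows from the assumptions above:

```python
# Rough duration check: 2 PB moved at an aggregate 2 GBps.
total_gb = 2 * 1024 * 1024      # 2 PB expressed in GB
aggregate_gbps = 2              # 2 copies x 1 GBps each
hours = total_gb / aggregate_gbps / 3600
print(round(hours))             # ~291 hours, i.e. roughly 292 hours of copy time
```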

Here is the estimated price based on the above assumptions:

(Table: estimated cost for the migration scenario above)

Additional references

Template

Here is a template to start with for migrating petabytes of data consisting of hundreds of millions of files from Amazon S3 to Azure Data Lake Storage Gen2.

Next steps