Copy activity performance and scalability guide

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

Whether you want to perform a large-scale data migration from a data lake or an enterprise data warehouse (EDW) to Azure, or you want to ingest data at scale from different sources into Azure for big data analytics, it is critical to achieve optimal performance and scalability. Azure Data Factory provides a performant, resilient, and cost-effective mechanism to ingest data at scale, making it a great fit for data engineers looking to build highly performant and scalable data ingestion pipelines.

After reading this article, you will be able to answer the following questions:

  • What level of performance and scalability can I achieve using the ADF copy activity for data migration and data ingestion scenarios?

  • What steps should I take to tune the performance of the ADF copy activity?

  • What ADF performance optimization knobs can I utilize to optimize performance for a single copy activity run?

  • What other factors outside of ADF should I consider when optimizing copy performance?

Note

If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.

Copy performance and scalability achievable using ADF

ADF offers a serverless architecture that allows parallelism at different levels, which enables developers to build pipelines that fully utilize your network bandwidth, as well as storage IOPS and bandwidth, to maximize data movement throughput for your environment. This means the throughput you can achieve can be estimated by measuring the minimum throughput offered by the source data store, the destination data store, and the network bandwidth between the source and destination. The table below calculates the copy duration based on the data size and the bandwidth limit for your environment.

Data size / bandwidth    50 Mbps      100 Mbps     500 Mbps    1 Gbps      5 Gbps      10 Gbps     50 Gbps
1 GB                     2.7 min      1.4 min      0.3 min     0.1 min     0.03 min    0.01 min    0.0 min
10 GB                    27.3 min     13.7 min     2.7 min     1.3 min     0.3 min     0.1 min     0.03 min
100 GB                   4.6 hrs      2.3 hrs      0.5 hrs     0.2 hrs     0.05 hrs    0.02 hrs    0.0 hrs
1 TB                     46.6 hrs     23.3 hrs     4.7 hrs     2.3 hrs     0.5 hrs     0.2 hrs     0.05 hrs
10 TB                    19.4 days    9.7 days     1.9 days    0.9 days    0.2 days    0.1 days    0.02 days
100 TB                   194.2 days   97.1 days    19.4 days   9.7 days    1.9 days    1 day       0.2 days
1 PB                     64.7 mo      32.4 mo      6.5 mo      3.2 mo      0.6 mo      0.3 mo      0.06 mo
10 PB                    647.3 mo     323.6 mo     64.7 mo     32.4 mo     6.5 mo      3.2 mo      0.6 mo
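
As a sanity check on the table, the arithmetic is straightforward: ideal copy duration is total bits divided by bits per second. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits/s) and a copy that fully saturates the link; the table applies its own rounding, so small differences from this formula are expected.

```python
def copy_duration_hours(data_bytes: float, bandwidth_bps: float) -> float:
    """Ideal copy duration in hours: total bits / bits per second / 3600."""
    return data_bytes * 8 / bandwidth_bps / 3600

# Example: 1 TB over a 100 Mbps link.
print(f"{copy_duration_hours(1e12, 100e6):.1f} hours")  # ~22.2 hours
```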

ADF copy is scalable at different levels:

(Diagram: how ADF copy scales at different levels.)

  • ADF control flow can start multiple copy activities in parallel, for example by using a For Each loop.
  • A single copy activity can take advantage of scalable compute resources: when using the Azure integration runtime, you can specify up to 256 DIUs for each copy activity in a serverless manner; when using the self-hosted integration runtime, you can manually scale up the machine or scale out to multiple machines (up to 4 nodes), and a single copy activity will partition its file set across all nodes.
  • A single copy activity reads from and writes to the data store using multiple threads in parallel.

Performance tuning steps

Take these steps to tune the performance of your Azure Data Factory service with the copy activity.

  1. Pick a test dataset and establish a baseline. During the development phase, test your pipeline by using the copy activity against a representative data sample. The dataset you choose should represent your typical data patterns (folder structure, file pattern, data schema, and so on), and it should be big enough to evaluate copy performance; for example, it should take 10 minutes or longer for the copy activity to complete. Collect execution details and performance characteristics by following copy activity monitoring.

  2. How to maximize the performance of a single copy activity:

    To start with, we recommend that you first maximize performance using a single copy activity.

    • If the copy activity is being executed on an Azure integration runtime: start with the default values for the Data Integration Units (DIU) and parallel copy settings.

    • If the copy activity is being executed on a self-hosted integration runtime: we recommend that you use a dedicated machine, separate from the server hosting the data store, to host the integration runtime. Start with the default value for the parallel copy setting and use a single node for the self-hosted IR.

    Perform a performance test run, and take note of the performance achieved as well as the actual values used, such as DIUs and parallel copies. Refer to copy activity monitoring for how to collect run results and the performance settings used, and learn how to troubleshoot copy activity performance to identify and resolve bottlenecks.

    Iterate to conduct additional performance test runs following the troubleshooting and tuning guidance. Once a single copy activity run cannot achieve better throughput, consider maximizing aggregate throughput by running multiple copies concurrently; see step 3.

  3. How to maximize aggregate throughput by running multiple copies concurrently:

    Now that you have maximized the performance of a single copy activity, if you have not yet reached the throughput upper limits of your environment (network, source data store, and destination data store), you can run multiple copy activities in parallel by using ADF control flow constructs such as the For Each loop; a minimal sketch follows this list. Refer to the Copy files from multiple containers, Migrate data from Amazon S3 to ADLS Gen2, or Bulk copy with a control table solution templates as general examples.

  4. Expand the configuration to your entire dataset. When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire dataset.
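
As referenced in step 3, here is a minimal sketch of the fan-out pattern: a ForEach activity that runs one copy activity per item, with concurrency capped by batchCount. It is expressed as a Python dict mirroring the pipeline-definition JSON; the names (CopyEachContainer, CopyOneContainer, SourceDataset, SinkDataset, containerList) are hypothetical, and the source/sink types depend on your actual stores.

```python
import json

# ForEach fans out one copy activity run per item; isSequential=False plus
# batchCount caps how many copy runs execute concurrently.
for_each_activity = {
    "name": "CopyEachContainer",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,   # run iterations in parallel
        "batchCount": 10,        # max concurrent copy activity runs
        "items": {
            "value": "@pipeline().parameters.containerList",
            "type": "Expression",
        },
        "activities": [
            {
                "name": "CopyOneContainer",
                "type": "Copy",
                "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SinkDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "BlobSink"},
                },
            }
        ],
    },
}

print(json.dumps(for_each_activity, indent=2))
```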

Troubleshoot copy activity performance

Follow the performance tuning steps to plan and conduct a performance test for your scenario, and learn how to troubleshoot the performance of each copy activity run in Azure Data Factory from Troubleshoot copy activity performance.

Copy performance optimization features

Azure Data Factory provides the following performance optimization features:

Data Integration Units

A Data Integration Unit is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Azure Data Factory. Data Integration Units apply only to the Azure integration runtime, not to the self-hosted integration runtime. Learn more.
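
For illustration, here is a minimal sketch of pinning DIUs on a copy activity definition, written as a Python dict mirroring the activity JSON. dataIntegrationUnits is the documented copy activity property; the activity name and source/sink types are placeholders, and omitting the property lets the service pick a value dynamically.

```python
copy_activity = {
    "name": "CopyWithExplicitDIUs",  # hypothetical name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {"type": "BlobSink"},
        "dataIntegrationUnits": 32,  # explicit DIU count (up to 256)
    },
}
```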

Self-hosted integration runtime scalability

To host an increasing concurrent workload, or to achieve higher performance, you can either scale up or scale out the self-hosted integration runtime. Learn more.

Parallel copy

You can set parallel copy to indicate the parallelism that you want the copy activity to use. You can think of this property as the maximum number of threads within the copy activity that read from your source or write to your sink data stores in parallel. Learn more.
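
A minimal sketch under the same assumptions as the DIU example above: parallelCopies is the documented property that caps the copy activity's intra-activity parallelism, and by default the service determines a value based on the source/sink pair.

```python
copy_activity = {
    "name": "CopyWithCappedParallelism",  # hypothetical name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {"type": "BlobSink"},
        "parallelCopies": 16,  # max parallel read/write threads
    },
}
```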

Staged copy

When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Learn more.
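
A minimal sketch of enabling staged copy, again as a Python dict mirroring the activity JSON. enableStaging and stagingSettings are the documented properties; the linked service name and path are hypothetical, and the source/sink types are just one plausible pairing.

```python
copy_activity = {
    "name": "CopyViaBlobStaging",  # hypothetical name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "SqlSource"},
        "sink": {"type": "SqlDWSink"},
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingBlobStorage",  # hypothetical
                "type": "LinkedServiceReference",
            },
            "path": "stagingcontainer/path",  # hypothetical
        },
    },
}
```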

Next steps

See the other copy activity articles: