Copy activity performance and scalability guide

Applies to: Azure Data Factory, Azure Synapse Analytics

Sometimes you want to perform a large-scale data migration from a data lake or enterprise data warehouse (EDW) to Azure. Other times you want to ingest large amounts of data from different sources into Azure for big data analytics. In each case, it is critical to achieve optimal performance and scalability.

Azure Data Factory (ADF) provides a mechanism to ingest data. ADF has the following advantages:

  • Handles large amounts of data
  • Is highly performant
  • Is cost-effective

These advantages make ADF an excellent fit for data engineers who want to build scalable, high-performance data ingestion pipelines.

After reading this article, you will be able to answer the following questions:

  • What level of performance and scalability can I achieve using ADF copy activity for data migration and data ingestion scenarios?
  • What steps should I take to tune the performance of ADF copy activity?
  • Which ADF performance optimization settings can I use to optimize a single copy activity run?
  • What other factors outside ADF should I consider when optimizing copy performance?

Note

If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.

Copy performance and scalability achievable using ADF

ADF offers a serverless architecture that allows parallelism at different levels.

This architecture allows you to develop pipelines that maximize data movement throughput for your environment. These pipelines fully utilize the following resources:

  • Network bandwidth between the source and destination data stores
  • Source or destination data store input/output operations per second (IOPS) and bandwidth

This full utilization means you can estimate the overall throughput by taking the minimum throughput available across the following resources:

  • Source data store
  • Destination data store
  • Network bandwidth between the source and destination data stores

The table below calculates the copy duration. The duration is based on the data size and the network/data store bandwidth limit for your environment.

 

| Data size / bandwidth | 50 Mbps | 100 Mbps | 500 Mbps | 1 Gbps | 5 Gbps | 10 Gbps | 50 Gbps |
|---|---|---|---|---|---|---|---|
| 1 GB | 2.7 min | 1.4 min | 0.3 min | 0.1 min | 0.03 min | 0.01 min | 0.0 min |
| 10 GB | 27.3 min | 13.7 min | 2.7 min | 1.3 min | 0.3 min | 0.1 min | 0.03 min |
| 100 GB | 4.6 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs | 0.02 hrs | 0.0 hrs |
| 1 TB | 46.6 hrs | 23.3 hrs | 4.7 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs |
| 10 TB | 19.4 days | 9.7 days | 1.9 days | 0.9 days | 0.2 days | 0.1 days | 0.02 days |
| 100 TB | 194.2 days | 97.1 days | 19.4 days | 9.7 days | 1.9 days | 1 day | 0.2 days |
| 1 PB | 64.7 mo | 32.4 mo | 6.5 mo | 3.2 mo | 0.6 mo | 0.3 mo | 0.06 mo |
| 10 PB | 647.3 mo | 323.6 mo | 64.7 mo | 31.6 mo | 6.5 mo | 3.2 mo | 0.6 mo |
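
The estimate described above (overall throughput bounded by the slowest of the source store, the destination store, and the network) can be sketched in a few lines of Python. This is a minimal illustration rather than an ADF API: the function name and parameters are hypothetical, and it assumes 1 GB = 1024 MB and 8 bits per byte.

```python
def estimate_copy_minutes(data_size_gb, source_mbps, network_mbps, sink_mbps):
    """Estimate copy duration in minutes from the slowest of three resources."""
    # The overall throughput is bounded by the slowest resource on the path.
    effective_mbps = min(source_mbps, network_mbps, sink_mbps)
    if effective_mbps <= 0:
        raise ValueError("throughput values must be positive")
    # Assumption: 1 GB = 1024 MB, and 1 byte = 8 bits.
    size_megabits = data_size_gb * 1024 * 8
    seconds = size_megabits / effective_mbps
    return seconds / 60


# Hypothetical environment: the 50 Mbps network link is the bottleneck.
minutes = estimate_copy_minutes(1, source_mbps=500, network_mbps=50, sink_mbps=1000)
print(f"{minutes:.1f} min")  # prints "2.7 min"
```

With the network link as the bottleneck, the result corresponds to the 1 GB at 50 Mbps cell of the table.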

ADF copy is scalable at different levels:

[Figure: How ADF copy scales]

  • ADF control flow can start multiple copy activities in parallel, for example by using a For Each loop.

  • A single copy activity can take advantage of scalable compute resources.

    • When using Azure integration runtime (IR), you can specify up to 256 data integration units (DIUs) for each copy activity, in a serverless manner.
    • When using self-hosted IR, you can take either of the following approaches:
      • Manually scale up the machine.
      • Scale out to multiple machines (up to 4 nodes), and a single copy activity will partition its file set across all nodes.
  • A single copy activity reads from and writes to the data store using multiple threads in parallel.
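
As an illustration of the first level, the sketch below builds a pipeline fragment in which a For Each activity runs one copy activity per item in parallel. It is written as a Python dictionary that serializes to pipeline JSON. The activity names and the `partitions` pipeline parameter are hypothetical, the copy activity body is omitted, and the `isSequential` and `batchCount` settings should be checked against the ForEach activity reference before use.

```python
import json


def build_parallel_copy_loop(copy_activity, items_expression, batch_count):
    """Wrap one copy activity in a ForEach activity that runs iterations in parallel."""
    return {
        "name": "CopyPartitionsInParallel",  # hypothetical activity name
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,      # run iterations concurrently
            "batchCount": batch_count,  # cap on concurrent iterations
            "items": {"value": items_expression, "type": "Expression"},
            "activities": [copy_activity],
        },
    }


# Placeholder copy activity; a real one also needs inputs, outputs, source, and sink.
copy_activity = {"name": "CopyOnePartition", "type": "Copy"}
loop = build_parallel_copy_loop(
    copy_activity, "@pipeline().parameters.partitions", batch_count=4
)
print(json.dumps(loop, indent=2))
```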

Performance tuning steps

Take the following steps to tune the performance of your Azure Data Factory service with the copy activity:

  1. Pick a test dataset and establish a baseline.

    During development, test your pipeline by using the copy activity against a representative data sample. The dataset you choose should represent your typical data patterns along the following attributes:

    • Folder structure
    • File pattern
    • Data schema

    Your dataset should also be big enough to evaluate copy performance. A good size is one that takes the copy activity at least 10 minutes to complete. Collect execution details and performance characteristics by following copy activity monitoring.

  2. How to maximize performance of a single copy activity:

    We recommend that you first maximize performance using a single copy activity.

    • If the copy activity is being executed on an Azure integration runtime:

      Start with default values for Data Integration Units (DIU) and parallel copy settings.

    • If the copy activity is being executed on a self-hosted integration runtime:

      We recommend that you use a dedicated machine to host the IR. The machine should be separate from the server hosting the data store. Start with default values for the parallel copy setting and use a single node for the self-hosted IR.

    Conduct a performance test run and note the performance achieved. Include the actual values used, such as DIUs and parallel copies. Refer to copy activity monitoring for how to collect run results and the performance settings used. Learn how to troubleshoot copy activity performance to identify and resolve the bottleneck.

    Iterate with additional performance test runs, following the troubleshooting and tuning guidance. Once single copy activity runs cannot achieve better throughput, consider whether to maximize aggregate throughput by running multiple copies concurrently. This option is discussed in the next step.

  3. How to maximize aggregate throughput by running multiple copies concurrently:

    By now you have maximized the performance of a single copy activity. If you have not yet reached the throughput upper limits of your environment, you can run multiple copy activities in parallel by using ADF control flow constructs. One such construct is the For Each loop. For more information, see the articles about solution templates.

  4. Expand the configuration to your entire dataset.

    When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire dataset.
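
One way to decide when step 2 has run its course is to log each test run and check whether the latest run still improves meaningfully on the best earlier one. The sketch below assumes a 5% improvement threshold, which is an arbitrary choice and not an ADF recommendation. The function name and the sample throughput figures are invented for illustration only.

```python
def has_plateaued(throughputs_mbps, min_gain=0.05):
    """Return True when the latest run beats the best earlier run by less than min_gain."""
    if len(throughputs_mbps) < 2:
        return False  # a single baseline run gives nothing to compare against
    best_before = max(throughputs_mbps[:-1])
    return throughputs_mbps[-1] < best_before * (1 + min_gain)


# Invented throughput figures (Mbps) from three successive tuning runs.
runs = [120.0, 180.0, 185.0]
if has_plateaued(runs):
    print("Single-copy tuning has plateaued; consider running copies in parallel.")
else:
    print("Keep tuning the single copy activity.")
```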

Troubleshoot copy activity performance

Follow the performance tuning steps to plan and conduct performance tests for your scenario. To learn how to troubleshoot the performance of an individual copy activity run in Azure Data Factory, see Troubleshoot copy activity performance.

Copy performance optimization features

Azure Data Factory provides the following performance optimization features:

Data Integration Units

A Data Integration Unit (DIU) is a measure that represents the power of a single unit in Azure Data Factory. Power is a combination of CPU, memory, and network resource allocation. DIUs apply only to the Azure integration runtime. They do not apply to the self-hosted integration runtime.
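
As a rough sketch of how a DIU value is attached to a copy activity, the helper below sets the `dataIntegrationUnits` property under `typeProperties` in a pipeline definition held as a Python dictionary. The helper name and the activity name are hypothetical. The upper bound of 256 is the figure given earlier in this article, and the helper assumes any lower bound is enforced by the service.

```python
MAX_DIUS = 256  # upper bound stated in this article for the Azure integration runtime


def with_dius(copy_activity, dius):
    """Return a copy of the activity definition with dataIntegrationUnits set."""
    if not 1 <= dius <= MAX_DIUS:
        raise ValueError(f"dius must be between 1 and {MAX_DIUS}")
    type_properties = dict(copy_activity.get("typeProperties", {}))
    type_properties["dataIntegrationUnits"] = dius
    return {**copy_activity, "typeProperties": type_properties}


base_activity = {"name": "CopyFromLake", "type": "Copy"}  # hypothetical activity
tuned_activity = with_dius(base_activity, 32)
print(tuned_activity["typeProperties"])
```

The helper returns a new dictionary rather than mutating its input, so the baseline definition stays available for comparison between test runs.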

Self-hosted integration runtime scalability

You might want to host an increasing concurrent workload, or you might want to achieve higher performance at your present workload level. You can enhance the scale of processing with the following approaches:

  • You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node.
    Scaling up works only if the node's processor and memory are less than fully utilized.
  • You can scale out the self-hosted IR by adding more nodes (machines).


Parallel copy

You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity. The threads operate in parallel, and each thread either reads from your source or writes to your sink data store.
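
For illustration, the sketch below shows where `parallelCopies` sits in a copy activity definition, written as a Python dictionary and serialized to JSON. The activity name, dataset names, and the source and sink types are placeholders, and the value 8 is an arbitrary example rather than a recommendation.

```python
import json

# Illustrative copy activity definition; names and types are placeholders.
copy_activity = {
    "name": "CopyWithParallelism",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SinkDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "BinarySource"},
        "sink": {"type": "BinarySink"},
        # Upper bound on the number of parallel reader/writer threads.
        "parallelCopies": 8,
    },
}

print(json.dumps(copy_activity, indent=2))
```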

Staged copy

A data copy operation can send the data directly to the sink data store. Alternatively, you can choose to use Blob storage as an interim staging store.
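
A staged copy is switched on in the copy activity definition. The sketch below, again using a Python dictionary in place of pipeline JSON, sets `enableStaging` and a `stagingSettings` block that points at a Blob storage linked service. The helper name, the linked service name `StagingBlobStorage`, and the path are hypothetical, and the exact property set should be confirmed against the copy activity reference.

```python
def enable_staging(copy_activity, staging_linked_service, staging_path):
    """Return a copy of the activity definition with interim Blob staging enabled."""
    type_properties = dict(copy_activity.get("typeProperties", {}))
    type_properties["enableStaging"] = True
    type_properties["stagingSettings"] = {
        "linkedServiceName": {
            "referenceName": staging_linked_service,
            "type": "LinkedServiceReference",
        },
        "path": staging_path,  # container/folder used for interim data
    }
    return {**copy_activity, "typeProperties": type_properties}


direct_copy = {"name": "CopyToWarehouse", "type": "Copy"}  # hypothetical activity
staged_copy = enable_staging(direct_copy, "StagingBlobStorage", "staging/interim")
print(staged_copy["typeProperties"]["stagingSettings"]["path"])  # prints "staging/interim"
```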

Next steps

See the other copy activity articles: