Copy activity performance and scalability guide
Azure Data Factory
Azure Synapse Analytics
Sometimes you want to perform a large-scale data migration from a data lake or an enterprise data warehouse (EDW) to Azure. Other times you want to ingest large amounts of data from different sources into Azure for big data analytics. In each case, it is critical to achieve optimal performance and scalability.
Azure Data Factory (ADF) provides a mechanism to ingest data. ADF has the following advantages:
- Handles large amounts of data
- Is highly performant
- Is cost-effective
These advantages make ADF an excellent fit for data engineers who want to build scalable, high-performance data ingestion pipelines.
After reading this article, you will be able to answer the following questions:
- What level of performance and scalability can I achieve using the ADF copy activity for data migration and data ingestion scenarios?
- What steps should I take to tune the performance of the ADF copy activity?
- What ADF performance optimization settings can I use to optimize a single copy activity run?
- What other factors outside ADF should I consider when optimizing copy performance?
Note
If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.
Copy performance and scalability achievable using ADF
ADF offers a serverless architecture that allows parallelism at different levels.
This architecture allows you to develop pipelines that maximize data movement throughput for your environment. These pipelines fully utilize the following resources:
- Network bandwidth
- Storage input/output operations per second (IOPS) and bandwidth
This full utilization means you can estimate the overall throughput as the minimum of the throughput available from the following resources:
- Source data store
- Destination data store
- Network bandwidth between the source and destination data stores
The table below calculates the copy duration based on the data size and the bandwidth limit for your environment.
Data size / bandwidth | 50 Mbps | 100 Mbps | 500 Mbps | 1 Gbps | 5 Gbps | 10 Gbps | 50 Gbps |
---|---|---|---|---|---|---|---|
1 GB | 2.7 min | 1.4 min | 0.3 min | 0.1 min | 0.03 min | 0.01 min | 0.0 min |
10 GB | 27.3 min | 13.7 min | 2.7 min | 1.3 min | 0.3 min | 0.1 min | 0.03 min |
100 GB | 4.6 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs | 0.02 hrs | 0.0 hrs |
1 TB | 46.6 hrs | 23.3 hrs | 4.7 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs |
10 TB | 19.4 days | 9.7 days | 1.9 days | 0.9 days | 0.2 days | 0.1 days | 0.02 days |
100 TB | 194.2 days | 97.1 days | 19.4 days | 9.7 days | 1.9 days | 1 day | 0.2 days |
1 PB | 64.7 mo | 32.4 mo | 6.5 mo | 3.2 mo | 0.6 mo | 0.3 mo | 0.06 mo |
10 PB | 647.3 mo | 323.6 mo | 64.7 mo | 31.6 mo | 6.5 mo | 3.2 mo | 0.6 mo |
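The estimate described above (size divided by the slowest of source, destination, and network throughput) can be sketched in a few lines of Python. This is an illustrative helper, not part of ADF: the function name and the constants are assumptions, and it assumes binary prefixes throughout (1 GB = 1024 MB, 1 Gbps = 1024 Mbps), which appears to match the rounding of the GB and TB rows of the table.

```python
# Assumed unit conventions (binary prefixes), expressed in MB and Mbps.
GB = 1024          # 1 GB in MB
TB = 1024 * GB     # 1 TB in MB
GBPS = 1024        # 1 Gbps in Mbps


def estimate_copy_seconds(size_mb: float, *throughputs_mbps: float) -> float:
    """Estimate copy duration from the data size and the slowest resource.

    Pass the throughput of every resource involved (source store,
    destination store, network); the minimum is treated as the bottleneck.
    """
    if not throughputs_mbps:
        raise ValueError("provide at least one throughput value")
    bottleneck_mbps = min(throughputs_mbps)
    if bottleneck_mbps <= 0:
        raise ValueError("throughput must be positive")
    # 8 bits per byte: megabytes -> megabits, then divide by megabits/second.
    return size_mb * 8 / bottleneck_mbps


# 1 GB over a single 50 Mbps limit, in minutes.
print(round(estimate_copy_seconds(1 * GB, 50) / 60, 1))  # 2.7

# 1 TB where the source allows 5 Gbps, the sink 1 Gbps, the network 500 Mbps.
print(round(estimate_copy_seconds(1 * TB, 5 * GBPS, 1 * GBPS, 500) / 3600, 1))  # 4.7
```

In the second call the 500 Mbps network link is the bottleneck, so the result matches the 1 TB / 500 Mbps cell of the table.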
ADF copy is scalable at different levels:
- ADF control flow can start multiple copy activities in parallel, for example by using a For Each loop.
- A single copy activity can take advantage of scalable compute resources.
  - When using the Azure integration runtime (IR), you can specify up to 256 data integration units (DIUs) for each copy activity, in a serverless manner.
  - When using a self-hosted IR, you can take either of the following approaches:
    - Manually scale up the machine.
    - Scale out to multiple machines (up to 4 nodes); a single copy activity then partitions its file set across all nodes.
- A single copy activity reads from and writes to the data store using multiple threads in parallel.
Performance tuning steps
Take the following steps to tune the performance of your Azure Data Factory service with the copy activity:
Pick a test dataset and establish a baseline.
During development, test your pipeline by using the copy activity against a representative data sample. The dataset you choose should represent your typical data patterns along the following attributes:
- Folder structure
- File pattern
- Data schema
Your dataset should also be big enough to evaluate copy performance. A good size is one that takes the copy activity at least 10 minutes to complete. Collect execution details and performance characteristics by following copy activity monitoring.
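One way to keep track of a baseline is to record each test run and derive its throughput. The sketch below is a hypothetical bookkeeping helper, not an ADF API: the class name, field names, and sample numbers are invented for illustration, and the values would come from whatever the copy activity monitoring view reports for your run.

```python
from dataclasses import dataclass

MIN_BASELINE_SECONDS = 10 * 60  # the article suggests runs of at least 10 minutes


@dataclass
class CopyRun:
    """Hypothetical record of one copy activity test run."""
    label: str
    data_read_mb: float
    duration_seconds: float

    @property
    def throughput_mb_per_s(self) -> float:
        # Average throughput over the whole run.
        return self.data_read_mb / self.duration_seconds

    @property
    def long_enough(self) -> bool:
        # Short runs are dominated by startup overhead and make poor baselines.
        return self.duration_seconds >= MIN_BASELINE_SECONDS


baseline = CopyRun(label="baseline", data_read_mb=60_000, duration_seconds=900)
print(f"{baseline.label}: {baseline.throughput_mb_per_s:.2f} MB/s, "
      f"usable as baseline: {baseline.long_enough}")
```

Later runs with different settings can be stored in the same shape and compared against the baseline throughput.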
How to maximize performance of a single copy activity:
We recommend that you first maximize performance using a single copy activity.
If the copy activity is being executed on an Azure integration runtime:
Start with default values for the Data Integration Units (DIU) and parallel copy settings.
If the copy activity is being executed on a self-hosted integration runtime:
We recommend that you use a dedicated machine to host the IR. The machine should be separate from the server hosting the data store. Start with default values for the parallel copy setting and use a single node for the self-hosted IR.
Conduct a performance test run. Take note of the performance achieved, including the actual values used, such as DIUs and parallel copies. Refer to copy activity monitoring for how to collect run results and the performance settings used. Learn how to troubleshoot copy activity performance to identify and resolve the bottleneck.
Iterate to conduct additional performance test runs following the troubleshooting and tuning guidance. Once single copy activity runs cannot achieve better throughput, consider whether to maximize aggregate throughput by running multiple copies concurrently. This option is discussed in the next step.
How to maximize aggregate throughput by running multiple copies concurrently:
By now you have maximized the performance of a single copy activity. If you have not yet reached the throughput upper limits of your environment, you can run multiple copy activities in parallel by using ADF control flow constructs. One such construct is the For Each loop. For more information, see the articles about ADF solution templates.
Expand the configuration to your entire dataset.
When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire dataset.
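As a rough illustration of the parallel-run step above, the following Python sketch builds the JSON definition of a For Each activity that wraps one copy activity. It assumes the usual pipeline JSON shape (`isSequential`, `batchCount`, `items`, `activities` under `typeProperties`); the activity names, the `partitions` pipeline parameter, and the source and sink types are placeholders invented for this example, and the dataset references a real copy activity needs are omitted.

```python
import json


def foreach_copy(items_expression: str, copy_activity: dict, batch_count: int) -> dict:
    """Wrap one copy activity in a For Each that runs iterations concurrently."""
    return {
        "name": "CopyPartitionsInParallel",      # placeholder name
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,               # run iterations in parallel
            "batchCount": batch_count,           # upper bound on concurrent iterations
            "items": {"value": items_expression, "type": "Expression"},
            "activities": [copy_activity],
        },
    }


# Placeholder inner copy activity; inputs and outputs are left out for brevity.
copy_one_partition = {
    "name": "CopyOnePartition",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "BinarySource"},
        "sink": {"type": "BinarySink"},
    },
}

activity = foreach_copy("@pipeline().parameters.partitions", copy_one_partition, batch_count=4)
print(json.dumps(activity, indent=2))
```

Raising `batch_count` step by step while watching aggregate throughput mirrors the iterative approach recommended for a single copy activity.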
Troubleshoot copy activity performance
Follow the Performance tuning steps to plan and conduct performance tests for your scenario. To learn how to troubleshoot the performance of each copy activity run in Azure Data Factory, see Troubleshoot copy activity performance.
Copy performance optimization features
Azure Data Factory provides the following performance optimization features:
- Data Integration Units
- Self-hosted integration runtime scalability
- Parallel copy
- Staged copy
Data Integration Units
A Data Integration Unit (DIU) is a measure that represents the power of a single unit in Azure Data Factory. Power is a combination of CPU, memory, and network resource allocation. DIUs apply only to the Azure integration runtime; they do not apply to the self-hosted integration runtime. For more information, see Copy activity performance optimization features: Data Integration Units.
Self-hosted integration runtime scalability
You might want to host an increasing concurrent workload, or you might want to achieve higher performance at your present workload level. You can enhance the scale of processing with the following approaches:
- You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node. Scaling up works only if the node's processor and memory are less than fully utilized.
- You can scale out the self-hosted IR by adding more nodes (machines).
For more information, see:
- Copy activity performance optimization features: Self-hosted integration runtime scalability
- Create and configure a self-hosted integration runtime: Scale considerations
Parallel copy
You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity. The threads operate in parallel, and each thread either reads from your source or writes to your sink data store. For more information, see Copy activity performance optimization features: Parallel copy.
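To show where these settings live, here is a minimal Python sketch that builds the `typeProperties` fragment of a copy activity with `parallelCopies` and, for the Azure integration runtime only, `dataIntegrationUnits`. The helper name and the source and sink types are assumptions for illustration; the only limit enforced is the 256 DIU ceiling mentioned in this article. Leaving a setting as `None` omits it, so the service default applies.

```python
import json
from typing import Optional

MAX_DIU = 256  # upper limit stated in this article for the Azure integration runtime


def copy_type_properties(source_type: str, sink_type: str,
                         parallel_copies: Optional[int] = None,
                         diu: Optional[int] = None) -> dict:
    """Build a copy activity typeProperties fragment with optional tuning settings."""
    props = {"source": {"type": source_type}, "sink": {"type": sink_type}}
    if parallel_copies is not None:
        if parallel_copies < 1:
            raise ValueError("parallel_copies must be at least 1")
        props["parallelCopies"] = parallel_copies
    if diu is not None:
        if not 0 < diu <= MAX_DIU:
            raise ValueError(f"diu must be between 1 and {MAX_DIU}")
        # Only meaningful on the Azure IR; DIUs do not apply to a self-hosted IR.
        props["dataIntegrationUnits"] = diu
    return props


# First run: defaults only, as the tuning steps recommend.
print(json.dumps(copy_type_properties("BinarySource", "BinarySink"), indent=2))

# Later run: explicit values chosen after reviewing the baseline.
print(json.dumps(copy_type_properties("BinarySource", "BinarySink",
                                      parallel_copies=8, diu=32), indent=2))
```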
Staged copy
A data copy operation can send the data directly to the sink data store. Alternatively, you can choose to use Blob storage as an interim staging store. For more information, see Copy activity performance optimization features: Staged copy.
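The difference between a direct and a staged copy can be sketched as a small change to the copy activity's `typeProperties`. The Python helper below assumes the `enableStaging` and `stagingSettings` properties of the copy activity JSON; the helper name, the linked service name `StagingBlobStorage`, and the staging path are hypothetical values chosen for this example.

```python
import json


def with_staging(type_properties: dict, staging_linked_service: str,
                 staging_path: str) -> dict:
    """Return a copy of the typeProperties fragment with staged copy enabled."""
    staged = dict(type_properties)  # shallow copy; the direct version stays unchanged
    staged["enableStaging"] = True
    staged["stagingSettings"] = {
        "linkedServiceName": {
            "referenceName": staging_linked_service,  # Blob storage linked service
            "type": "LinkedServiceReference",
        },
        "path": staging_path,
    }
    return staged


# Direct copy: data goes straight from source to sink.
direct = {"source": {"type": "BinarySource"}, "sink": {"type": "BinarySink"}}

# Staged copy: data lands in interim Blob storage first (names are hypothetical).
staged = with_staging(direct, "StagingBlobStorage", "staging-container/tmp")
print(json.dumps(staged, indent=2))
```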
Next steps
See the other copy activity articles:
- Copy activity overview
- Troubleshoot copy activity performance
- Copy activity performance optimization features
- Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure
- Migrate data from Amazon S3 to Azure Storage