Copy activity performance and scalability guide

Whether you want to perform a large-scale data migration from a data lake or an enterprise data warehouse (EDW) to Azure, or you want to ingest data at scale from different sources into Azure for big data analytics, it is critical to achieve optimal performance and scalability. Azure Data Factory provides a performant, resilient, and cost-effective mechanism to ingest data at scale, making it a great fit for data engineers looking to build highly performant and scalable data ingestion pipelines.

After reading this article, you will be able to answer the following questions:

  • What level of performance and scalability can I achieve using the ADF copy activity for data migration and data ingestion scenarios?

  • What steps should I take to tune the performance of the ADF copy activity?

  • What ADF performance optimization knobs can I utilize to optimize performance for a single copy activity run?

  • What other factors outside ADF should I consider when optimizing copy performance?

Note

If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.

Copy performance and scalability achievable using ADF

ADF offers a serverless architecture that allows parallelism at different levels, so developers can build pipelines that fully utilize your network bandwidth as well as storage IOPS and bandwidth to maximize data movement throughput for your environment. This means the throughput you can achieve can be estimated by measuring the minimum throughput offered by the source data store, the destination data store, and the network bandwidth in between. The table below calculates the copy duration based on data size and the bandwidth limit for your environment.

Data size / bandwidth | 50 Mbps | 100 Mbps | 500 Mbps | 1 Gbps | 5 Gbps | 10 Gbps | 50 Gbps
--------------------- | ------- | -------- | -------- | ------ | ------ | ------- | -------
1 GB | 2.7 min | 1.4 min | 0.3 min | 0.1 min | 0.03 min | 0.01 min | 0.0 min
10 GB | 27.3 min | 13.7 min | 2.7 min | 1.3 min | 0.3 min | 0.1 min | 0.03 min
100 GB | 4.6 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs | 0.02 hrs | 0.0 hrs
1 TB | 46.6 hrs | 23.3 hrs | 4.7 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs
10 TB | 19.4 days | 9.7 days | 1.9 days | 0.9 days | 0.2 days | 0.1 days | 0.02 days
100 TB | 194.2 days | 97.1 days | 19.4 days | 9.7 days | 1.9 days | 1 day | 0.2 days
1 PB | 64.7 mo | 32.4 mo | 6.5 mo | 3.2 mo | 0.6 mo | 0.3 mo | 0.06 mo
10 PB | 647.3 mo | 323.6 mo | 64.7 mo | 31.6 mo | 6.5 mo | 3.2 mo | 0.6 mo
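
As a rough check on how these figures are derived: the achievable throughput is roughly the minimum of what the source data store, the network in between, and the destination data store can sustain, and the copy duration is then the data volume divided by that throughput. The worked example below assumes the table uses binary units (1 TB = 2^40 bytes, 1 Mbps = 2^20 bit/s); with decimal units the result comes out slightly lower.

    copy duration ≈ data size ÷ effective bandwidth
    1 TB over 100 Mbps ≈ (2^40 bytes × 8 bit/byte) ÷ (100 × 2^20 bit/s)
                       ≈ 83,886 s ≈ 23.3 hours, matching the 1 TB / 100 Mbps cell above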

ADF copy is scalable at different levels:

Diagram: how ADF copy scales.

  • ADF control flow can start multiple copy activities in parallel, for example using a For Each loop.
  • A single copy activity can take advantage of scalable compute resources: when using Azure Integration Runtime, you can specify up to 256 DIUs for each copy activity in a serverless manner; when using self-hosted Integration Runtime, you can manually scale up the machine or scale out to multiple machines (up to 4 nodes), and a single copy activity will partition its file set across all nodes.
  • A single copy activity reads from and writes to the data store using multiple threads in parallel.

Performance tuning steps

Take these steps to tune the performance of your Azure Data Factory service with the copy activity.

  1. Pick a test dataset and establish a baseline. During the development phase, test your pipeline by using the copy activity against a representative data sample. The dataset you choose should represent your typical data patterns (folder structure, file pattern, data schema, and so on) and should be big enough to evaluate copy performance; for example, the copy activity should take 10 minutes or longer to complete. Collect execution details and performance characteristics following copy activity monitoring.

  2. How to maximize performance of a single copy activity:

    To start with, we recommend that you first maximize performance using a single copy activity.

    If the copy activity is being executed on an Azure Integration Runtime:

    Start with default values for the Data Integration Units (DIU) and parallel copy settings. Perform a performance test run, and take note of the performance achieved as well as the actual values used for DIUs and parallel copies. Refer to copy activity monitoring on how to collect run results and the performance settings used.

    Now conduct additional performance test runs, each time doubling the value of the DIU setting. Alternatively, if you think the performance achieved using the default setting is far below your expectation, you can increase the DIU setting more drastically in the subsequent test run.

    Copy activity should scale almost perfectly linearly as you increase the DIU setting. If doubling the DIU setting does not double the throughput, two things could be happening:

    • The specific copy pattern you are running does not benefit from adding more DIUs. Even though you specified a larger DIU value, the actual DIU used remained the same, and therefore you are getting the same throughput as before. If this is the case, maximize aggregate throughput by running multiple copies concurrently, as described in step 3.
    • By adding more DIUs (more horsepower) and thereby driving a higher rate of data extraction, transfer, and loading, either the source data store, the network in between, or the destination data store has reached its bottleneck and is possibly being throttled. If this is the case, try contacting your data store administrator or your network administrator to raise the upper limit, or alternatively, reduce the DIU setting until throttling stops occurring.

    If the copy activity is being executed on a self-hosted Integration Runtime:

    We recommend that you use a dedicated machine, separate from the server hosting the data store, to host the integration runtime.

    Start with default values for the parallel copy setting and a single node for the self-hosted IR. Perform a performance test run and take note of the performance achieved.

    If you would like to achieve higher throughput, you can either scale up or scale out the self-hosted IR:

    • If the CPU and available memory on the self-hosted IR node are not fully utilized, but the execution of concurrent jobs is reaching the limit, you should scale up by increasing the number of concurrent jobs that can run on a node. See here for instructions.
    • If, on the other hand, CPU utilization is high on the self-hosted IR node or available memory is low, you can add a new node to help scale out the load across multiple nodes. See here for instructions.

    As you scale up or scale out the capacity of the self-hosted IR, repeat the performance test run to see whether throughput keeps improving. If throughput stops improving, most likely either the source data store, the network in between, or the destination data store has reached its bottleneck and is starting to get throttled. If this is the case, try contacting your data store administrator or your network administrator to raise the upper limit, or alternatively, go back to your previous scaling setting for the self-hosted IR.

  3. How to maximize aggregate throughput by running multiple copies concurrently:

    Now that you have maximized the performance of a single copy activity, if you have not yet reached the throughput upper limits of your environment (network, source data store, and destination data store), you can run multiple copy activities in parallel using ADF control flow constructs such as the For Each loop, as in the sketch below.
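
    A minimal sketch of this pattern follows, assuming a hypothetical pipeline parameter named tableList that holds the items to iterate over; the activity names, source and sink types, and batchCount value are illustrative only. Setting isSequential to false lets the inner copy activities run in parallel, and batchCount caps how many run at the same time so that the aggregate load stays within the limits of your source, network, and sink.

    "activities":[
        {
            "name": "CopyInParallel",
            "type": "ForEach",
            "typeProperties": {
                "items": {
                    "value": "@pipeline().parameters.tableList",
                    "type": "Expression"
                },
                "isSequential": false,
                "batchCount": 10,
                "activities": [
                    {
                        "name": "CopyOneItem",
                        "type": "Copy",
                        "inputs": [...],
                        "outputs": [...],
                        "typeProperties": {
                            "source": { "type": "SqlSource" },
                            "sink": { "type": "BlobSink" }
                        }
                    }
                ]
            }
        }
    ]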

  4. Performance tuning tips and optimization features. In some cases, when you run a copy activity in Azure Data Factory, you see a "Performance tuning tips" message on top of the copy activity monitoring page, as shown in the following example. The message tells you the bottleneck that was identified for the given copy run. It also guides you on what to change to boost copy throughput. The performance tuning tips currently provide suggestions like:

    • Use PolyBase when you copy data into Azure SQL Data Warehouse.
    • Increase Azure Cosmos DB Request Units or Azure SQL Database DTUs (Database Throughput Units) when the resource on the data store side is the bottleneck.
    • Remove the unnecessary staged copy.

    The performance tuning rules will be gradually enriched as well.

    Example: Copy into Azure SQL Database with performance tuning tips

    In this sample, during a copy run, Azure Data Factory notices that the sink Azure SQL Database reaches high DTU utilization, which slows down the write operations. The suggestion is to increase the Azure SQL Database tier to get more DTUs.

    Screenshot: copy monitoring with performance tuning tips.

    In addition, the following are some performance optimization features you should be aware of:

  5. Expand the configuration to your entire dataset. When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire dataset.

Copy performance optimization features

Azure Data Factory provides the following performance optimization features:

Data Integration Units

A Data Integration Unit is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Azure Data Factory. Data Integration Units only apply to the Azure integration runtime, not to the self-hosted integration runtime.

You will be charged (# of used DIUs) × (copy duration) × (unit price/DIU-hour). See the current prices here. Local currency and separate discounting may apply per subscription type.
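
For instance, under this formula a copy run that used 4 DIUs and ran for 30 minutes, at a hypothetical rate of $0.25 per DIU-hour (check the pricing page for the actual rate in your region), would cost:

    4 DIUs × 0.5 hour × $0.25/DIU-hour = $0.50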

The allowed number of DIUs to empower a copy activity run is between 2 and 256. If not specified, or if you choose "Auto" on the UI, Data Factory dynamically applies the optimal DIU setting based on your source-sink pair and data pattern. The following table lists the default DIUs used in different copy scenarios:

Copy scenario | Default DIUs determined by service
------------- | ----------------------------------
Copy data between file-based stores | Between 4 and 32, depending on the number and size of the files
Copy data to Azure SQL Database or Azure Cosmos DB | Between 4 and 16, depending on the sink Azure SQL Database's or Cosmos DB's tier (number of DTUs/RUs)
All the other copy scenarios | 4

To override this default, specify a value for the dataIntegrationUnits property as follows. The actual number of DIUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern.

You can see the DIUs used for each copy run in the copy activity output when you monitor an activity run. For more information, see Copy activity monitoring.

Note

Setting DIUs larger than four currently applies only when you copy multiple files from Azure Blob/ADLS Gen2/Amazon S3/Google Cloud Storage/cloud FTP/cloud SFTP, or from a partition-option-enabled cloud relational data store (including Oracle/Netezza/Teradata), to any other cloud data store.

Example:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "BlobSink"
            },
            "dataIntegrationUnits": 32
        }
    }
]

Parallel copy

You can use the parallelCopies property to indicate the parallelism that you want the copy activity to use. You can think of this property as the maximum number of threads within the copy activity that can read from your source or write to your sink data stores in parallel.

For each copy activity run, Azure Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store. The default number of parallel copies that it uses depends on the type of source and sink that you use.

Copy scenario | Default parallel copy count determined by service
------------- | --------------------------------------------------
Copy data between file-based stores | Depends on the size of the files and the number of DIUs used to copy data between two cloud data stores, or the physical configuration of the self-hosted integration runtime machine.
Copy from a relational data store with the partition option enabled (including Oracle, Netezza, Teradata, SAP Table, and SAP Open Hub) | 4
Copy data from any source store to Azure Table storage | 4
All other copy scenarios | 1

Tip

When you copy data between file-based stores, the default behavior usually gives you the best throughput. The default behavior is auto-determined based on your source file pattern.

To control the load on machines that host your data stores, or to tune copy performance, you can override the default value and specify a value for the parallelCopies property. The value must be an integer greater than or equal to 1. At run time, for the best performance, the copy activity uses a value that is less than or equal to the value that you set.

Points to note:

  • When you copy data between file-based stores, parallelCopies determines the parallelism at the file level. The chunking within a single file happens underneath automatically and transparently. It's designed to use the most suitable chunk size for a given source data store type to load data in parallel, orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile, the copy activity can't take advantage of file-level parallelism.
  • When you copy data from stores that are not file-based (except Oracle, Netezza, Teradata, SAP Table, and SAP Open Hub connectors as source with data partitioning enabled) to stores that are file-based, the data movement service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.
  • The parallelCopies property is orthogonal to dataIntegrationUnits. The former is counted across all the Data Integration Units.
  • When you specify a value for the parallelCopies property, consider the load increase on your source and sink data stores. Also consider the load increase on the self-hosted integration runtime if the copy activity is empowered by it, for example, for hybrid copy. This load increase happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. If you notice that either the data store or the self-hosted integration runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.

Example:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "BlobSink"
            },
            "parallelCopies": 32
        }
    }
]

Staged copy

When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases:

  • You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data Warehouse. The source data must be in Blob storage, and it must meet additional criteria. When you load data from a data store other than Blob storage, you can activate data copying via interim staging Blob storage. In that case, Azure Data Factory performs the required data transformations to ensure that the data meets the requirements of PolyBase. Then it uses PolyBase to load data into SQL Data Warehouse efficiently. For more information, see Use PolyBase to load data into Azure SQL Data Warehouse.
  • Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-premises data store to a cloud data store) over a slow network connection. To improve performance, you can use staged copy to compress the data on-premises so that it takes less time to move the data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load it into the destination data store.
  • You don't want to open ports other than port 80 and port 443 in your firewall because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take advantage of the self-hosted integration runtime to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then it can load the data into SQL Database or SQL Data Warehouse from the Blob storage staging instance. In this flow, you don't need to enable port 1433.

How staged copy works

When you activate the staging feature, the data is first copied from the source data store to the staging Blob storage (bring your own). Next, the data is copied from the staging data store to the sink data store. Azure Data Factory automatically manages the two-stage flow for you. Azure Data Factory also cleans up temporary data from the staging storage after the data movement is complete.

Diagram: staged copy flow.

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before you move it from the source data store to the interim or staging data store, and then decompressed before you move it from the interim or staging data store to the sink data store.

Currently, you can't copy data between two data stores that are connected via different self-hosted IRs, with or without staged copy. For such a scenario, you can configure two explicitly chained copy activities to copy from source to staging and then from staging to sink, as in the sketch below.
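
Here is a minimal sketch of that workaround, assuming the source, staging, and sink datasets (and the self-hosted IRs they use) are defined in the linked services they reference; the activity names and the source/sink types are illustrative only. The second copy activity is chained to the first through dependsOn, so it starts only after the first one succeeds.

"activities":[
    {
        "name": "Copy source to staging",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": { "type": "SqlSource" },
            "sink": { "type": "BlobSink" }
        }
    },
    {
        "name": "Copy staging to sink",
        "type": "Copy",
        "dependsOn": [
            {
                "activity": "Copy source to staging",
                "dependencyConditions": [ "Succeeded" ]
            }
        ],
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "SqlSink" }
        }
    }
]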

Configuration

Configure the enableStaging setting in the copy activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional properties listed in the following table. You also need to create an Azure Storage or Storage shared access signature linked service for staging if you don't have one.

Property | Description | Default value | Required
-------- | ----------- | ------------- | --------
enableStaging | Specify whether you want to copy data via an interim staging store. | False | No
linkedServiceName | Specify the name of an AzureStorage linked service, which refers to the instance of Storage that you use as an interim staging store. You can't use Storage with a shared access signature to load data into SQL Data Warehouse via PolyBase. You can use it in all other scenarios. | N/A | Yes, when enableStaging is set to TRUE
path | Specify the Blob storage path that you want to contain the staged data. If you don't provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or you require temporary data to be in a specific location. | N/A | No
enableCompression | Specifies whether data should be compressed before it's copied to the destination. This setting reduces the volume of data being transferred. | False | No

Note

If you use staged copy with compression enabled, service principal or MSI authentication for the staging Blob linked service isn't supported.

Here's a sample definition of a copy activity with the properties that are described in the preceding table:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
            },
            "sink": {
                "type": "SqlSink"
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "MyStagingBlob",
                    "type": "LinkedServiceReference"
                },
                "path": "stagingcontainer/path",
                "enableCompression": true
            }
        }
    }
]

Staged copy billing impact

You're charged based on two steps: copy duration and copy type.

  • When you use staging during a cloud copy, which is copying data from a cloud data store to another cloud data store, both stages empowered by Azure integration runtime, you're charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
  • When you use staging during a hybrid copy, which is copying data from an on-premises data store to a cloud data store, one stage empowered by a self-hosted integration runtime, you're charged for [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].

References

Here are performance monitoring and tuning references for some of the supported data stores:

Next steps

See the other copy activity articles: