Copy activity performance optimization features

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

This article outlines the copy activity performance optimization features that you can leverage in Azure Data Factory.

Data Integration Units

A Data Integration Unit (DIU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Azure Data Factory. Data Integration Units apply only to the Azure integration runtime, not to the self-hosted integration runtime.

A copy activity run can be empowered by 2 to 256 DIUs. If you don't specify a value, or if you choose "Auto" in the UI, Data Factory dynamically applies the optimal DIU setting based on your source-sink pair and data pattern. The supported DIU ranges and the default behavior in different copy scenarios are as follows:

• Between file stores
  Supported DIU range:
  - Copy from or to a single file: 2-4
  - Copy from or to multiple files: 2-256, depending on the number and size of the files. For example, if you copy data from a folder with 4 large files and choose to preserve hierarchy, the maximum effective DIU count is 16; when you choose to merge the files, the maximum effective DIU count is 4.
  Default DIUs determined by service: between 4 and 32, depending on the number and size of the files.

• From file store to non-file store
  Supported DIU range:
  - Copy from a single file: 2-4
  - Copy from multiple files: 2-256, depending on the number and size of the files. For example, if you copy data from a folder with 4 large files, the maximum effective DIU count is 16.
  Default DIUs determined by service:
  - Copy into Azure SQL Database or Azure Cosmos DB: between 4 and 16, depending on the sink tier (DTUs/RUs) and source file pattern
  - Copy into Azure Synapse Analytics using PolyBase or COPY statement: 2
  - Other scenarios: 4

• From non-file store to file store
  Supported DIU range:
  - Copy from partition-option-enabled data stores (including Oracle/Netezza/Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs.
  - Other scenarios: 2-4
  Default DIUs determined by service:
  - Copy from REST or HTTP: 1
  - Copy from Amazon Redshift using UNLOAD: 2
  - Other scenarios: 4

• Between non-file stores
  Supported DIU range:
  - Copy from partition-option-enabled data stores (including Oracle/Netezza/Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs.
  - Other scenarios: 2-4
  Default DIUs determined by service:
  - Copy from REST or HTTP: 1
  - Other scenarios: 4

You can see the DIUs used for each copy run in the copy activity monitoring view or in the activity output. For more information, see Copy activity monitoring. To override this default, specify a value for the dataIntegrationUnits property as follows. The actual number of DIUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern.

You're charged (number of DIUs used) × (copy duration) × (unit price per DIU-hour). See the current prices here. Local currency and separate discounting may apply per subscription type.
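
For example, suppose a copy run uses 16 DIUs and takes 30 minutes, and assume a hypothetical unit price of $0.25 per DIU-hour (check the pricing page for the actual rate in your region): the charge for that run would be 16 × 0.5 × 0.25 = $2.00.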

Example:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStorageSink"
            },
            "dataIntegrationUnits": 128
        }
    }
]
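
After a run finishes, the copy activity output mentioned earlier reports the resources actually used. Here's a trimmed, illustrative sketch of the relevant output fields (the values are made up; usedParallelCopies is covered in the parallel copy section below):

{
    "dataRead": 107374182400,
    "dataWritten": 107374182400,
    "filesRead": 110,
    "filesWritten": 110,
    "copyDuration": 1200,
    "usedDataIntegrationUnits": 32,
    "usedParallelCopies": 8,
    "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)"
}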

Self-hosted integration runtime scalability

If you would like to achieve higher throughput, you can either scale up or scale out the self-hosted IR:

  • If the CPU and available memory on the self-hosted IR node aren't fully utilized, but the execution of concurrent jobs is reaching the limit, you should scale up by increasing the number of concurrent jobs that can run on a node. See here for instructions.
  • If, on the other hand, CPU utilization is high or available memory is low on the self-hosted IR node, you can add a new node to help scale out the load across multiple nodes. See here for instructions.

Note that in the following scenarios, a single copy activity execution can leverage multiple self-hosted IR nodes:

  • Copy data from file-based stores, depending on the number and size of the files.
  • Copy data from a partition-option-enabled data store (including Oracle, Netezza, Teradata, SAP HANA, SAP Table, and SAP Open Hub), depending on the number of data partitions.

Parallel copy

You can set parallel copy (the parallelCopies property) on the copy activity to indicate the parallelism that you want the copy activity to use. You can think of this property as the maximum number of threads within the copy activity that read from your source or write to your sink data stores in parallel.

Parallel copy is orthogonal to Data Integration Units or self-hosted IR nodes. It's counted across all the DIUs or self-hosted IR nodes.

For each copy activity run, by default Azure Data Factory dynamically applies the optimal parallel copy setting based on your source-sink pair and data pattern.

Tip

The default behavior of parallel copy usually gives you the best throughput, which is auto-determined by ADF based on your source-sink pair, data pattern, and the number of DIUs or the self-hosted IR's CPU/memory/node count. Refer to Troubleshoot copy activity performance for when to tune parallel copy.

The parallel copy behavior in different copy scenarios is as follows:

• Between file stores
  parallelCopies determines the parallelism at the file level. The chunking within each file happens automatically and transparently underneath; it's designed to use the best-suited chunk size for a given data store type to load data in parallel. The actual number of parallel copies that the copy activity uses at run time is no more than the number of files you have. If the copy behavior is mergeFile into a file sink, the copy activity can't take advantage of file-level parallelism.

• From file store to non-file store
  - When copying data into Azure SQL Database or Azure Cosmos DB, the default parallel copy also depends on the sink tier (number of DTUs/RUs).
  - When copying data into Azure Table, the default parallel copy is 4.

• From non-file store to file store
  - When copying data from a partition-option-enabled data store (including Oracle, Netezza, Teradata, SAP HANA, SAP Table, and SAP Open Hub), the default parallel copy is 4. The actual number of parallel copies that the copy activity uses at run time is no more than the number of data partitions you have. When you use the self-hosted integration runtime and copy to Azure Blob/ADLS Gen2, note that the maximum effective parallel copy is 4 or 5 per IR node.
  - For other scenarios, parallel copy doesn't take effect. Even if parallelism is specified, it's not applied.

• Between non-file stores
  - When copying data into Azure SQL Database or Azure Cosmos DB, the default parallel copy also depends on the sink tier (number of DTUs/RUs).
  - When copying data from a partition-option-enabled data store (including Oracle, Netezza, Teradata, SAP HANA, SAP Table, and SAP Open Hub), the default parallel copy is 4.
  - When copying data into Azure Table, the default parallel copy is 4.

To control the load on machines that host your data stores, or to tune copy performance, you can override the default value and specify a value for the parallelCopies property. The value must be an integer greater than or equal to 1. At run time, for the best performance, the copy activity uses a value that is less than or equal to the value that you set.

parallelCopies 属性指定值时,请考虑到源和接收器数据存储上的负载会增大。When you specify a value for the parallelCopies property, take the load increase on your source and sink data stores into account. 另外请考虑到,如果复制活动由自承载集成运行时提供支持,则自承载集成运行时的负载也会增大。Also consider the load increase to the self-hosted integration runtime if the copy activity is empowered by it. 尤其在有多个活动或针对同一数据存储运行的相同活动有并发运行时,会发生这种负载增加的情况。This load increase happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. 如果注意到数据存储或自承载集成运行时负载过重,请减小 parallelCopies 值以减轻负载。If you notice that either the data store or the self-hosted integration runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.

Example:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStorageSink"
            },
            "parallelCopies": 32
        }
    }
]
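
The two settings can also be combined on a single copy activity. Here's a minimal sketch based on the two preceding examples (the property values are illustrative only, not recommendations):

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "AzureDataLakeStorageSink"
            },
            "dataIntegrationUnits": 32,
            "parallelCopies": 8
        }
    }
]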

Staged copy

When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases:

  • You want to ingest data from various data stores into Azure Synapse Analytics (formerly SQL Data Warehouse) via PolyBase. Azure Synapse Analytics uses PolyBase as a high-throughput mechanism to load a large amount of data into Azure Synapse Analytics. The source data must be in Blob storage, and it must meet additional criteria. When you load data from a data store other than Blob storage, you can activate data copying via interim staging Blob storage. In that case, Azure Data Factory performs the required data transformations to ensure that the data meets the requirements of PolyBase, and then uses PolyBase to load the data into Azure Synapse Analytics efficiently. For more information, see Use PolyBase to load data into Azure SQL Data Warehouse.
  • Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-premises data store to a cloud data store) over a slow network connection. To improve performance, you can use staged copy to compress the data on-premises so that it takes less time to move the data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load it into the destination data store.
  • You don't want to open ports other than port 80 and port 443 in your firewall because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure Synapse Analytics sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take advantage of the self-hosted integration runtime to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443, and then load the data from the Blob staging storage into SQL Database or Azure Synapse Analytics. In this flow, you don't need to enable port 1433.

How staged copy works

When you activate the staging feature, the data is first copied from the source data store to the staging Blob storage (bring your own). Next, it's copied from the staging data store to the sink data store. Azure Data Factory automatically manages the two-stage flow for you and also cleans up temporary data from the staging storage after the data movement is complete.

(Diagram: staged copy)

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before it's moved from the source data store to the staging store, and then decompressed before it's moved from the staging store to the sink data store.

Currently, you can't copy data between two data stores that are connected via different self-hosted IRs, with or without staged copy. For such a scenario, you can configure two explicitly chained copy activities: one copies from the source to the staging store, and the other copies from the staging store to the sink.
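
Here's a minimal sketch of that workaround, with two chained copy activities in one pipeline; the dataset and activity names are hypothetical, and the second activity runs only after the first succeeds:

"activities":[
    {
        "name": "CopySourceToStaging",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagingBlobDataset", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "SqlSource" },
            "sink": { "type": "BlobSink" }
        }
    },
    {
        "name": "CopyStagingToSink",
        "type": "Copy",
        "dependsOn": [
            { "activity": "CopySourceToStaging", "dependencyConditions": [ "Succeeded" ] }
        ],
        "inputs": [ { "referenceName": "StagingBlobDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "SqlSink" }
        }
    }
]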

Configuration

Configure the enableStaging setting in the copy activity to specify whether you want the data to be staged in Blob storage before you load it into the destination data store. When you set enableStaging to TRUE, specify the additional properties listed below. You also need to create an Azure Storage or Storage shared access signature linked service for staging if you don't have one.

• enableStaging
  Description: Specify whether you want to copy data via an interim staging store.
  Default value: False
  Required: No

• linkedServiceName
  Description: Specify the name of an AzureStorage linked service, which refers to the instance of Storage that you use as an interim staging store. You can't use Storage with a shared access signature to load data into Azure Synapse Analytics via PolyBase; you can use it in all other scenarios.
  Default value: N/A
  Required: Yes, when enableStaging is set to TRUE

• path
  Description: Specify the Blob storage path that you want to contain the staged data. If you don't provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or if you require temporary data to be in a specific location.
  Default value: N/A
  Required: No

• enableCompression
  Description: Specifies whether data should be compressed before it's copied to the destination. This setting reduces the volume of data being transferred.
  Default value: False
  Required: No

Note

If you use staged copy with compression enabled, service principal or MSI authentication for the staging Blob linked service isn't supported.

Here's a sample definition of a copy activity with the properties described above:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
            },
            "sink": {
                "type": "SqlSink"
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "MyStagingBlob",
                    "type": "LinkedServiceReference"
                },
                "path": "stagingcontainer/path",
                "enableCompression": true
            }
        }
    }
]

Staged copy billing impact

You're charged based on two steps: copy duration and copy type.

  • When you use staging during a cloud copy (copying data from one cloud data store to another cloud data store, with both stages empowered by the Azure integration runtime), you're charged [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
  • When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data store, with one stage empowered by a self-hosted integration runtime), you're charged [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].
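
For example, take a hybrid staged copy that spends 2 hours moving data from an on-premises store to your staging Blob storage through the self-hosted integration runtime, and then 1 hour loading from staging into the cloud sink on the Azure integration runtime. With hypothetical unit prices of $0.10 per hour for hybrid copy and $0.25 per hour for cloud copy (see the pricing page for actual rates), the charge would be 2 × 0.10 + 1 × 0.25 = $0.45.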

Next steps

See the other copy activity articles: