Copy activity in Azure Data Factory

In Azure Data Factory, you can use the Copy activity to copy data among data stores located on-premises and in the cloud. After you copy the data, you can use other activities to further transform and analyze it. You can also use the Copy activity to publish transformation and analysis results for business intelligence (BI) and application consumption.

The role of the Copy activity

The Copy activity is executed on an integration runtime. You can use different types of integration runtimes for different data copy scenarios:

  • When you're copying data between two data stores that are publicly accessible through the internet from any IP, you can use the Azure integration runtime for the Copy activity. This integration runtime is secure, reliable, scalable, and globally available.
  • When you're copying data to and from data stores that are located on-premises or in a network with access control (for example, an Azure virtual network), you need to set up a self-hosted integration runtime.

An integration runtime needs to be associated with each source and sink data store. For information about how the Copy activity determines which integration runtime to use, see Determining which IR to use.
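In practice, the association is declared on the linked service through its connectVia property. A minimal sketch, assuming a SQL Server database reachable only through a self-hosted integration runtime named MySelfHostedIR (the names and connection string are placeholders):

{
    "name": "OnPremSqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=<server>;Database=<database>;Integrated Security=True"
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}

If connectVia is omitted, the service resolves an Azure integration runtime for the copy.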

To copy data from a source to a sink, the service that runs the Copy activity performs these steps:

  1. Reads data from the source data store.
  2. Performs serialization/deserialization, compression/decompression, column mapping, and so on. It performs these operations based on the configuration of the input dataset, output dataset, and Copy activity.
  3. Writes data to the sink/destination data store.

Copy activity overview

Supported data stores and formats

The supported data stores, grouped by category, are listed below. For whether a given store is supported as a source or as a sink, and whether it's supported by the Azure integration runtime or the self-hosted integration runtime, see the connector article for that store.

Azure
  • Azure Blob storage
  • Azure Cosmos DB (SQL API)
  • Azure Cosmos DB's API for MongoDB
  • Azure Data Explorer
  • Azure Data Lake Storage Gen2
  • Azure Database for MariaDB
  • Azure Database for MySQL
  • Azure Database for PostgreSQL
  • Azure File Storage
  • Azure SQL Database
  • Azure SQL Database Managed Instance
  • Azure Synapse Analytics (formerly SQL Data Warehouse)
  • Azure Table storage

Database
  • Amazon Redshift
  • DB2
  • Drill
  • Google BigQuery
  • Greenplum
  • HBase
  • Hive
  • Apache Impala
  • Informix
  • MariaDB
  • Microsoft Access
  • MySQL
  • Netezza
  • Oracle
  • Phoenix
  • PostgreSQL
  • Presto (Preview)
  • SAP Business Warehouse via Open Hub
  • SAP Business Warehouse via MDX
  • SAP HANA
  • SAP table
  • Spark
  • SQL Server
  • Sybase
  • Teradata
  • Vertica

NoSQL
  • Cassandra
  • Couchbase (Preview)
  • MongoDB

File
  • Amazon S3
  • File system
  • FTP
  • Google Cloud Storage
  • HDFS
  • SFTP

Generic protocol
  • Generic HTTP
  • Generic OData
  • Generic ODBC
  • Generic REST

Services and apps
  • Amazon Marketplace Web Service
  • Common Data Service
  • Concur (Preview)
  • Dynamics 365
  • Dynamics AX
  • Dynamics CRM
  • Google AdWords
  • HubSpot (Preview)
  • Jira
  • Magento (Preview)
  • Marketo (Preview)
  • Office 365
  • Oracle Eloqua (Preview)
  • Oracle Responsys (Preview)
  • Oracle Service Cloud (Preview)
  • PayPal (Preview)
  • QuickBooks (Preview)
  • Salesforce
  • Salesforce Service Cloud
  • Salesforce Marketing Cloud
  • SAP Cloud for Customer (C4C)
  • SAP ECC
  • ServiceNow
  • Shopify (Preview)
  • Square (Preview)
  • Web table (HTML table)
  • Xero
  • Zoho (Preview)

Note

If a connector is marked Preview, you can try it out and give us feedback. If you want to take a dependency on preview connectors in your solution, contact Azure support.

Supported file formats

Azure Data Factory supports file formats such as Avro, Binary, delimited text (CSV), JSON, ORC, and Parquet. Refer to each format's article for its format-based settings.

You can use the Copy activity to copy files as-is between two file-based data stores, in which case the data is copied efficiently without any serialization or deserialization. You can also parse or generate files of a given format. For example, you can do the following (see the dataset sketch after this list):

  • Copy data from an on-premises SQL Server database and write it to Azure Data Lake Storage Gen2 in Parquet format.
  • Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
  • Copy zipped files from an on-premises file system, decompress them on the fly, and write the extracted files to Azure Data Lake Storage Gen2.
  • Copy data in Gzip compressed-text (CSV) format from Azure Blob storage and write it to Azure SQL Database.
  • Many more activities that require serialization/deserialization or compression/decompression.
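Format and compression settings live on the dataset. A minimal sketch of a source dataset for the Gzip compressed-text scenario above, assuming a Blob storage linked service named AzureBlobStorageLinkedService (the container and file names are placeholders):

{
    "name": "GzippedCsvInput",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "data.csv.gz"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}

A Parquet sink dataset would look similar, with "type": "Parquet" and a location that points at the destination store.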

Supported regions

The service that enables the Copy activity is available globally in the regions and geographies listed in Azure integration runtime locations. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See Products by region to check the availability of Data Factory and data movement in a specific region.

Configuration

To use the Copy activity in Azure Data Factory, you need to:

  1. Create linked services for the source data store and the sink data store. Refer to the connector article's "Linked service properties" section for configuration information and supported properties. You can find the list of supported connectors in the Supported data stores and formats section of this article.
  2. Create datasets for the source and sink. Refer to the "Dataset properties" sections of the source and sink connector articles for configuration information and supported properties.
  3. Create a pipeline with the Copy activity. The next section provides an example.

Syntax

The following template of a Copy activity contains a complete list of supported properties. Specify the ones that fit your scenario.

"activities":[
    {
        "name": "CopyActivityTemplate",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<source dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<sink dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>",
                <properties>
            },
            "sink": {
                "type": "<sink type>"
                <properties>
            },
            "translator":
            {
                "type": "TabularTranslator",
                "columnMappings": "<column mapping>"
            },
            "dataIntegrationUnits": <number>,
            "parallelCopies": <number>,
            "enableStaging": true/false,
            "stagingSettings": {
                <properties>
            },
            "enableSkipIncompatibleRow": true/false,
            "redirectIncompatibleRowSettings": {
                <properties>
            }
        }
    }
]
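
As a concrete instance of the template, the following sketch copies a delimited text file from Blob storage into Azure SQL Database. The dataset names and the writeBatchSize value are placeholders for illustration:

"activities":[
    {
        "name": "CopyFromBlobToAzureSql",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "GzippedCsvInput",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSqlOutput",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings"
                }
            },
            "sink": {
                "type": "AzureSqlSink",
                "writeBatchSize": 10000
            }
        }
    }
]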

Syntax details

  • type: For a Copy activity, set to Copy. Required: Yes.
  • inputs: Specify the dataset that you created that points to the source data. The Copy activity supports only a single input. Required: Yes.
  • outputs: Specify the dataset that you created that points to the sink data. The Copy activity supports only a single output. Required: Yes.
  • typeProperties: Specify properties to configure the Copy activity. Required: Yes.
  • source: Specify the copy source type and the corresponding properties for retrieving data. For more information, see the "Copy activity properties" section in the connector article listed in Supported data stores and formats. Required: Yes.
  • sink: Specify the copy sink type and the corresponding properties for writing data. For more information, see the "Copy activity properties" section in the connector article listed in Supported data stores and formats. Required: Yes.
  • translator: Specify explicit column mappings from source to sink. This property applies when the default copy behavior doesn't meet your needs. For more information, see Schema mapping in copy activity. Required: No.
  • dataIntegrationUnits: Specify a measure that represents the amount of power that the Azure integration runtime uses for data copy. These units were formerly known as cloud Data Movement Units (DMUs). For more information, see Data Integration Units. Required: No.
  • parallelCopies: Specify the parallelism that you want the Copy activity to use when reading data from the source and writing data to the sink. For more information, see Parallel copy. Required: No.
  • preserve: Specify whether to preserve metadata/ACLs during data copy. For more information, see Preserve metadata. Required: No.
  • enableStaging, stagingSettings: Specify whether to stage the interim data in Blob storage instead of copying data directly from source to sink. For information about useful scenarios and configuration details, see Staged copy. Required: No.
  • enableSkipIncompatibleRow, redirectIncompatibleRowSettings: Choose how to handle incompatible rows when you copy data from source to sink. For more information, see Fault tolerance. Required: No.

Monitoring

You can monitor the Copy activity run in the Azure Data Factory Author & Monitor UI or programmatically.

Monitor visually

To visually monitor the Copy activity run, go to your data factory and then go to Author & Monitor. On the Monitor tab, you see a list of pipeline runs with a View Activity Runs button in the Actions column:

Monitor pipeline runs

Select View Activity Runs to see the list of activities in the pipeline run. In the Actions column, you see links to the Copy activity input, output, errors (if the Copy activity run fails), and details:

Monitor activity runs

Select the Details button in the Actions column to see the Copy activity's execution details and performance characteristics. You see information like the volume/number of rows/number of files of data copied from source to sink, throughput, the steps the Copy activity goes through with corresponding durations, and the configurations used for your copy scenario.

Tip

In some scenarios, you'll also see Performance tuning tips at the top of the copy monitoring page. These tips tell you about identified bottlenecks and provide information on what to change to boost copy throughput. For an example, see the Performance and tuning section of this article.

Example: Copy from Amazon S3 to Azure Data Lake Store

Monitor activity run details

Example: Copy from Azure SQL Database to Azure SQL Data Warehouse with staged copy

Monitor activity run details

Monitor programmatically

Copy activity execution details and performance characteristics are also returned in the Copy activity run result > Output section. Following is a complete list of properties that might be returned. You'll see only the properties that are applicable to your copy scenario. For information about how to monitor activity runs, see Monitor a pipeline run.

  • dataRead: Amount of data read from the source. Int64 value, in bytes.
  • dataWritten: Amount of data written to the sink. Int64 value, in bytes.
  • filesRead: Number of files copied during copy from file storage. Int64 value (no unit).
  • filesWritten: Number of files copied during copy to file storage. Int64 value (no unit).
  • sourcePeakConnections: Peak number of concurrent connections established to the source data store during the Copy activity run. Int64 value (no unit).
  • sinkPeakConnections: Peak number of concurrent connections established to the sink data store during the Copy activity run. Int64 value (no unit).
  • rowsRead: Number of rows read from the source (not applicable for binary copy). Int64 value (no unit).
  • rowsCopied: Number of rows copied to the sink (not applicable for binary copy). Int64 value (no unit).
  • rowsSkipped: Number of incompatible rows that were skipped. You can enable incompatible rows to be skipped by setting enableSkipIncompatibleRow to true. Int64 value (no unit).
  • copyDuration: Duration of the copy run. Int32 value, in seconds.
  • throughput: Rate of data transfer. Floating point number, in KBps.
  • sqlDwPolyBase: Whether PolyBase is used when data is copied into SQL Data Warehouse. Boolean.
  • redshiftUnload: Whether UNLOAD is used when data is copied from Redshift. Boolean.
  • hdfsDistcp: Whether DistCp is used when data is copied from HDFS. Boolean.
  • effectiveIntegrationRuntime: The integration runtime (IR) or runtimes used to power the activity run, in the format <IR name> (<region if it's Azure IR>). Text (string).
  • usedDataIntegrationUnits: The effective Data Integration Units during copy. Int32 value.
  • usedParallelCopies: The effective parallelCopies during copy. Int32 value.
  • redirectRowPath: Path to the log of skipped incompatible rows in the blob storage you configure in the redirectIncompatibleRowSettings property. See Fault tolerance later in this article. Text (string).
  • executionDetails: More details on the stages the Copy activity goes through and the corresponding steps, durations, configurations, and so on. We don't recommend that you parse this section because it might change. Array.
  • perfRecommendation: Copy performance tuning tips. See Performance and tuning for details. Array.

Data Factory also reports the detailed durations (in seconds) spent on the various stages under detailedDurations. The durations of these steps are exclusive; only durations that apply to the given Copy activity run appear:

  • Queuing duration (queuingDuration): The amount of time before the Copy activity actually starts on the integration runtime. If you use a self-hosted IR and this value is large, check the IR capacity and usage, and scale up or out according to your workload.
  • Pre-copy script duration (preCopyScriptDuration): The time elapsed between when the Copy activity starts on the IR and when it finishes running the pre-copy script in the sink data store. Applies when you configure a pre-copy script.
  • Time to first byte (timeToFirstByte): The time elapsed between the end of the previous step and the time when the IR receives the first byte from the source data store. Applies to non-file-based sources. If this value is large, check and optimize the query or server.
  • Transfer duration (transferDuration): The time elapsed between the end of the previous step and the time when the IR transfers all the data from source to sink.

Example output:
"output": {
    "dataRead": 6198358,
    "dataWritten": 19169324,
    "filesRead": 1,
    "sourcePeakConnections": 1,
    "sinkPeakConnections": 2,
    "rowsRead": 39614,
    "rowsCopied": 39614,
    "copyDuration": 1325,
    "throughput": 4.568,
    "errors": [],
    "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (China East 2)",
    "usedDataIntegrationUnits": 4,
    "usedParallelCopies": 1,
    "executionDetails": [
        {
            "source": {
                "type": "AzureBlobStorage"
            },
            "sink": {
                "type": "AzureSqlDatabase"
            },
            "status": "Succeeded",
            "start": "2019-08-06T01:01:36.7778286Z",
            "duration": 1325,
            "usedDataIntegrationUnits": 4,
            "usedParallelCopies": 1,
            "detailedDurations": {
                "queuingDuration": 2,
                "preCopyScriptDuration": 12,
                "transferDuration": 1311
            }
        }
    ],
    "perfRecommendation": [
        {
            "Tip": "Sink Azure SQL Database: The DTU utilization was high during the copy activity run. To achieve better performance, you are suggested to scale the database to a higher tier than the current 1600 DTUs.",
            "ReferUrl": "https://docs.azure.cn/zh-cn/sql-database/sql-database-purchase-models",
            "RuleName": "AzureDBTierUpgradePerfRecommendRule"
        }
    ]
}
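
A downstream activity can read these values with expressions. A minimal sketch, assuming a preceding Copy activity named CopyFromBlobToAzureSql, an Azure SQL linked service, and a logging stored procedure usp_LogCopyRun of your own (all names are placeholders):

{
    "name": "LogCopyResult",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        {
            "activity": "CopyFromBlobToAzureSql",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabaseLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "usp_LogCopyRun",
        "storedProcedureParameters": {
            "rowsCopied": {
                "value": "@activity('CopyFromBlobToAzureSql').output.rowsCopied",
                "type": "Int64"
            },
            "copyDuration": {
                "value": "@activity('CopyFromBlobToAzureSql').output.copyDuration",
                "type": "Int32"
            }
        }
    }
}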

Incremental copy

Data Factory enables you to incrementally copy delta data from a source data store to a sink data store. For details, see Tutorial: Incrementally copy data.
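A common pattern from that tutorial is a watermark: Lookup activities read the old and new watermark values, and the Copy activity's source query selects only the delta. A minimal sketch of such a source, assuming Lookup activities named LookupOldWatermark and LookupNewWatermark and a LastModifyTime column (all placeholders):

"source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE LastModifyTime > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}' AND LastModifyTime <= '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}'"
}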

Performance and tuning

The Copy activity performance and scalability guide describes the key factors that affect the performance of data movement via the Copy activity in Azure Data Factory. It also lists the performance values observed during testing and discusses how to optimize the performance of the Copy activity.

In some scenarios, when you run a Copy activity in Data Factory, you'll see Performance tuning tips at the top of the Copy activity monitoring page, as shown in the following example. The tips tell you the bottleneck identified for the given copy run. They also provide information on what to change to boost copy throughput. The performance tuning tips currently provide suggestions like using PolyBase when copying data into Azure SQL Data Warehouse, increasing Azure Cosmos DB RUs or Azure SQL Database DTUs when the resource on the data store side is the bottleneck, and removing unnecessary staged copies.

Example: Copy into Azure SQL Database, with a performance tuning tip

In this sample, during a copy run, Data Factory detects high DTU utilization in the sink Azure SQL Database, which slows down write operations. The suggestion is to increase the DTUs on the Azure SQL Database tier:

Copy monitoring with performance tuning tips

Preserve metadata along with data

While copying data from source to sink, in scenarios like data lake migration, you can also choose to preserve the metadata and ACLs along with the data by using the Copy activity. See Preserve metadata for details.
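A minimal sketch of the relevant typeProperties for an ADLS Gen2-to-ADLS Gen2 binary copy that preserves ACLs (which attributes the preserve array accepts depends on the connector; ACL, Owner, and Group here follow the Preserve metadata article):

"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings"
        }
    },
    "preserve": [
        "ACL",
        "Owner",
        "Group"
    ]
}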

Schema and data type mapping

See Schema and data type mapping for information about how the Copy activity maps your source data to your sink.
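For explicit mappings, the translator property shown in the syntax template takes source/sink column pairs. A minimal sketch using the newer mappings form (the legacy columnMappings string shown in the template is also accepted); the column names are hypothetical:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "OrderId" },
            "sink": { "name": "Id" }
        },
        {
            "source": { "name": "OrderDate" },
            "sink": { "name": "Date" }
        }
    ]
}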

Fault tolerance

By default, the Copy activity stops copying data and returns a failure when source data rows are incompatible with sink data rows. To make the copy succeed, you can configure the Copy activity to skip and log the incompatible rows and copy only the compatible data. See Copy activity fault tolerance for details.
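A minimal sketch of the relevant typeProperties, assuming a Blob storage linked service named AzureBlobStorageLinkedService for the error log (the path is a placeholder):

"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
    "linkedServiceName": {
        "referenceName": "AzureBlobStorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "path": "redirectcontainer/erroroutput"
}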

Next steps

See the following quickstarts, tutorials, and samples: