Copy activity in Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

In Azure Data Factory, you can use the Copy activity to copy data among data stores located on-premises and in the cloud. After you copy the data, you can use other activities to further transform and analyze it. You can also use the Copy activity to publish transformation and analysis results for business intelligence (BI) and application consumption.

The role of the Copy activity

The Copy activity is executed on an integration runtime. You can use different types of integration runtimes for different data copy scenarios:

  • When you're copying data between two data stores that are publicly accessible through the internet from any IP, you can use the Azure integration runtime for the Copy activity. This integration runtime is secure, reliable, scalable, and globally available.
  • When you're copying data to and from data stores that are located on-premises or in a network with access control (for example, an Azure virtual network), you need to set up a self-hosted integration runtime.

An integration runtime needs to be associated with each source and sink data store. For information about how the Copy activity determines which integration runtime to use, see Determining which IR to use.

To copy data from a source to a sink, the service that runs the Copy activity performs these steps:

  1. Reads data from a source data store.
  2. Performs serialization/deserialization, compression/decompression, column mapping, and so on. It performs these operations based on the configuration of the input dataset, output dataset, and Copy activity.
  3. Writes data to the sink/destination data store.

(Diagram: Copy activity overview)

Supported data stores and formats

Category Data store Supported as a source Supported as a sink Supported by Azure IR Supported by self-hosted IR
Azure Azure Blob storage
  Azure Cognitive Search index
  Azure Cosmos DB (SQL API)
  Azure Cosmos DB's API for MongoDB
  Azure Data Explorer
  Azure Data Lake Storage Gen2
  Azure Database for MariaDB
  Azure Database for MySQL
  Azure Database for PostgreSQL
  Azure File Storage
  Azure SQL Database
  Azure SQL Managed Instance
  Azure Synapse Analytics (formerly SQL Data Warehouse)
  Azure Table storage
Database Amazon Redshift
  DB2
  Drill
  Google BigQuery
  Greenplum
  HBase
  Hive
  Apache Impala
  Informix
  MariaDB
  Microsoft Access
  MySQL
  Netezza
  Oracle
  Phoenix
  PostgreSQL
  Presto (Preview)
  SAP Business Warehouse via Open Hub
  SAP Business Warehouse via MDX
  SAP HANA
  SAP table
  Snowflake
  Spark
  SQL Server
  Sybase
  Teradata
  Vertica
NoSQL Cassandra
  Couchbase (Preview)
  MongoDB
File Amazon S3
  File system
  FTP
  Google Cloud Storage
  HDFS
  SFTP
Generic protocol Generic HTTP
  Generic OData
  Generic ODBC
  Generic REST
Services and apps Amazon Marketplace Web Service
  Common Data Service
  Concur (Preview)
  Dynamics 365
  Dynamics AX
  Dynamics CRM
  Google AdWords
  HubSpot (Preview)
  Jira
  Magento (Preview)
  Marketo (Preview)
  Office 365
  Oracle Eloqua (Preview)
  Oracle Responsys (Preview)
  Oracle Service Cloud (Preview)
  PayPal (Preview)
  QuickBooks (Preview)
  Salesforce
  Salesforce Service Cloud
  Salesforce Marketing Cloud
  SAP Cloud for Customer (C4C)
  SAP ECC
  ServiceNow
  SharePoint Online List
  Shopify (Preview)
  Square (Preview)
  Web table (HTML table)
  Xero
  Zoho (Preview)

Note

If a connector is marked Preview, you can try it out and give us feedback. If you want to take a dependency on preview connectors in your solution, contact Azure support.

Supported file formats

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

You can use the Copy activity to copy files as-is between two file-based data stores, in which case the data is copied efficiently without any serialization or deserialization. You can also parse or generate files of a given format. For example, you can:

  • Copy data from a SQL Server database and write it to Azure Data Lake Storage Gen2 in Parquet format (see the dataset sketch after this list).
  • Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
  • Copy zipped files from an on-premises file system, decompress them on the fly, and write the extracted files to Azure Data Lake Storage Gen2.
  • Copy data in Gzip compressed-text (CSV) format from Azure Blob storage and write it to Azure SQL Database.
  • Handle many more scenarios that require serialization/deserialization or compression/decompression.
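 
As a minimal sketch of the first scenario, a SQL Server source dataset and a Parquet sink dataset that a Copy activity could reference might look like the following. The dataset names, linked service references, table, and folder paths are hypothetical placeholders:

{
    "name": "SqlServerSourceDataset",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": "<SQL Server linked service>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "schema": "dbo",
            "table": "SalesOrders"
        }
    }
},
{
    "name": "ParquetSinkDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "<Data Lake Storage Gen2 linked service>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "<file system>",
                "folderPath": "output/salesorders"
            }
        }
    }
}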

Supported regions

The service that enables the Copy activity is available globally in the regions and geographies listed in Azure integration runtime locations. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See Products by region to check the availability of Data Factory and data movement in a specific region.

Configuration

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

  • The Copy Data tool
  • The Azure portal
  • The .NET SDK
  • The Python SDK
  • Azure PowerShell
  • The REST API
  • The Azure Resource Manager template

In general, to use the Copy activity in Azure Data Factory, you need to:

  1. Create linked services for the source data store and the sink data store. You can find the list of supported connectors in the Supported data stores and formats section of this article. Refer to the connector article's "Linked service properties" section for configuration information and supported properties. (A minimal linked service sketch follows this list.)
  2. Create datasets for the source and sink. Refer to the "Dataset properties" sections of the source and sink connector articles for configuration information and supported properties.
  3. Create a pipeline with the Copy activity. The next section provides an example.
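 
As an illustration of step 1, a linked service for Azure Blob storage might look like the following sketch. The name and connection string values are placeholders, not tied to any real account:

{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
        }
    }
}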

Syntax

The following template of a Copy activity contains a complete list of supported properties. Specify the ones that fit your scenario.

"activities":[
    {
        "name": "CopyActivityTemplate",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<source dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<sink dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>",
                <properties>
            },
            "sink": {
                "type": "<sink type>"
                <properties>
            },
            "translator":
            {
                "type": "TabularTranslator",
                "columnMappings": "<column mapping>"
            },
            "dataIntegrationUnits": <number>,
            "parallelCopies": <number>,
            "enableStaging": true/false,
            "stagingSettings": {
                <properties>
            },
            "enableSkipIncompatibleRow": true/false,
            "redirectIncompatibleRowSettings": {
                <properties>
            }
        }
    }
]

Syntax details

Property  Description  Required?
type  For a Copy activity, set to Copy.  Yes
inputs  Specify the dataset that you created that points to the source data. The Copy activity supports only a single input.  Yes
outputs  Specify the dataset that you created that points to the sink data. The Copy activity supports only a single output.  Yes
typeProperties  Specify properties to configure the Copy activity.  Yes
source  Specify the copy source type and the corresponding properties for retrieving data. For more information, see the "Copy activity properties" section in the connector article listed in Supported data stores and formats.  Yes
sink  Specify the copy sink type and the corresponding properties for writing data. For more information, see the "Copy activity properties" section in the connector article listed in Supported data stores and formats.  Yes
translator  Specify explicit column mappings from source to sink. This property applies when the default copy behavior doesn't meet your needs. For more information, see Schema mapping in copy activity.  No
dataIntegrationUnits  Specify a measure that represents the amount of power that the Azure integration runtime uses for data copy. These units were formerly known as cloud Data Movement Units (DMU). For more information, see Data Integration Units.  No
parallelCopies  Specify the parallelism that you want the Copy activity to use when reading data from the source and writing data to the sink. For more information, see Parallel copy.  No
preserve  Specify whether to preserve metadata/ACLs during data copy. For more information, see Preserve metadata.  No
enableStaging, stagingSettings  Specify whether to stage the interim data in Blob storage instead of directly copying data from source to sink. For information about useful scenarios and configuration details, see Staged copy.  No
enableSkipIncompatibleRow, redirectIncompatibleRowSettings  Choose how to handle incompatible rows when you copy data from source to sink. For more information, see Fault tolerance.  No
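 
The following is a minimal concrete example based on the template above. The dataset names are hypothetical placeholders, and the source and sink types shown (a delimited-text source in Azure Blob storage and an Azure SQL Database sink) are just one possible combination:

"activities":[
    {
        "name": "CopyBlobCsvToAzureSql",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "BlobDelimitedTextDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSqlTableDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                }
            },
            "sink": {
                "type": "AzureSqlSink",
                "writeBatchSize": 10000
            },
            "enableStaging": false
        }
    }
]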

Monitoring

You can monitor the Copy activity run in Azure Data Factory both visually and programmatically. For details, see Monitor copy activity.

Incremental copy

Data Factory enables you to incrementally copy delta data from a source data store to a sink data store. For details, see Tutorial: Incrementally copy data.
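 
One common incremental pattern is filtering the source by a watermark column. The following is a sketch of a Copy activity source that assumes an Azure SQL source, a hypothetical dbo.Orders table with a LastModifiedTime column, and watermark values passed in as pipeline parameters:

"source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE LastModifiedTime > '@{pipeline().parameters.lastWatermark}' AND LastModifiedTime <= '@{pipeline().parameters.currentWatermark}'"
}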

Performance and tuning

The Copy activity monitoring experience shows you the copy performance statistics for each of your activity runs. The Copy activity performance and scalability guide describes key factors that affect the performance of data movement via the Copy activity in Azure Data Factory. It also lists the performance values observed during testing and discusses how to optimize the performance of the Copy activity.
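 
For reference, the two main tuning knobs from the template above are set directly in typeProperties. The values below are illustrative only; appropriate settings depend on your data volume and stores:

"typeProperties": {
    "source": { "type": "<source type>" },
    "sink": { "type": "<sink type>" },
    "dataIntegrationUnits": 32,
    "parallelCopies": 8
}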

Resume from last failed run

The Copy activity supports resuming from the last failed run when you copy large files as-is in binary format between file-based stores and choose to preserve the folder/file hierarchy from source to sink, for example, when migrating data from Amazon S3 to Azure Data Lake Storage Gen2. It applies to the following file-based connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, and SFTP.

You can leverage the copy activity resume in the following two ways:

  • Activity-level retry: You can set a retry count on the Copy activity (see the sketch after this list). During the pipeline execution, if this Copy activity run fails, the next automatic retry starts from the failure point of the last trial.

  • Rerun from failed activity: After the pipeline execution completes, you can also trigger a rerun from the failed activity in the ADF UI monitoring view or programmatically. If the failed activity is a Copy activity, the pipeline not only reruns from this activity but also resumes from the failure point of the previous run.

    (Screenshot: copy resume)
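 
A sketch of the activity-level retry option follows. The activity name, retry values, and store settings are illustrative (an S3-to-ADLS Gen2 binary copy is assumed), and the inputs and outputs are elided:

{
    "name": "CopyLargeFileSet",
    "type": "Copy",
    "policy": {
        "retry": 3,
        "retryIntervalInSeconds": 60
    },
    "inputs": [...],
    "outputs": [...],
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            "storeSettings": {
                "type": "AmazonS3ReadSettings",
                "recursive": true
            }
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": {
                "type": "AzureBlobFSWriteSettings"
            }
        }
    }
}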

A few points to note:

  • Resume happens at the file level. If the Copy activity fails while copying a file, that specific file is re-copied in the next run.
  • For resume to work properly, do not change the Copy activity settings between the reruns.
  • When you copy data from Amazon S3, Azure Blob, Azure Data Lake Storage Gen2, and Google Cloud Storage, the Copy activity can resume from an arbitrary number of copied files. For the other file-based connectors as source, the Copy activity currently supports resuming from a limited number of files, usually in the range of tens of thousands, varying with the length of the file paths; files beyond this number are re-copied during reruns.

For scenarios other than binary file copy, a Copy activity rerun starts from the beginning.

Preserve metadata along with data

While copying data from source to sink, in scenarios like data lake migration, you can also choose to preserve the metadata and ACLs along with the data by using the Copy activity. See Preserve metadata for details.
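 
For example, the following sketch shows the shape of a Copy activity's typeProperties for an Azure Data Lake Storage Gen2-to-Gen2 binary copy that preserves ACLs, owner, and group. Which items can actually be preserved depends on the source and sink stores:

"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings"
        }
    },
    "preserve": [
        "ACL",
        "Owner",
        "Group"
    ]
}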

Schema and data type mapping

See Schema and data type mapping for information about how the Copy activity maps your source data to your sink.

Add additional columns during copy

In addition to copying data from the source data store to the sink, you can also configure additional data columns to be copied along to the sink. For example:

  • When copying from a file-based source, store the relative file path as an additional column to trace which file the data comes from.
  • Add a column with an ADF expression to attach ADF system variables (like pipeline name or pipeline ID), or to store another dynamic value from an upstream activity's output.
  • Add a column with a static value to meet your downstream consumption needs.

You can find this configuration on the copy activity Source tab:

(Screenshot: adding additional columns in the copy activity)

Tip

This feature works with the latest dataset model. If you don't see this option in the UI, try creating a new dataset.

To configure it programmatically, add the additionalColumns property in your copy activity source:

Property  Description  Required
additionalColumns  Add additional data columns to copy to the sink. Each object under the additionalColumns array represents an extra column. The name defines the column name, and the value indicates the data value of that column. Allowed data values are: $$FILEPATH (a reserved variable that stores the source files' relative path to the folder path specified in the dataset; applies to file-based sources), Expression, and Static value.  No

Example:

"activities":[
    {
        "name": "CopyWithAdditionalColumns",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "<source type>",
                "additionalColumns": [
                    {
                        "name": "filePath",
                        "value": "$$FILEPATH"
                    },
                    {
                        "name": "pipelineName",
                        "value": {
                            "value": "@pipeline().Pipeline",
                            "type": "Expression"
                        }
                    },
                    {
                        "name": "staticValue",
                        "value": "sampleValue"
                    }
                ],
                ...
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Fault tolerance

By default, the Copy activity stops copying data and returns a failure when source data rows are incompatible with sink data rows. To make the copy succeed, you can configure the Copy activity to skip and log the incompatible rows and copy only the compatible data. See Copy activity fault tolerance for details.
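 
A sketch of the relevant typeProperties follows. The linked service reference and path are hypothetical placeholders for where the skipped rows would be logged:

"typeProperties": {
    "source": { "type": "<source type>" },
    "sink": { "type": "<sink type>" },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "<Azure Blob storage or ADLS Gen2 linked service>",
            "type": "LinkedServiceReference"
        },
        "path": "errorcontainer/incompatiblerows"
    }
}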

Next steps

See the following quickstarts, tutorials, and samples: