Copy Data tool in Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

The Azure Data Factory Copy Data tool eases and optimizes the process of ingesting data into a data lake, which is usually a first step in an end-to-end data integration scenario. It saves time, especially when you use Azure Data Factory to ingest data from a data source for the first time. Some of the benefits of using this tool are:

  • When using the Azure Data Factory Copy Data tool, you do not need to understand Data Factory definitions for linked services, datasets, pipelines, activities, and triggers.
  • The flow of the Copy Data tool is intuitive for loading data into a data lake. The tool automatically creates all the necessary Data Factory resources to copy data from the selected source data store to the selected destination/sink data store; a sample of such a definition follows this list.
  • The Copy Data tool helps you validate the data being ingested at authoring time, which helps you avoid potential errors from the start.
  • If you need to implement complex business logic to load data into a data lake, you can still edit the Data Factory resources created by the Copy Data tool by using the per-activity authoring in the Data Factory UI.
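
For illustration, here is a minimal sketch of the kind of resource definition the tool authors on your behalf, in this case a linked service for Azure Blob Storage. The resource name and connection string placeholder are assumptions for this example, not actual tool output:

{
    "name": "SourceBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "<Azure Storage connection string>"
        }
    }
}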

The following table provides guidance on when to use the Copy Data tool vs. per-activity authoring in the Data Factory UI:

| Copy Data tool | Per-activity (Copy activity) authoring |
| --- | --- |
| You want to easily build a data loading task without learning about Azure Data Factory entities (linked services, datasets, pipelines, and so on). | You want to implement complex and flexible logic for loading data into a data lake. |
| You want to quickly load a large number of data artifacts into a data lake. | You want to chain the Copy activity with subsequent activities for cleansing or processing data. |

To start the Copy Data tool, click the Copy Data tile on the home page of your data factory.

Screenshot: Get started page - link to the Copy Data tool

Intuitive flow for loading data into a data lake

This tool allows you to easily move data from a wide variety of sources to destinations in minutes with an intuitive flow:

  1. Configure settings for the source.

  2. Configure settings for the destination.

  3. Configure advanced settings for the copy operation, such as column mapping, performance settings, and fault tolerance settings.

  4. Specify a schedule for the data loading task.

  5. Review the summary of the Data Factory entities to be created.

  6. Edit the pipeline to update settings for the copy activity as needed.

    The tool is designed with big data in mind from the start, with support for diverse data and object types. You can use it to move hundreds of folders, files, or tables. The tool also supports automatic data preview, schema capture and automatic mapping, and data filtering.

Screenshot: Copy Data tool
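
Behind this flow, the tool authors ordinary Data Factory JSON. As a rough, assumed sketch, the generated pipeline with its copy activity can look like the following; the dataset names, source/sink types, and property values are placeholders rather than actual tool output, and the enableSkipIncompatibleRow flag shown here corresponds to the fault tolerance setting from step 3:

{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToAzureSql",
                "type": "Copy",
                "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "SinkAzureSqlDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" },
                    "enableSkipIncompatibleRow": true
                }
            }
        ]
    }
}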

Automatic data preview

You can preview part of the data from the selected source data store, which allows you to validate the data that is being copied. In addition, if the source data is in a text file, the Copy Data tool parses the text file to automatically detect the row and column delimiters and the schema.

Screenshot: File settings

After the detection:

Screenshot: Detected file settings and preview
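
The detected delimiters and header setting end up in the generated source dataset. A minimal, assumed sketch of such a delimited text dataset follows; the resource names, container, and folder are placeholders, not values detected by the tool:

{
    "name": "SourceDelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "SourceBlobStorageLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
            "location": { "type": "AzureBlobStorageLocation", "container": "inputcontainer", "folderPath": "inputfolder" },
            "columnDelimiter": ",",
            "rowDelimiter": "\n",
            "firstRowAsHeader": true
        }
    }
}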

Schema capture and automatic mapping

In many cases, the schema of the data source is not the same as the schema of the data destination. In this scenario, you need to map columns from the source schema to columns from the destination schema.

The Copy Data tool monitors and learns your behavior as you map columns between the source and destination stores. After you pick one or a few columns from the source data store and map them to the destination schema, the Copy Data tool analyzes the pattern of the column pairs you picked on both sides. It then applies the same pattern to the rest of the columns, so after just a few clicks you see all the columns mapped to the destination the way you want. If you are not satisfied with the column mapping that the Copy Data tool suggests, you can ignore it and continue mapping the columns manually. Meanwhile, the Copy Data tool keeps learning and updating the pattern, and ultimately reaches the right pattern for the column mapping you want to achieve.
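
The mapping you accept (or adjust) is stored on the copy activity as a translator under its typeProperties. A minimal, assumed fragment of an explicit column mapping looks like this; the column names are made up for illustration:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "PersonId" }, "sink": { "name": "Id" } },
        { "source": { "name": "FullName" }, "sink": { "name": "Name" } }
    ]
}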

Note

When copying data from SQL Server or Azure SQL Database into Azure SQL Data Warehouse, if the table does not exist in the destination store, the Copy Data tool can create the table automatically by using the source schema.

Filter data

You can filter the source data to select only the data that needs to be copied to the sink data store. Filtering reduces the volume of data to be copied to the sink data store and therefore enhances the throughput of the copy operation. The Copy Data tool provides a flexible way to filter data in a relational database by using the SQL query language, or to filter files in an Azure blob folder.

Filter data in a database

The following screenshot shows a SQL query to filter the data.

Screenshot: Filter data in a database
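
In the generated copy activity, a query entered here typically becomes the source's reader query under typeProperties. An assumed fragment, with an illustrative table name and filter condition:

"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT * FROM SalesOrders WHERE OrderDate >= '2016-03-01'"
}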

Filter data in an Azure blob folder

You can use variables in the folder path to copy data from a folder. The supported variables are: {year}, {month}, {day}, {hour}, and {minute}. For example: inputfolder/{year}/{month}/{day}.

Suppose that you have input folders in the following format:

2016/03/01/01
2016/03/01/02
2016/03/01/03
...

Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02), and click Choose. You should see 2016/03/01/02 in the text box.

Then, replace 2016 with {year}, 03 with {month}, 01 with {day}, and 02 with {hour}, and press the Tab key. You should see drop-down lists for selecting the format of these four variables:

Screenshot: Filter file or folder

When it creates the pipeline, the Copy Data tool generates parameters with expressions, functions, and system variables that can be used to represent {year}, {month}, {day}, {hour}, and {minute}.
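
For example, the folder path on the generated source dataset can be an expression built from a pipeline parameter. The following is an assumed sketch of what such an expression might look like; the parameter name windowStart is a placeholder, not a name guaranteed by the tool:

"folderPath": {
    "value": "inputfolder/@{formatDateTime(pipeline().parameters.windowStart, 'yyyy')}/@{formatDateTime(pipeline().parameters.windowStart, 'MM')}/@{formatDateTime(pipeline().parameters.windowStart, 'dd')}/@{formatDateTime(pipeline().parameters.windowStart, 'HH')}",
    "type": "Expression"
}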

Scheduling options

You can run the copy operation once or on a schedule (hourly, daily, and so on). These options can be used for the connectors across different environments, including on-premises, cloud, and local desktop.

A one-time copy operation enables data movement from a source to a destination only once. It applies to data of any size and any supported format. The scheduled copy allows you to copy data on a recurrence that you specify. You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.

Screenshot: Scheduling options
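
A scheduled copy is backed by a trigger resource attached to the generated pipeline. As an assumed illustration only, an hourly schedule trigger could look roughly like this; the trigger name, pipeline name, and start time are placeholders:

{
    "name": "HourlyCopyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2016-03-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            { "pipelineReference": { "referenceName": "CopyPipeline", "type": "PipelineReference" } }
        ]
    }
}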

Next steps

Try these tutorials that use the Copy Data tool: