Bulk copy from a database with a control table

Applies to: Azure Data Factory, Azure Synapse Analytics

To copy data from a data warehouse in Oracle Server, Netezza, Teradata, or SQL Server to Azure Synapse Analytics, you have to load huge amounts of data from multiple tables. Usually, the data has to be partitioned in each table so that you can load rows with multiple threads in parallel from a single table. This article describes a template to use in these scenarios.

Note: If you want to copy data from a small number of tables with relatively small data volumes to Azure Synapse Analytics, it's more efficient to use the Azure Data Factory Copy Data tool. The template described in this article is more than you need for that scenario.

About this solution template

This template retrieves a list of source database partitions to copy from an external control table. Then it iterates over each partition in the source database and copies the data to the destination.

The template contains three activities:

  • Lookup retrieves the list of source database partitions from an external control table.
  • ForEach gets the partition list from the Lookup activity and passes each partition to the Copy activity.
  • Copy copies each partition from the source database store to the destination store.

The template defines the following parameters:

  • Control_Table_Name is your external control table, which stores the partition list for the source database.
  • Control_Table_Schema_PartitionID is the name of the column in your external control table that stores each partition ID. Make sure that the partition ID is unique for each partition in the source database.
  • Control_Table_Schema_SourceTableName is the name of the column in your external control table that stores each source table name from the source database.
  • Control_Table_Schema_FilterQuery is the name of the column in your external control table that stores the filter query that retrieves the data from each partition in the source database. For example, if you partitioned the data by year, the query stored in each row might be similar to 'select * from datasource where LastModifytime >= ''2015-01-01 00:00:00'' and LastModifytime <= ''2015-12-31 23:59:59.999'''.
  • Data_Destination_Folder_Path is the path where the data is copied into your destination store (applicable when the destination you choose is "File System").
  • Data_Destination_Container is the root folder path where the data is copied to in your destination store.
  • Data_Destination_Directory is the directory path under the root where the data is copied into your destination store.

The last three parameters, which define the path in your destination store, are only visible if the destination you choose is file-based storage. If you choose "Azure Synapse Analytics" as the destination store, these parameters are not required, but the table names and the schema in Azure Synapse Analytics must be the same as the ones in the source database.

How to use this solution template

  1. Create a control table in SQL Server or Azure SQL Database to store the source database partition list for bulk copy. In the following example, there are five partitions in the source database. Three partitions are for datasource_table, and two are for project_table. The column LastModifytime is used to partition the data in table datasource_table from the source database. The query used to read the first partition is 'select * from datasource_table where LastModifytime >= ''2015-01-01 00:00:00'' and LastModifytime <= ''2015-12-31 23:59:59.999'''. You can use similar queries to read data from the other partitions.

           Create table ControlTableForTemplate
           (
           PartitionID int,
           SourceTableName  varchar(255),
           FilterQuery varchar(255)
           );
    
           INSERT INTO ControlTableForTemplate
           (PartitionID, SourceTableName, FilterQuery)
           VALUES
           (1, 'datasource_table','select * from datasource_table where LastModifytime >= ''2015-01-01 00:00:00'' and LastModifytime <= ''2015-12-31 23:59:59.999'''),
           (2, 'datasource_table','select * from datasource_table where LastModifytime >= ''2016-01-01 00:00:00'' and LastModifytime <= ''2016-12-31 23:59:59.999'''),
           (3, 'datasource_table','select * from datasource_table where LastModifytime >= ''2017-01-01 00:00:00'' and LastModifytime <= ''2017-12-31 23:59:59.999'''),
           (4, 'project_table','select * from project_table where ID >= 0 and ID < 1000'),
           (5, 'project_table','select * from project_table where ID >= 1000 and ID < 2000');
    
  2. Go to the Bulk Copy from Database template. Create a New connection to the external control table that you created in step 1.

    Screenshot: Create a new connection to the control table

  3. Create a New connection to the source database that you're copying data from.

    Screenshot: Create a new connection to the source database

  4. Create a New connection to the destination data store that you're copying the data to.

    Screenshot: Create a new connection to the destination store

  5. Select Use this template.

  6. You see the pipeline, as shown in the following example:

    Screenshot: Review the pipeline

  7. Select Debug, enter the Parameters, and then select Finish.

    Screenshot: Select Debug

  8. You see results that are similar to the following example:

    Screenshot: Review the results

  9. (Optional) If you chose "Azure Synapse Analytics" as the data destination, you must enter a connection to Azure Blob storage for staging, as required by Azure Synapse Analytics PolyBase. The template automatically generates a container path for your Blob storage. After the pipeline runs, check whether the container was created.

    Screenshot: PolyBase settings

Next steps