使用 Azure 数据工厂模板从数据库批量复制到 Azure 数据资源管理器Copy in bulk from a database to Azure Data Explorer by using the Azure Data Factory template

Azure 数据资源管理器是一个快速、完全托管的数据分析服务。Azure Data Explorer is a fast, fully managed, data-analytics service. 它可以实时分析从应用程序、网站和 IoT 设备等许多源流式传输的大量数据。It offers real-time analysis on large volumes of data that stream from many sources, such as applications, websites, and IoT devices.

若要将数据从 Oracle 服务器、Netezza、Teradata 或 SQL Server 中的数据库复制到 Azure 数据资源管理器,你必须从多个表加载大量数据。To copy data from a database in Oracle Server, Netezza, Teradata, or SQL Server to Azure Data Explorer, you have to load huge amounts of data from multiple tables. 通常情况下,必须在每个表中对数据分区,以便使用多个线程从单个表并行加载行。Usually, the data has to be partitioned in each table so that you can load rows with multiple threads in parallel from a single table. 本文介绍一个可以在这些情况下使用的模板。This article describes a template to use in these scenarios.

Azure 数据工厂模板是预定义的数据工厂管道。Azure Data Factory templates are predefined Data Factory pipelines. 这些模板可以帮助你快速开始使用数据工厂,缩短数据集成项目的开发时间。These templates can help you get started quickly with Data Factory and reduce development time on data integration projects.

可以使用 LookupForEach 活动创建“从数据库批量复制到 Azure 数据资源管理器”模板。You create the Bulk Copy from Database to Azure Data Explorer template by using Lookup and ForEach activities. 为了更快地复制数据,可以使用模板针对每个数据库或每个表创建多个管道。For faster data copying, you can use the template to create many pipelines per database or per table.

重要

请务必使用与所要复制的数据量相适应的工具。Be sure to use the tool that's appropriate for the quantity of data you want to copy.

  • 使用“从数据库批量复制到 Azure 数据资源管理器”模板可将 SQL Server 和 Google BigQuery 等数据库中的大量数据复制到 Azure 数据资源管理器。 Use the Bulk Copy from Database to Azure Data Explorer template to copy large amounts of data from databases such as SQL server and Google BigQuery to Azure Data Explorer.
  • 使用数据工厂复制数据工具可将少量或中等数量的数据复制到 Azure 数据资源管理器。Use the Data Factory Copy Data tool to copy a few tables with small or moderate amounts of data into Azure Data Explorer.

必备条件Prerequisites

创建 ControlTableDatasetCreate ControlTableDataset

ControlTableDataset 指示要将源中的哪些数据复制到管道中的目标。ControlTableDataset indicates what data will be copied from the source to the destination in the pipeline. 行数指示复制这些数据所需的管道总数。The number of rows indicates the total number of pipelines that are needed to copy the data. 应将 ControlTableDataset 定义为源数据库的一部分。You should define ControlTableDataset as part of the source database.

以下代码演示了 SQL Server 源表格式的示例:An example of the SQL Server source table format is shown in the following code:

CREATE TABLE control_table (
PartitionId int,
SourceQuery varchar(255),
ADXTableName varchar(255)
);

下表描述了代码元素:The code elements are described in the following table:

propertiesProperty 说明Description 示例Example
PartitionIdPartitionId 复制顺序The copy order 11
SourceQuerySourceQuery 指示在管道运行时期间要复制哪些数据的查询The query that indicates which data will be copied during the pipeline runtime
select * from table where lastmodifiedtime LastModifytime >= ''2015-01-01 00:00:00''>
ADXTableNameADXTableName 目标表名称The destination table name MyAdxTableMyAdxTable

如果你的 ControlTableDataset 采用不同的格式,请根据自己的格式创建相应的 ControlTableDataset。If your ControlTableDataset is in a different format, create a comparable ControlTableDataset for your format.

使用“从数据库批量复制到 Azure 数据资源管理器”模板Use the Bulk Copy from Database to Azure Data Explorer template

  1. 在“开始”窗格中,选择“从模板创建管道”打开“模板库”窗格。 In the Let's get started pane, select Create pipeline from template to open the Template gallery pane.

    Azure 数据工厂的“开始”窗格

  2. 选择“从数据库批量复制到 Azure 数据资源管理器”模板。 Select the Bulk Copy from Database to Azure Data Explorer template.

    “从数据库批量复制到 Azure 数据资源管理器”模板

  3. 在“从数据库批量复制到 Azure 数据资源管理器”窗格中的“用户输入”下,按如下所述指定数据集: In the Bulk Copy from Database to Azure Data Explorer pane, under User Inputs, specify your datasets by doing the following:

    a.a. 在“ControlTableDataset”下拉列表中,选择控制表对应的链接服务,该表指示要将源中的哪些数据复制到目标,以及要将数据放在目标中的哪个位置。 In the ControlTableDataset drop-down list, select the linked service to the control table that indicates what data is copied from the source to the destination and where it will be placed in the destination.

    b.b. 在“SourceDataset”下拉列表中,选择源数据库对应的链接服务。 In the SourceDataset drop-down list, select the linked service to the source database.

    c.c. 在“AzureDataExplorerTable”下拉列表中,选择“Azure 数据资源管理器”表。 In the AzureDataExplorerTable drop-down list, select the Azure Data Explorer table. 如果该数据集不存在,请创建 Azure 数据资源管理器链接服务以添加该数据集。If the dataset doesn't exist, create the Azure Data Explorer linked service to add the dataset.

    d.d. 选择“使用此模板” 。Select Use this template.

    “从数据库批量复制到 Azure 数据资源管理器”窗格

  4. 在画布中选择活动外部的某个区域,以访问模板管道。Select an area in the canvas, outside the activities, to access the template pipeline. 选择“参数”选项卡并输入表的参数,包括“名称”(控制表名称)和“默认值”(列名称)。 Select the Parameters tab to enter the parameters for the table, including Name (control table name) and Default value (column names).

    管道参数

  5. 在“Lookup”下,选择“GetPartitionList”以查看默认设置。 Under Lookup, select GetPartitionList to view the default settings. 系统会自动创建查询。The query is automatically created.

  6. 选择命令活动“ForEachPartition”,选择“设置”选项卡,然后执行以下操作: Select the Command activity, ForEachPartition, select the Settings tab, and then do the following:

    a.a. 在“批计数”中,输入从 1 到 50 的数字。 In the Batch count box, enter a number from 1 to 50. 此数字确定了在达到 ControlTableDataset 行数之前,可并行运行的管道数。This selection determines the number of pipelines that run in parallel until the number of ControlTableDataset rows is reached.

    b.b. 为确保管道批并行运行,请不要选中“顺序”复选框。 To ensure that the pipeline batches run in parallel, do not select the Sequential check box.

    ForEachPartition 设置

    提示

    最佳做法是并行运行多个管道,以加快数据复制速度。The best practice is to run many pipelines in parallel so that your data can be copied more quickly. 若要提高效率,请将源表中的数据分区,并根据日期和表为每个管道分配一个分区。To increase efficiency, partition the data in the source table and allocate one partition per pipeline, according to date and table.

  7. 选择“全部验证”以验证 Azure 数据工厂管道,然后在“管道验证输出”窗格中查看结果。 Select Validate All to validate the Azure Data Factory pipeline, and then view the result in the Pipeline Validation Output pane.

    验证模板管道

  8. 如果需要,请选择“调试”,然后选择“添加触发器”以运行该管道。 If necessary, select Debug, and then select Add trigger to run the pipeline.

    “调试”和“运行管道”按钮

现在可以使用该模板有效地从数据库和表中复制大量数据。You can now use the template to efficiently copy large amounts of data from your databases and tables.

后续步骤Next steps