使用 Azure 数据工厂向 Azure SQL 数据仓库加载数据Load data into Azure SQL Data Warehouse by using Azure Data Factory

Azure SQL 数据仓库是一种基于云的向外扩展数据库,可以处理大量数据(关系数据和非关系数据)。Azure SQL Data Warehouse is a cloud-based, scale-out database that's capable of processing massive volumes of data, both relational and non-relational. SQL 数据仓库在大规模并行处理 (MPP) 体系结构的基础上构建,已针对企业数据仓库工作负荷进行优化。SQL Data Warehouse is built on the massively parallel processing (MPP) architecture that's optimized for enterprise data warehouse workloads. 它通过灵活地缩放存储以及独立计算提供云灵活性。It offers cloud elasticity with the flexibility to scale storage and compute independently.

现在通过使用 Azure 数据工厂,Azure SQL 数据仓库入门变得前所未有地简单。Getting started with Azure SQL Data Warehouse is now easier than ever when you use Azure Data Factory. Azure 数据工厂是一个完全托管的基于云的数据集成服务。Azure Data Factory is a fully managed cloud-based data integration service. 该服务可用于使用现有系统中的数据填充 SQL 数据仓库,在生成分析解决方案时节省时间。You can use the service to populate a SQL Data Warehouse with data from your existing system and save time when building your analytics solutions.

以下是使用 Azure 数据工厂将数据加载到 Azure SQL 数据仓库的优点:Azure Data Factory offers the following benefits for loading data into Azure SQL Data Warehouse:

  • 轻松设置:无需脚本的直观 5 步向导。Easy to set up: An intuitive 5-step wizard with no scripting required.
  • 丰富的数据存储支持:对一组丰富的本地和基于云的数据存储的内置支持。Rich data store support: Built-in support for a rich set of on-premises and cloud-based data stores. 有关详细列表,请参阅表支持的数据存储For a detailed list, see the table of Supported data stores.
  • 安全且合规:通过 HTTPS 或 ExpressRoute 传输数据。Secure and compliant: Data is transferred over HTTPS or ExpressRoute. 存在全局服务可确保数据永远不会离开地理边界。The global service presence ensures that your data never leaves the geographical boundary.
  • 通过使用 PolyBase 提供无与伦比的性能:使用 Polybase 是将数据移到 Azure SQL 数据仓库的最高效方法。Unparalleled performance by using PolyBase: Polybase is the most efficient way to move data into Azure SQL Data Warehouse. 使用暂存 blob 功能,可以从所有类型的数据存储(包括 Azure Blob 存储)实现高加载速度。Use the staging blob feature to achieve high load speeds from all types of data stores, including Azure Blob storage. (默认情况下,Polybase 支持 Azure Blob 存储。)有关详细信息,请参阅复制活动性能(Polybase supports Azure Blob storage by default.) For details, see Copy activity performance.

本文介绍如何使用数据工厂复制数据工具将数据从 Azure SQL 数据库加载至 Azure SQL 数据仓库 。This article shows you how to use the Data Factory Copy Data tool to load data from Azure SQL Database into Azure SQL Data Warehouse. 可以遵循类似步骤,从其他类型的数据存储中复制数据。You can follow similar steps to copy data from other types of data stores.

先决条件Prerequisites

  • Azure 订阅:如果没有 Azure 订阅,可在开始前创建一个 1 元人民币试用帐户Azure subscription: If you don't have an Azure subscription, create a 1rmb trial account before you begin.
  • Azure SQL 数据仓库:此数据仓库保存从 SQL 数据库复制的数据。Azure SQL Data Warehouse: The data warehouse holds the data that's copied over from the SQL database. 如果没有 Azure SQL 数据仓库,请参阅创建 Azure SQL 数据仓库中的说明。If you don't have an Azure SQL Data Warehouse, see the instructions in Create a SQL Data Warehouse.
  • Azure SQL 数据库:本教程使用 Adventure Works LT 示例数据从 Azure SQL 数据库复制数据。Azure SQL Database: This tutorial copies data from an Azure SQL database with Adventure Works LT sample data. 可以遵照创建 Azure SQL 数据库中的说明创建 SQL 数据库。You can create a SQL database by following the instructions in Create an Azure SQL database.
  • Azure 存储帐户:Azure 存储用作大容量复制操作中的暂存 blob 。Azure storage account: Azure Storage is used as the staging blob in the bulk copy operation. 如果没有 Azure 存储帐户,请参阅创建存储帐户中的说明。If you don't have an Azure storage account, see the instructions in Create a storage account.

创建数据工厂Create a data factory

  1. 在左侧菜单中,选择“创建资源” > “数据 + 分析” > “数据工厂” :On the left menu, select Create a resource > Data + Analytics > Data Factory:

    在“新建”窗格中选择“数据工厂”

  2. 在“新建数据工厂”页中,为下图中所示的字段提供值 :In the New data factory page, provide values for the fields that are shown in the following image:

    “新建数据工厂”页

    • 名称:输入 Azure 数据工厂的全局唯一名称。Name: Enter a globally unique name for your Azure data factory. 如果收到错误“数据工厂名称 "LoadSQLDWDemo" 不可用”,请输入不同的数据工厂名称。If you receive the error "Data factory name "LoadSQLDWDemo" is not available," enter a different name for the data factory. 例如,可以使用名称 yourname ADFTutorialDataFactoryFor example, you could use the name yournameADFTutorialDataFactory. 请重试创建数据工厂。Try creating the data factory again. 有关数据工厂项目的命名规则,请参阅数据工厂命名规则For the naming rules for Data Factory artifacts, see Data Factory naming rules.
    • 订阅:选择要在其中创建数据工厂的 Azure 订阅。Subscription: Select your Azure subscription in which to create the data factory.
    • 资源组:从下拉列表中选择现有资源组,或选择“新建” 选项并输入资源组的名称。Resource Group: Select an existing resource group from the drop-down list, or select the Create new option and enter the name of a resource group. 若要了解有关资源组的详细信息,请参阅 使用资源组管理 Azure 资源To learn about resource groups, see Using resource groups to manage your Azure resources.
    • 版本:选择“V2” 。Version: Select V2.
    • 位置:选择数据工厂的位置。Location: Select the location for the data factory. 下拉列表中仅显示支持的位置。Only supported locations are displayed in the drop-down list. 数据工厂使用的数据存储可以在其他位置和区域中。The data stores that are used by data factory can be in other locations and regions. 这些数据存储包括 Azure 存储、Azure SQL 数据库,等等。These data stores include Azure Storage, Azure SQL Database, and so on.
  3. 选择“创建” 。Select Create.

  4. 创建操作完成后,请转到数据工厂。After creation is complete, go to your data factory. 此时会看到“数据工厂” 主页,如下图所示:You see the Data Factory home page as shown in the following image:

    数据工厂主页

    选择“创作和监视”磁贴,在单独的选项卡中启动数据集成应用程序 。Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.

将数据载入 Azure SQL 数据仓库Load data into Azure SQL Data Warehouse

  1. 在“入门”页中,单击“复制数据”磁贴启动“复制数据”工具 :In the Get started page, select the Copy Data tile to launch the Copy Data tool:

    “复制数据”工具磁贴

  2. 在“属性”页中,为“任务名称”字段指定“CopyFromSQLToSQLDW”,然后选择“下一步” :In the Properties page, specify CopyFromSQLToSQLDW for the Task name field, and select Next:

    “属性”页

  3. 在“源数据存储”页上,完成以下步骤 :In the Source data store page, complete the following steps:

    a.a. 单击“+ 创建新连接” :click + Create new connection:

    “源数据存储”页

    b.b. 从库中选择“Azure SQL 数据库”,然后选择“继续” 。Select Azure SQL Database from the gallery, and select Continue. 可以在搜索框中键入“SQL”以筛选连接器。You can type "SQL" in the search box to filter the connectors.

    选择 Azure SQL DB

    c.c. 在“新建链接服务”页上,从下拉列表中选择服务器名称和 DB 名称,指定用户名和密码 。In the New Linked Service page, select your server name and DB name from the dropdown list, and specify the username and password. 单击“测试连接”以验证设置,然后选择“完成” 。Click Test connection to validate the settings, then select Finish.

    配置 Azure SQL DB

    d.d. 选择新创建的链接服务作为源,然后单击“下一步”。 Select the newly created linked service as source, then click Next.

    选择源链接服务

  4. 在“选择要从中复制数据的表或使用自定义查询”页中,输入 SalesLT 以筛选表。 In the Select tables from which to copy the data or use a custom query page, enter SalesLT to filter the tables. 选中“(全选)”复选框以便对副本使用所有表,然后选择“下一步”: Choose the (Select all) box to use all of the tables for the copy, and then select Next:

    选择源表

  5. 在“目标数据存储”页上,完成以下步骤 :In the Destination data store page, complete the following steps:

    a.a. 单击“+ 创建新连接”来添加连接 Click + Create new connection to add a connection

    接收器数据存储页

    b.b. 从库中选择“Azure SQL 数据仓库”,然后选择“下一步” 。Select Azure SQL Data Warehouse from the gallery, and select Next.

    选择 Azure SQL DW

    c.c. 在“新建链接服务”页上,从下拉列表中选择服务器名称和 DB 名称,指定用户名和密码 。In the New Linked Service page, select your server name and DB name from the dropdown list, and specify the username and password. 单击“测试连接”以验证设置,然后选择“完成” 。Click Test connection to validate the settings, then select Finish.

    配置 Azure SQL DW

    d.d. 选择新创建的链接服务作为接收器,然后单击“下一步” 。Select the newly created linked service as sink, then click Next.

    选择接收器链接服务

  6. 在“表映射”页中查看内容并选择“下一步”。 In the Table mapping page, review the content, and select Next. 此时会显示智能表映射。An intelligent table mapping displays. 源表已根据表名映射到目标表。The source tables are mapped to the destination tables based on the table names. 如果目标中不存在表,则默认情况下 Azure 数据工厂将创建一个具有相同名称的目标表。If a source table doesn't exist in the destination, Azure Data Factory creates a destination table with the same name by default. 还可以将源表映射到现有目标表。You can also map a source table to an existing destination table.

    Note

    当 SQL Server 或 Azure SQL 数据库为源时,将应用 SQL 数据仓库接收器的自动表创建。Automatic table creation for the SQL Data Warehouse sink applies when SQL Server or Azure SQL Database is the source. 若从其他源数据存储复制数据,则在执行数据复制前,需先在接收器 Azure SQL 数据仓库中预创建架构。If you copy data from another source data store, you need to pre-create the schema in the sink Azure SQL Data Warehouse before executing the data copy.

    “表映射”页

  7. 在“架构映射”页中查看内容并选择“下一步”。 In the Schema mapping page, review the content, and select Next. 智能表映射基于列名。The intelligent table mapping is based on the column name. 如果自动创建的表的数据工厂,源和目标存储之间存在不兼容时,可能发生数据类型转换。If you let Data Factory automatically create the tables, data type conversion can occur when there are incompatibilities between the source and destination stores. 如果在源列与目标列之间进行不受支持的数据类型转换,会显示错误消息以及相应的表。If there's an unsupported data type conversion between the source and destination column, you see an error message next to the corresponding table.

    “架构映射”页

  8. 在“设置”页上,完成以下步骤 :In the Settings page, complete the following steps:

    a.a. 在“暂存设置”部分,单击“+ 新建”,新建临时存储 。In Staging settings section, click + New to new a staging storage. 该存储用于在使用 PolyBase 将数据加载至 SQL 数据仓库前暂存数据。The storage is used for staging the data before it loads into SQL Data Warehouse by using PolyBase. 复制完成后,会自动清除 Azure 存储中的临时数据。After the copy is complete, the interim data in Azure Storage is automatically cleaned up.

    配置暂存

    b.b. 在“新链接服务”页上,选择存储帐户,然后选择“完成” 。In the New Linked Service page, select your storage account, and select Finish.

    配置 Azure 存储

    c.c. 在“高级设置”部分,取消选择“使用默认类型”选项,然后选择“下一步” 。In the Advanced settings section, deselect the Use type default option, then select Next.

    配置 PolyBase

  9. 在“摘要” 页中检查设置,然后选择“下一步” :In the Summary page, review the settings, and select Next:

    “摘要”页

  10. 在“部署”页中,选择“监视”可以监视管道(任务) :In the Deployment page, select Monitor to monitor the pipeline (task):

    “部署”页

  11. 请注意,界面中已自动选择左侧的“监视”选项卡。 Notice that the Monitor tab on the left is automatically selected. “操作”列中包含用于查看活动运行详细信息以及用于重新运行管道的链接 :The Actions column includes links to view activity run details and to rerun the pipeline:

    监视管道运行

  12. 若要查看与管道运行关联的活动运行,请选择“操作”列中的“查看活动运行”链接。 To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the Actions column. 若要切换回到管道运行视图,请选择顶部的“管道”链接 。To switch back to the pipeline runs view, select the Pipelines link at the top. 选择“刷新”可刷新列表。 Select Refresh to refresh the list.

    监视活动运行

  13. 若要监视每个复制活动的执行详细信息,请在活动监视视图中选择“操作”下的“详细信息”链接 。To monitor the execution details for each copy activity, select the Details link under Actions in the activity monitoring view. 可以监视详细信息,如从源复制到接收器的数据量、吞吐量、执行步骤以及相应的持续时间和使用的配置:You can monitor details like the volume of data copied from the source to the sink, data throughput, execution steps with corresponding duration, and used configurations:

    监视活动运行详细信息

后续步骤Next steps

请转至下列文章,了解有关 Azure SQL 数据仓库支持的相关信息:Advance to the following article to learn about Azure SQL Data Warehouse support: