Copy data from Azure Blob storage to a database in Azure SQL Database by using Azure Data Factory

APPLIES TO: ✔️ Azure Data Factory  ✖️ Azure Synapse Analytics (Preview)

In this tutorial, you create a data factory by using the Azure Data Factory user interface (UI). The pipeline in this data factory copies data from Azure Blob storage to a database in Azure SQL Database. The configuration pattern in this tutorial applies to copying from a file-based data store to a relational data store. For a list of data stores supported as sources and sinks, see the supported data stores table.

In this tutorial, you perform the following steps:

  • Create a data factory.
  • Create a pipeline with a copy activity.
  • Test run the pipeline.
  • Trigger the pipeline manually.
  • Trigger the pipeline on a schedule.
  • Monitor the pipeline and activity runs.

Prerequisites

  • Azure subscription. If you don't have an Azure subscription, create an Azure trial account before you begin.
  • Azure storage account. You use Blob storage as a source data store. If you don't have a storage account, see Create an Azure storage account for steps to create one.
  • Azure SQL Database. You use the database as a sink data store. If you don't have a database in Azure SQL Database, see Create a database in Azure SQL Database for steps to create one.

Create a blob and a SQL table

Now, prepare your Blob storage and SQL database for the tutorial by performing the following steps.

Create a source blob

  1. Launch Notepad. Copy the following text, and save it as an emp.txt file on your disk:

    FirstName,LastName
    John,Doe
    Jane,Doe
    
  2. Create a container named adftutorial in your Blob storage. Create a folder named input in this container. Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure Storage Explorer to do these tasks, or script them as sketched below.
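
A minimal scripted sketch using the azure-storage-blob (v12) Python package; the connection string is a placeholder you'd copy from your storage account's Access keys page:

    # Sketch: create the adftutorial container and upload emp.txt.
    # <your-storage-connection-string> is a placeholder.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<your-storage-connection-string>")
    container = service.get_container_client("adftutorial")
    if not container.exists():
        container.create_container()

    # Blob storage has no real folders; the "input/" prefix in the blob
    # name acts as the input folder.
    with open("emp.txt", "rb") as data:
        container.upload_blob(name="input/emp.txt", data=data, overwrite=True)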

Create a sink SQL table

  1. Use the following SQL script to create the dbo.emp table in your database:

    CREATE TABLE dbo.emp
    (
        ID int IDENTITY(1,1) NOT NULL,
        FirstName varchar(50),
        LastName varchar(50)
    )
    GO
    
    CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);
    
  2. Allow Azure services to access SQL Server. Ensure that Allow access to Azure services is turned ON for your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this setting, go to your logical SQL server > Security > Firewalls and virtual networks, and set the Allow access to Azure services option to ON. (A scripted equivalent is sketched below.)
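
Programmatically, creating the reserved firewall rule AllowAllWindowsAzureIps with the 0.0.0.0 address has the same effect as turning Allow access to Azure services ON. A hedged sketch with the azure-mgmt-sql package (a track-2 SDK version is assumed; angle-bracket names are placeholders):

    # Sketch, assuming azure-mgmt-sql (track 2) and azure-identity.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient

    sql_client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")
    sql_client.firewall_rules.create_or_update(
        "<resource-group>",
        "<logical-sql-server-name>",
        "AllowAllWindowsAzureIps",  # reserved name for the Azure-services rule
        {"start_ip_address": "0.0.0.0", "end_ip_address": "0.0.0.0"},
    )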

Create a data factory

In this step, you create a data factory and start the Data Factory UI to create a pipeline in the data factory.

  1. Open Microsoft Edge or Google Chrome. Currently, the Data Factory UI is supported only in the Microsoft Edge and Google Chrome web browsers.

  2. On the left menu, select Create a resource > Data + Analytics > Data Factory.

  3. On the New data factory page, under Name, enter ADFTutorialDataFactory.

    The name of the Azure data factory must be globally unique. If you receive an error message about the name value, enter a different name for the data factory (for example, yournameADFTutorialDataFactory). For naming rules for Data Factory artifacts, see Data Factory naming rules.

    (Screenshot: the New data factory page.)

  4. Select the Azure subscription in which you want to create the data factory.

  5. For Resource Group, take one of the following steps:

    a. Select Use existing, and select an existing resource group from the drop-down list.

    b. Select Create new, and enter the name of a resource group.

    To learn about resource groups, see Use resource groups to manage your Azure resources.

  6. Under Version, select V2.

  7. Under Location, select a location for the data factory. Only supported locations are displayed in the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example, Azure HDInsight) used by the data factory can be in other regions.

  8. Select Create.

  9. After the creation is finished, you see a notice in the Notifications center. Select Go to resource to navigate to the Data factory page.

  10. Select Author & Monitor to launch the Data Factory UI in a separate tab. (The factory itself can also be created programmatically, as sketched below.)
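
A minimal sketch with the azure-mgmt-datafactory Python package (v2+, which accepts azure-identity credentials; angle-bracket values are placeholders):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import Factory

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    factory = adf_client.factories.create_or_update(
        "<resource-group>",
        "ADFTutorialDataFactory",  # must be globally unique
        Factory(location="East US"),
    )
    print(factory.provisioning_state)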

Create a pipeline

In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:

  1. Create the linked service.
  2. Create input and output datasets.
  3. Create a pipeline.

In this tutorial, you start by creating the pipeline. Then you create linked services and datasets when you need them to configure the pipeline.

  1. On the Let's get started page, select Create pipeline.

    (Screenshot: Create pipeline.)

  2. In the General panel under Properties, specify CopyPipeline for Name. Then collapse the panel by clicking the Properties icon in the top-right corner.
  3. In the Activities tool box, expand the Move and Transform category, and drag the Copy Data activity from the tool box to the pipeline designer surface. Specify CopyFromBlobToSql for Name.

    (Screenshot: Copy activity.)

Configure source

Tip

In this tutorial, you use Account key as the authentication type for your source data store, but you can choose other supported authentication methods if needed: SAS URI, Service Principal, and Managed Identity. Refer to the corresponding sections in this article for details. To store secrets for data stores securely, we also recommend using Azure Key Vault. Refer to this article for detailed illustrations.

  1. Go to the Source tab. Select + New to create a source dataset.

  2. In the New Dataset dialog box, select Azure Blob Storage, and then select Continue. The source data is in Blob storage, so you select Azure Blob Storage for the source dataset.

  3. In the Select Format dialog box, choose the format type of your data, and then select Continue.

  4. In the Set Properties dialog box, enter SourceBlobDataset for Name. Select the checkbox for First row as header. Under the Linked service text box, select + New.

  5. In the New Linked Service (Azure Blob Storage) dialog box, enter AzureStorageLinkedService as the name, and select your storage account from the Storage account name list. Test the connection, and then select Create to deploy the linked service. (A scripted equivalent is sketched at the end of this section.)

  6. After the linked service is created, you're taken back to the Set properties page. Next to File path, select Browse.

  7. Navigate to the adftutorial/input folder, select the emp.txt file, and then select OK.

  8. Select OK. You're automatically taken to the pipeline page. On the Source tab, confirm that SourceBlobDataset is selected. To preview data on this page, select Preview data.

    (Screenshot: Source dataset.)
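
For reference, the AzureStorageLinkedService created in step 5 can also be defined with the SDK (same placeholder names as the factory sketch earlier; the connection string is a placeholder):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        AzureStorageLinkedService, LinkedServiceResource, SecureString)

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
        connection_string=SecureString(value="<your-storage-connection-string>")))
    adf_client.linked_services.create_or_update(
        "<resource-group>", "ADFTutorialDataFactory",
        "AzureStorageLinkedService", storage_ls)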

Configure sink

Tip

In this tutorial, you use SQL authentication as the authentication type for your sink data store, but you can choose other supported authentication methods if needed: Service Principal and Managed Identity. Refer to the corresponding sections in this article for details. To store secrets for data stores securely, we also recommend using Azure Key Vault. Refer to this article for detailed illustrations.

  1. Go to the Sink tab, and select + New to create a sink dataset.

  2. In the New Dataset dialog box, enter "SQL" in the search box to filter the connectors, select Azure SQL Database, and then select Continue. In this tutorial, you copy data to a SQL database.

  3. In the Set Properties dialog box, enter OutputSqlDataset for Name. From the Linked service drop-down list, select + New. A dataset must be associated with a linked service. The linked service has the connection string that Data Factory uses to connect to SQL Database at runtime. The dataset specifies the container, folder, and the file (optional) to which the data is copied.

  4. In the New Linked Service (Azure SQL Database) dialog box, take the following steps (a scripted equivalent is sketched at the end of this section):

    a. Under Name, enter AzureSqlDatabaseLinkedService.

    b. Under Server name, select your SQL Server instance.

    c. Under Database name, select your database.

    d. Under User name, enter the name of the user.

    e. Under Password, enter the password for the user.

    f. Select Test connection to test the connection.

    g. Select Create to deploy the linked service.

    (Screenshot: Save the new linked service.)

  5. You're automatically taken back to the Set Properties dialog box. In Table, select [dbo].[emp]. Then select OK.

  6. Go to the tab with the pipeline, and in Sink Dataset, confirm that OutputSqlDataset is selected.

    (Screenshot: the pipeline tab.)

You can optionally map the schema of the source to the corresponding schema of the destination by following Schema mapping in copy activity.
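
For reference, the AzureSqlDatabaseLinkedService from step 4 can likewise be scripted (all connection values are placeholders):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        AzureSqlDatabaseLinkedService, LinkedServiceResource, SecureString)

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    sql_ls = LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(value=
            "Server=tcp:<server>.database.windows.net,1433;Database=<database>;"
            "User ID=<user>;Password=<password>;Encrypt=True;Connection Timeout=30")))
    adf_client.linked_services.create_or_update(
        "<resource-group>", "ADFTutorialDataFactory",
        "AzureSqlDatabaseLinkedService", sql_ls)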

Validate the pipeline

To validate the pipeline, select Validate from the toolbar.

You can see the JSON code associated with the pipeline by clicking Code on the upper right.
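
It looks roughly like the abridged sketch below; the exact source and sink type names depend on the format you chose and can vary by service version:

    {
        "name": "CopyPipeline",
        "properties": {
            "activities": [
                {
                    "name": "CopyFromBlobToSql",
                    "type": "Copy",
                    "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
                    "outputs": [ { "referenceName": "OutputSqlDataset", "type": "DatasetReference" } ],
                    "typeProperties": {
                        "source": { "type": "DelimitedTextSource" },
                        "sink": { "type": "AzureSqlSink" }
                    }
                }
            ]
        }
    }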

Debug and publish the pipeline

You can debug a pipeline before you publish artifacts (linked services, datasets, and pipeline) to Data Factory.

  1. To debug the pipeline, select Debug on the toolbar. You see the status of the pipeline run in the Output tab at the bottom of the window.

  2. Once the pipeline runs successfully, select Publish all in the top toolbar. This action publishes the entities (datasets and pipelines) you created to Data Factory.

  3. Wait until you see the Successfully published message. To see notification messages, click Show Notifications on the top right (bell button).

Trigger the pipeline manually

In this step, you manually trigger the pipeline you published in the previous step.

  1. Select Trigger on the toolbar, and then select Trigger Now. On the Pipeline Run page, select OK.

  2. Go to the Monitor tab on the left. You see a pipeline run that was triggered by a manual trigger. You can use the links under the PIPELINE NAME column to view activity details and to rerun the pipeline.

    (Screenshot: Monitor pipeline runs.)

  3. To see activity runs associated with the pipeline run, select the CopyPipeline link under the PIPELINE NAME column. In this example, there's only one activity, so you see only one entry in the list. For details about the copy operation, select the Details link (eyeglasses icon) under the ACTIVITY NAME column. Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view, select Refresh.

    (Screenshot: Monitor activity runs.)

  4. Verify that two more rows are added to the emp table in the database. (A scripted run-and-verify check is sketched below.)
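
A sketch that triggers a run on demand, polls until it completes, and then counts the rows (pyodbc and the Microsoft ODBC driver are assumed to be installed; angle-bracket values are placeholders):

    import time
    import pyodbc
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    run = adf_client.pipelines.create_run(
        "<resource-group>", "ADFTutorialDataFactory", "CopyPipeline", parameters={})
    status = "InProgress"
    while status in ("Queued", "InProgress"):
        time.sleep(15)
        status = adf_client.pipeline_runs.get(
            "<resource-group>", "ADFTutorialDataFactory", run.run_id).status
    print("Pipeline run finished with status:", status)

    # Each successful run should add the two rows from emp.txt.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<server>.database.windows.net;DATABASE=<database>;"
        "UID=<user>;PWD=<password>")
    print(conn.cursor().execute("SELECT COUNT(*) FROM dbo.emp").fetchone()[0])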

Trigger the pipeline on a schedule

In this step, you create a schedule trigger for the pipeline. The trigger runs the pipeline on the specified schedule, such as hourly or daily. Here you set the trigger to run every minute until the specified end datetime.

  1. Go to the Author tab on the left, above the Monitor tab.

  2. Go to your pipeline, click Trigger on the toolbar, and select New/Edit.

  3. In the Add triggers dialog box, select + New in the Choose trigger area.

  4. In the New Trigger window, take the following steps:

    a. Under Name, enter RunEveryMinute.

    b. Under End, select On Date.

    c. Under End On, open the drop-down list.

    d. Select the current day option. By default, the end day is set to the next day.

    e. Update the End Time part to be a few minutes past the current datetime. The trigger is activated only after you publish the changes. If the end time is only a couple of minutes away and you don't publish the changes by then, you don't see a trigger run.

    f. Select OK.

    g. For the Activated option, select Yes.

    h. Select OK.

    Important

    A cost is associated with each pipeline run, so set the end date appropriately.

  5. On the Edit trigger page, review the warning, and then select Save. The pipeline in this example doesn't take any parameters.

  6. Click Publish all to publish the changes.

  7. Go to the Monitor tab on the left to see the triggered pipeline runs.

    (Screenshot: Triggered pipeline runs.)

  8. To switch from the Pipeline Runs view to the Trigger Runs view, select Trigger Runs on the left side of the window.

  9. You see the trigger runs in a list.

  10. Verify that two rows per minute (for each pipeline run) are inserted into the emp table until the specified end time. (A scripted way to define such a trigger is sketched below.)
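
An equivalent trigger can be defined with the SDK. A sketch, assuming a track-2 azure-mgmt-datafactory version (begin_start is the track-2 method name; older SDKs use start):

    from datetime import datetime, timedelta, timezone
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
        TriggerPipelineReference, TriggerResource)

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    now = datetime.now(timezone.utc)
    trigger = TriggerResource(properties=ScheduleTrigger(
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="CopyPipeline"))],
        recurrence=ScheduleTriggerRecurrence(
            frequency="Minute", interval=1,
            start_time=now, end_time=now + timedelta(minutes=15))))
    adf_client.triggers.create_or_update(
        "<resource-group>", "ADFTutorialDataFactory", "RunEveryMinute", trigger)
    # Triggers fire only after they're started (published, in the UI flow).
    adf_client.triggers.begin_start(
        "<resource-group>", "ADFTutorialDataFactory", "RunEveryMinute").result()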

Next steps

The pipeline in this sample copies data from Blob storage to a database in Azure SQL Database. You learned how to:

  • Create a data factory.
  • Create a pipeline with a copy activity.
  • Test run the pipeline.
  • Trigger the pipeline manually.
  • Trigger the pipeline on a schedule.
  • Monitor the pipeline and activity runs.

Advance to the following tutorial to learn how to copy data from on-premises to the cloud: