Incrementally copy new files based on a time-partitioned file name by using the Copy Data tool

Applies to: Azure Data Factory, Azure Synapse Analytics (Preview)

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a pipeline that incrementally copies new files from Azure Blob storage to Azure Blob storage, based on a time-partitioned file name.

Note

If you're new to Azure Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you perform the following steps:

  • Create a data factory.
  • Use the Copy Data tool to create a pipeline.
  • Monitor the pipeline and activity runs.

Prerequisites

  • Azure subscription: If you don't have an Azure subscription, create a 1-RMB trial account before you begin.
  • Azure storage account: Use Blob storage as both the source and sink data store. If you don't have an Azure storage account, see the instructions in Create a storage account.

Create two containers in Blob storage

Prepare your Blob storage for the tutorial by performing these steps. (A scripted alternative using the Azure SDK is sketched after this list.)

  1. Create a container named source. Create the folder path 2020/03/17/03 in your container. Create an empty text file, and name it file1.txt. Upload file1.txt to the folder path source/2020/03/17/03 in your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.

    Screenshot: Upload the file

    Note

    Adjust the folder name to your current UTC time. For example, if the current UTC time is 3:38 AM on March 17, 2020, create the folder path source/2020/03/17/03/ by following the rule source/{Year}/{Month}/{Day}/{Hour}/.

  2. Create a container named destination. You can use various tools to perform these tasks, such as Azure Storage Explorer.
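
If you prefer to script this setup instead of using a UI tool, the following Python sketch does the same thing with the azure-storage-blob package. The package choice and the AZURE_STORAGE_CONNECTION_STRING environment variable are assumptions made for illustration; the tutorial itself requires no code.

```python
# Minimal sketch, assuming the azure-storage-blob package and a storage
# connection string in AZURE_STORAGE_CONNECTION_STRING (both assumptions;
# the tutorial only needs the portal or Azure Storage Explorer).
import os
from datetime import datetime, timezone

from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])

# Create the source and destination containers (idempotent).
for name in ("source", "destination"):
    try:
        service.create_container(name)
    except ResourceExistsError:
        pass

# Build the time-partitioned blob path from the current UTC time,
# following the source/{Year}/{Month}/{Day}/{Hour}/ rule.
now = datetime.now(timezone.utc)
blob_path = now.strftime("%Y/%m/%d/%H") + "/file1.txt"

# Upload an empty file1.txt; blob "folders" are created implicitly.
service.get_blob_client("source", blob_path).upload_blob(b"", overwrite=True)
print(f"Uploaded source/{blob_path}")
```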

Create a data factory

  1. On the left menu, select Create a resource > Data + Analytics > Data Factory:

    Screenshot: Selecting Data Factory on the New pane

  2. On the New data factory page, under Name, enter ADFTutorialDataFactory.

    The name for your data factory must be globally unique. You might receive the following error message:

    Screenshot: New data factory error message

    If you receive an error message about the name value, enter a different name for the data factory. For example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory artifacts, see Data Factory naming rules. (A hedged local pre-check of the name format is sketched after this list.)

  3. Select the Azure subscription in which to create the new data factory.

  4. For Resource Group, take one of the following steps:

    a. Select Use existing, and select an existing resource group from the drop-down list.

    b. Select Create new, and enter the name of a resource group.

    To learn about resource groups, see Use resource groups to manage your Azure resources.

  5. Under Version, select V2.

  6. Under Location, select the location for the data factory. Only supported locations are displayed in the drop-down list. The data stores (for example, Azure Storage and SQL Database) and computes (for example, Azure HDInsight) that your data factory uses can be in other locations and regions.

  7. Select Create.

  8. After creation is finished, the Data Factory home page is displayed.

  9. To launch the Azure Data Factory user interface (UI) in a separate tab, select the Author & Monitor tile.

    Screenshot: Data Factory home page
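
As a convenience, this hedged sketch pre-checks the commonly documented format constraints locally (letters, numbers, and hyphens; starts and ends with a letter or number; 3-63 characters). Treat those constraints as assumptions and defer to the linked Data Factory naming rules; global uniqueness can only be confirmed by the service itself.

```python
# Hypothetical local pre-check of a data factory name's format.
# The exact rules used here are assumptions; see Data Factory naming rules.
import re

def looks_like_valid_factory_name(name: str) -> bool:
    """Letters/digits/hyphens, starts and ends alphanumeric, 3-63 chars."""
    return re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9-]{1,61}[A-Za-z0-9]", name) is not None

print(looks_like_valid_factory_name("ADFTutorialDataFactory"))          # True
print(looks_like_valid_factory_name("yournameADFTutorialDataFactory"))  # True
print(looks_like_valid_factory_name("bad_name!"))                       # False
```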

Use the Copy Data tool to create a pipeline

  1. On the Let's get started page, select the Copy Data tile to launch the Copy Data tool.

    Screenshot: Copy Data tool tile

  2. On the Properties page, take the following steps:

    a. Under Task name, enter DeltaCopyFromBlobPipeline.

    b. Under Task cadence or Task schedule, select Run regularly on schedule.

    c. Under Trigger type, select Tumbling Window.

    d. Under Recurrence, enter 1 Hour(s).

    e. Select Next.

    The Data Factory UI creates a pipeline with the specified task name. (A sketch of how hourly tumbling windows line up follows this step.)

    Screenshot: Properties page
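
To make the trigger choice concrete: a tumbling-window trigger divides time into fixed-size, non-overlapping, contiguous windows and fires once per window, and with a one-hour recurrence the window start time is what later resolves the {year}/{month}/{day}/{hour} folder path. The following Python sketch is an illustration of that windowing, not part of the tutorial:

```python
# Illustration only: how hourly tumbling windows partition time.
# Each window is [start, start + 1h); the trigger fires once per window.
from datetime import datetime, timedelta, timezone

def hourly_windows(first_start: datetime, count: int):
    """Yield (window_start, window_end) pairs for `count` hourly windows."""
    for i in range(count):
        start = first_start + timedelta(hours=i)
        yield start, start + timedelta(hours=1)

first = datetime(2020, 3, 17, 3, tzinfo=timezone.utc)
for start, end in hourly_windows(first, 3):
    # Each window maps to one source/{Year}/{Month}/{Day}/{Hour}/ folder.
    print(f"window {start:%Y-%m-%d %H:%M} .. {end:%H:%M}"
          f" -> source/{start:%Y/%m/%d/%H}/")
```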

  3. On the Source data store page, complete the following steps:

    a. Click + Create new connection to add a connection.

    b. Select Azure Blob Storage from the gallery, and then select Continue.

    c. On the New Linked Service (Azure Blob Storage) page, enter a name for the linked service. Select your Azure subscription, and select your storage account from the Storage account name list. Test the connection, and then select Create.

    Screenshot: Source data store page

    d. Select the newly created linked service on the Source data store page, and then click Next.

  4. On the Choose the input file or folder page, do the following steps:

    a. Browse and select the source container, then select Choose.

    Screenshot shows the Choose the input file or folder dialog box.

    b. Under File loading behavior, select Incremental load: time-partitioned folder/file names.

    c. Write the dynamic folder path as source/{year}/{month}/{day}/{hour}/, and change the format as shown in the following screenshot. Check Binary copy and click Next. (A sketch of what each triggered run then copies follows this step.)

    Screenshot shows the Choose the input file or folder dialog box with a folder selected.
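
To see what this configuration amounts to, the sketch below emulates a single triggered run: it resolves the window start time into the time-partitioned prefix and binary-copies every matching blob to the same path under destination. This is assumed, illustrative behavior written against the azure-storage-blob SDK, not the pipeline's actual implementation:

```python
# Illustration (assumed behavior): what one run of the generated pipeline
# amounts to for a given tumbling-window start time.
import os
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])

def copy_window(window_start: datetime) -> None:
    """Binary-copy blobs under source/{Y}/{M}/{D}/{H}/ to destination/."""
    prefix = window_start.strftime("%Y/%m/%d/%H") + "/"
    source = service.get_container_client("source")
    for blob in source.list_blobs(name_starts_with=prefix):
        # Server-side copy within the same storage account; the shared-key
        # credential on the request also authorizes reading the source.
        src_url = source.get_blob_client(blob.name).url
        service.get_blob_client("destination", blob.name) \
               .start_copy_from_url(src_url)
        print(f"copied {blob.name}")

copy_window(datetime(2020, 3, 17, 3, tzinfo=timezone.utc))
```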

  5. On the Destination data store page, select AzureBlobStorage, which is the same storage account as the data source store, and then click Next.

    Screenshot: Destination data store page

  6. On the Choose the output file or folder page, do the following steps:

    a. Browse and select the destination folder, then click Choose.

    b. Write the dynamic folder path as destination/{year}/{month}/{day}/{hour}/, and change the format as follows:

    Screenshot shows the Choose the output file or folder dialog box.

    c. Click Next.

    Screenshot shows the Choose the output file or folder dialog box with Next selected.

  7. On the Settings page, select Next.

  8. On the Summary page, review the settings, and then select Next.

    Screenshot: Summary page

  9. On the Deployment page, select Monitor to monitor the pipeline (task).

    Screenshot: Deployment page

  10. Notice that the Monitor tab on the left is automatically selected. You need to wait for the pipeline run, which is triggered automatically (after about one hour). When it runs, click the pipeline name link DeltaCopyFromBlobPipeline to view activity run details or to rerun the pipeline. Select Refresh to refresh the list.

    Screenshot shows the Pipeline Runs pane.

  11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. Adjust the column widths of the Source and Destination columns (if necessary) to display more details. You can see that the source file (file1.txt) has been copied from source/2020/03/17/03/ to destination/2020/03/17/03/ with the same file name.

    Screenshot shows pipeline run details.

    You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files.

    Screenshot shows pipeline run details for the destination.

  12. Create another empty text file and name it file2.txt. Upload the file2.txt file to the folder path source/2020/03/17/04 in your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.

    Note

    A new folder path has to be created. Adjust the folder name to your current UTC time. For example, if the current UTC time is 4:20 AM on March 17, 2020, create the folder path source/2020/03/17/04/ by following the rule {Year}/{Month}/{Day}/{Hour}/.

  13. To go back to the Pipeline Runs view, select All pipeline runs, and wait for the same pipeline to be triggered again automatically after another hour.

    Screenshot shows the All pipeline runs link for returning to that page.

  14. Select the new DeltaCopyFromBlobPipeline link for the second pipeline run when it appears, and do the same to review the details. You will see that the source file (file2.txt) has been copied from source/2020/03/17/04/ to destination/2020/03/17/04/ with the same file name. You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files in the destination container, or with the short listing sketch after this list.
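
If you'd rather verify from code than from Storage Explorer, the short listing sketch below enumerates what landed in the destination container. It reuses the azure-storage-blob setup and the AZURE_STORAGE_CONNECTION_STRING assumption from the earlier sketches:

```python
# Minimal verification sketch: list the blobs in the destination container.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])

# Expect one blob per triggered window, for example
# 2020/03/17/03/file1.txt and 2020/03/17/04/file2.txt.
for blob in service.get_container_client("destination").list_blobs():
    print(blob.name, blob.size)
```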

Next steps

Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure: