Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool

APPLIES TO: Azure Data Factory; Azure Synapse Analytics (Preview)

In this tutorial, you'll use the Azure portal to create a data factory. You'll then use the Copy Data tool to create a pipeline that incrementally copies only new and changed files from Azure Blob storage to Azure Blob storage. The pipeline uses LastModifiedDate to determine which files to copy.

After you complete the steps here, Azure Data Factory scans all the files in the source store, applies the LastModifiedDate file filter, and copies to the destination store only the files that are new or have been updated since the last run. Note that if Data Factory scans a large number of files, you should still expect long durations: file scanning is time consuming even when the amount of data copied is reduced.
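
To make that mechanism concrete, here is a minimal sketch of the filtering logic, written with the azure-storage-blob Python SDK. The connection string and window boundaries are placeholder assumptions; the actual pipeline derives its window from the trigger rather than from hard-coded values.

```python
# Minimal sketch of the LastModifiedDate filter, assuming a placeholder
# connection string and a hard-coded 15-minute window. In the real pipeline,
# Data Factory derives the window from the tumbling window trigger.
from datetime import datetime, timezone

from azure.storage.blob import ContainerClient

conn_str = "<your-storage-connection-string>"  # placeholder assumption
window_start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)  # example window
window_end = datetime(2024, 1, 1, 0, 15, tzinfo=timezone.utc)

source = ContainerClient.from_connection_string(conn_str, container_name="source")

# Every blob is scanned, but only blobs modified inside the window are kept.
# This is why scanning a large container still takes a long time even when
# few files actually need to be copied.
files_to_copy = [
    blob.name
    for blob in source.list_blobs()
    if window_start <= blob.last_modified < window_end
]
print(files_to_copy)
```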

Note

If you're new to Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you'll complete these tasks:

  • Create a data factory.
  • Use the Copy Data tool to create a pipeline.
  • Monitor the pipeline and activity runs.

Prerequisites

  • Azure subscription: If you don't have an Azure subscription, create a 1 RMB trial account before you begin.
  • Azure Storage account: You use Blob storage as the source and sink data stores. If you don't have an Azure Storage account, follow the instructions in Create a storage account.

Create two containers in Blob storage

Prepare your Blob storage for the tutorial by completing these steps:

  1. Create a container named source. You can use various tools, such as Azure Storage Explorer, to perform this task.

  2. Create a container named destination.
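
If you'd rather script this preparation, the following minimal sketch creates both containers with the azure-storage-blob Python SDK; the connection string is an assumed placeholder for your account's value.

```python
# Minimal sketch: create the two tutorial containers programmatically,
# assuming a placeholder connection string for your storage account.
from azure.storage.blob import BlobServiceClient

conn_str = "<your-storage-connection-string>"  # placeholder assumption
service = BlobServiceClient.from_connection_string(conn_str)

for name in ("source", "destination"):
    service.create_container(name)  # raises ResourceExistsError if it already exists
```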

Create a data factory

  1. In the left pane, select Create a resource. Then select Analytics > Data Factory:

    Select "Data Factory"

  2. On the New data factory page, under Name, enter ADFTutorialDataFactory.

    The name of your data factory must be globally unique. You might receive this error message:

    "Name not available" error message

    If you receive an error message about the name value, enter a different name for the data factory. For example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory artifacts, see Data Factory naming rules.

  3. Under Subscription, select the Azure subscription in which you want to create the new data factory.

  4. Under Resource Group, take one of these steps:

    • Select Use existing, and then select an existing resource group in the list.

    • Select Create new, and then enter a name for the resource group.

    To learn about resource groups, see Use resource groups to manage your Azure resources.

  5. Under Version, select V2.

  6. Under Location, select the location for the data factory. Only supported locations appear in the list. The data stores (for example, Azure Storage and Azure SQL Database) and computes (for example, Azure HDInsight) that your data factory uses can be in other locations and regions.

  7. Select Create.

  8. After the data factory is created, the data factory home page appears. (A programmatic alternative to these portal steps is sketched after this procedure.)

  9. To open the Azure Data Factory user interface (UI) on a separate tab, select the Author & Monitor tile:

    Data factory home page
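
For reference, the same factory can also be created programmatically. The sketch below uses the azure-mgmt-datafactory Python SDK; the subscription ID, resource group, and location are assumptions you'd replace with your own values, and this step isn't required for the rest of the tutorial.

```python
# Hedged sketch of creating the factory with the azure-mgmt-datafactory SDK,
# assuming placeholder subscription, resource group, and location values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"  # placeholder assumption
resource_group = "<your-resource-group>"    # placeholder assumption

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
factory = client.factories.create_or_update(
    resource_group,
    "yournameADFTutorialDataFactory",  # must be globally unique
    Factory(location="eastus"),        # pick a supported location
)
print(factory.provisioning_state)
```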

Use the Copy Data tool to create a pipeline

  1. On the Let's get started page, select the Copy Data tile to open the Copy Data tool:

    Copy Data tile

  2. On the Properties page, take the following steps:

    a. Under Task name, enter DeltaCopyFromBlobPipeline.

    b. Under Task cadence or Task schedule, select Run regularly on schedule.

    c. Under Trigger type, select Tumbling window.

    d. Under Recurrence, enter 15 Minute(s).

    e. Select Next.

    Data Factory creates a pipeline with the specified task name.

    Copy data - Properties page

  3. On the Source data store page, complete these steps:

    a. Select Create new connection to add a connection.

    b. Select Azure Blob Storage from the gallery, and then select Continue:

    Select Azure Blob Storage

    c. On the New Linked Service (Azure Blob Storage) page, select your storage account from the Storage account name list. Test the connection, and then select Create.

    d. Select the new linked service, and then select Next:

    Select the new linked service

  4. On the Choose the input file or folder page, complete the following steps:

    a. Browse for and select the source folder, and then select Choose.

    Choose the input file or folder

    b. Under File loading behavior, select Incremental load: LastModifiedDate.

    c. Select Binary copy, and then select Next:

    Choose the input file or folder page

  5. On the Destination data store page, select the AzureBlobStorage service that you created. It's the same storage account as the source data store. Then select Next.

  6. On the Choose the output file or folder page, complete the following steps:

    a. Browse for and select the destination folder, and then select Choose:

    Choose the output file or folder page

    b. Select Next.

  7. On the Settings page, select Next.

  8. On the Summary page, review the settings, and then select Next.

    Summary page

  9. On the Deployment page, select Monitor to monitor the pipeline (task).

    Deployment page

  10. Notice that the Monitor tab on the left is automatically selected; the application switches to that tab, where you can see the status of the pipeline. Select Refresh to refresh the list. To view activity run details or to run the pipeline again, select the link under PIPELINE NAME.

    Refresh the list and view activity run details

  11. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the copy operation, select the Details link (the eyeglasses icon) in the ACTIVITY NAME column. For details about the properties, see Copy activity overview.

    Copy activity in the pipeline

    Because there are no files in the source container of your Blob storage account, you won't see any files copied to the destination container in the account:

    No files in the source or destination container

  12. Create an empty text file and name it file1.txt. Upload this text file to the source container in your storage account. You can use various tools, such as Azure Storage Explorer, to perform these tasks.

    Create file1.txt and upload it to the source container

  13. To go back to the Pipeline runs view, select All pipeline runs, and wait for the same pipeline to be triggered again automatically.

  14. When the second pipeline run completes, follow the same steps as before to review the activity run details.

    You'll see that one file (file1.txt) has been copied from the source container to the destination container of your Blob storage account:

    file1.txt copied from the source container to the destination container

  15. Create another empty text file and name it file2.txt. Upload this text file to the source container in your Blob storage account.

  16. Repeat steps 13 and 14 for the second text file. You'll see that only the new file (file2.txt) was copied from the source container to the destination container of your storage account during this pipeline run.

    You can also verify that only one file was copied by using Azure Storage Explorer to scan the files (a scripted equivalent is sketched after these steps):

    Scan files by using Azure Storage Explorer
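
As a scripted alternative to Storage Explorer, this minimal sketch lists the destination container with the azure-storage-blob Python SDK and prints each blob's last-modified time; the connection string is an assumed placeholder.

```python
# Minimal verification sketch, assuming a placeholder connection string:
# list the destination container and print each blob's LastModifiedDate.
from azure.storage.blob import ContainerClient

conn_str = "<your-storage-connection-string>"  # placeholder assumption
destination = ContainerClient.from_connection_string(
    conn_str, container_name="destination"
)

for blob in destination.list_blobs():
    print(blob.name, blob.last_modified)  # expect file1.txt and file2.txt
```

After step 16, the output should show both file1.txt and file2.txt, each copied by the pipeline run whose trigger window contained its upload time.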

Next steps

Go to the following tutorial to learn how to transform data by using an Apache Spark cluster on Azure: