Load data into Azure Data Lake Storage Gen2 with Azure Data Factory

APPLIES TO: ✔️ Azure Data Factory ✖️ Azure Synapse Analytics (Preview)

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built into Azure Blob storage. It allows you to interact with your data using both file system and object storage paradigms.

Azure Data Factory (ADF) is a fully managed, cloud-based data integration service. You can use the service to populate the lake with data from a rich set of on-premises and cloud-based data stores and save time when building your analytics solutions. For a detailed list of supported connectors, see the table of Supported data stores.

Azure Data Factory offers a scale-out, managed data movement solution. Due to the scale-out architecture of ADF, it can ingest data at a high throughput. For details, see Copy activity performance.

This article shows you how to use the Data Factory Copy Data tool to load data from the Amazon Web Services S3 service into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data stores.


Prerequisites

  • Azure subscription: If you don't have an Azure subscription, create a 1rmb trial account before you begin.
  • Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a storage account, create an account.
  • AWS account with an S3 bucket that contains data: This article shows how to copy data from Amazon S3. You can use other data stores by following similar steps.
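Note that the two stores in this walkthrough are addressed through different endpoint formats: an ADLS Gen2 account exposes a DFS endpoint (`dfs.core.windows.net`) alongside the regular blob endpoint, while S3 buckets use virtual-hosted-style URLs. A minimal sketch of the URL shapes involved (account, bucket, and region names below are placeholders, not real resources):

```python
# Endpoint URL formats for the two data stores in this walkthrough.
# All names are illustrative placeholders.

def adls_gen2_endpoint(account_name: str) -> str:
    """ADLS Gen2 exposes a DFS endpoint alongside the blob endpoint."""
    return f"https://{account_name}.dfs.core.windows.net"

def s3_bucket_url(bucket_name: str, region: str) -> str:
    """Virtual-hosted-style URL for an S3 bucket."""
    return f"https://{bucket_name}.s3.{region}.amazonaws.com"

print(adls_gen2_endpoint("mystorageaccount"))
print(s3_bucket_url("my-source-bucket", "us-east-1"))
```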

Create a data factory

  1. On the left menu, select Create a resource > Data + Analytics > Data Factory:


  2. On the New data factory page, provide values for the following fields:

    • Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory name YourDataFactoryName is not available", enter a different name for the data factory. For example, you could use the name yournameADFTutorialDataFactory. Then try creating the data factory again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
    • Subscription: Select the Azure subscription in which to create the data factory.
    • Resource Group: Select an existing resource group from the drop-down list, or select the Create new option and enter the name of a resource group. To learn about resource groups, see Using resource groups to manage your Azure resources.
    • Version: Select V2.
    • Location: Select the location for the data factory. Only supported locations are displayed in the drop-down list. The data stores used by the data factory can be in other locations and regions.
  3. Select Create.

  4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the following image:


    Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.
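Behind the portal, creating a V2 factory is a single ARM PUT that only needs a region in its body. A hedged sketch of the request the Create button issues (the subscription ID, resource group, and factory name are placeholders; the API version shown is the Data Factory V2 management version):

```python
# Sketch of the ARM REST call behind the portal's Create button.
# Subscription ID, resource group, and factory name are placeholders.

API_VERSION = "2018-06-01"  # Data Factory V2 management API version

def factory_put_request(subscription_id: str, resource_group: str,
                        factory_name: str, location: str):
    """Return the (url, body) pair for a PUT that creates a V2 data factory."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory_name}"
        f"?api-version={API_VERSION}"
    )
    body = {"location": location}  # the factory resource itself only needs a region
    return url, body

url, body = factory_put_request("00000000-0000-0000-0000-000000000000",
                                "myResourceGroup",
                                "yournameADFTutorialDataFactory",
                                "East US")
```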

Load data into Azure Data Lake Storage Gen2

  1. On the Get started page, select the Copy Data tile to launch the Copy Data tool.

  2. On the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select Next.


  3. On the Source data store page, click + Create new connection. Select Amazon S3 from the connector gallery, and select Continue.

    Source data store, Amazon S3 page

  4. On the New linked service (Amazon S3) page, do the following steps:

    1. Specify the Access Key ID value.

    2. Specify the Secret Access Key value.

    3. Click Test connection to validate the settings, then select Create.

      Specify the Amazon S3 account

    4. You will see that a new AmazonS3 connection is created. Select Next.
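Behind the UI, the connection is saved as a linked service JSON definition. A hedged sketch of the shape the Copy Data tool generates for an Amazon S3 connection (the linked service name and key values are placeholders; in practice the secret is stored as a SecureString or as an Azure Key Vault reference, never in plain text):

```python
# Illustrative sketch of the Amazon S3 linked service JSON.
# Name and credential values are placeholders.

def amazon_s3_linked_service(name: str, access_key_id: str) -> dict:
    """Build a linked service definition like the one the Copy Data tool saves."""
    return {
        "name": name,
        "properties": {
            "type": "AmazonS3",
            "typeProperties": {
                "accessKeyId": access_key_id,
                # Stored as a SecureString; shown as a placeholder here.
                "secretAccessKey": {"type": "SecureString", "value": "<secret>"},
            },
        },
    }

ls = amazon_s3_linked_service("AmazonS3LinkedService", "<access key id>")
```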

  5. On the Choose the input file or folder page, browse to the folder and file that you want to copy. Select the folder/file, and then select Choose.


  6. Specify the copy behavior by checking the Recursively and Binary copy options. Select Next.


  7. On the Destination data store page, click + Create new connection, select Azure Data Lake Storage Gen2, and select Continue.


  8. On the New linked service (Azure Data Lake Storage Gen2) page, do the following steps:

    1. Select your Data Lake Storage Gen2 capable account from the Storage account name drop-down list.

    2. Select Create to create the connection. Then select Next.

      Specify the Azure Data Lake Storage Gen2 account
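As with the S3 side, this connection is stored as a linked service. In ADF JSON the Gen2 connector type is AzureBlobFS, and the endpoint uses the DFS URL rather than the blob URL. A hedged sketch with placeholder names, assuming account-key authentication (the key would normally be a SecureString or a Key Vault reference):

```python
# Illustrative sketch of the ADLS Gen2 linked service JSON.
# Account name and key are placeholders; assumes account-key auth.

def adls_gen2_linked_service(name: str, account_name: str) -> dict:
    """Build a linked service definition for an ADLS Gen2 account."""
    return {
        "name": name,
        "properties": {
            "type": "AzureBlobFS",  # the ADLS Gen2 connector type in ADF
            "typeProperties": {
                "url": f"https://{account_name}.dfs.core.windows.net",
                "accountKey": {"type": "SecureString", "value": "<account key>"},
            },
        },
    }

ls = adls_gen2_linked_service("ADLSGen2LinkedService", "mystorageaccount")
```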

  9. On the Choose the output file or folder page, enter copyfroms3 as the output folder name, and select Next. ADF creates the corresponding ADLS Gen2 file system and subfolders during the copy if they don't exist.


  10. On the Settings page, select Next to use the default settings.


  11. On the Summary page, review the settings, and select Next.


  12. On the Deployment page, select Monitor to monitor the pipeline (task).

  13. When the pipeline run completes successfully, you see a pipeline run that is triggered by a manual trigger. You can use the links under the PIPELINE NAME column to view activity details and to rerun the pipeline.


  14. To see the activity runs associated with the pipeline run, select the CopyFromAmazonS3ToADLS link under the PIPELINE NAME column. For details about the copy operation, select the Details link (eyeglasses icon) under the ACTIVITY NAME column. You can monitor details such as the volume of data copied from the source to the sink, the data throughput, the execution steps with their durations, and the configuration used.
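The same run details that the Monitor view shows can also be fetched from the monitoring API, whose pipeline-run response carries fields such as runId, pipelineName, status, and durationInMs. A small illustrative helper that renders such a payload as a one-line summary (the example payload below is fabricated for demonstration, not a real run):

```python
# Illustrative helper over a pipeline-run payload. Field names follow
# the Data Factory pipeline-run response; the example values are made up.

def summarize_run(run: dict) -> str:
    """Render a one-line summary like the Monitor view's run list."""
    seconds = run.get("durationInMs", 0) / 1000
    return f"{run['pipelineName']} [{run['runId']}]: {run['status']} in {seconds:.1f}s"

example = {
    "runId": "0000-aaaa",  # placeholder run ID
    "pipelineName": "CopyFromAmazonS3ToADLS",
    "status": "Succeeded",
    "durationInMs": 83000,
}
print(summarize_run(example))
```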



  15. To refresh the view, select Refresh. Select All pipeline runs at the top to go back to the Pipeline Runs view.

  16. Verify that the data is copied into your Data Lake Storage Gen2 account.

Next steps