Load data into Azure Data Lake Storage Gen2 with Azure Data Factory

APPLIES TO: ✔️ Azure Data Factory ✖️ Azure Synapse Analytics (Preview)

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built into Azure Blob storage. It allows you to interface with your data using both file system and object storage paradigms.

Azure Data Factory (ADF) is a fully managed, cloud-based data integration service. You can use the service to populate the lake with data from a rich set of on-premises and cloud-based data stores, and save time when building your analytics solutions. For a detailed list of supported connectors, see the table of Supported data stores.

Azure Data Factory offers a scale-out, managed data movement solution. Thanks to its scale-out architecture, ADF can ingest data at high throughput. For details, see Copy activity performance.

This article shows you how to use the Data Factory Copy Data tool to load data from Amazon Web Services S3 into Azure Data Lake Storage Gen2. You can follow similar steps to copy data from other types of data stores.

Prerequisites

  • Azure subscription: If you don't have an Azure subscription, create a free trial account before you begin.
  • Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a Storage account, create an account.
  • AWS account with an S3 bucket that contains data: This article shows how to copy data from Amazon S3. You can use other data stores by following similar steps.

Create a data factory

  1. On the left menu, select Create a resource > Data + Analytics > Data Factory:

    Data Factory selection in the New pane

  2. In the New data factory page, provide values for the fields shown in the following image:

    New data factory page

    • Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory name "LoadADLSDemo" is not available," enter a different name for the data factory. For example, you could use the name yournameADFTutorialDataFactory. Then try creating the data factory again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
    • Subscription: Select the Azure subscription in which to create the data factory.
    • Resource Group: Select an existing resource group from the drop-down list, or select the Create new option and enter the name of a new resource group. To learn about resource groups, see Using resource groups to manage your Azure resources.
    • Version: Select V2.
    • Location: Select the location for the data factory. Only supported locations are displayed in the drop-down list. The data stores used by the data factory can be in other locations and regions.
  3. Select Create.

  4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the following image:

    Data Factory home page

    Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.
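The name constraints mentioned in step 2 can be checked up front before you submit the form. The following is a minimal sketch, assuming the commonly documented rules for data factory names (3-63 characters; letters, numbers, and hyphens only; must start and end with a letter or number). `is_valid_factory_name` is a hypothetical helper for illustration, not part of any Azure SDK, and it cannot check global uniqueness — only Azure can do that.

```python
import re

# Assumed naming rules (not an official API): 3-63 characters,
# letters/numbers/hyphens only, first and last characters alphanumeric.
# Global uniqueness across Azure can only be verified at creation time.
_NAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]{1,61}[A-Za-z0-9]$")

def is_valid_factory_name(name: str) -> bool:
    """Return True if the name satisfies the assumed character rules."""
    return bool(_NAME_PATTERN.match(name))

print(is_valid_factory_name("yournameADFTutorialDataFactory"))  # True
print(is_valid_factory_name("Load ADLS Demo"))  # False: spaces not allowed
```

A name that fails this check would be rejected by the portal as well, so a quick local check saves a round trip.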

Load data into Azure Data Lake Storage Gen2

  1. In the Get started page, select the Copy Data tile to launch the Copy Data tool:

    Copy Data tool tile

  2. In the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select Next:

    Properties page

  3. In the Source data store page, select + Create new connection:

    Source data store page

    Select Amazon S3 from the connector gallery, and select Continue.

    Source data store S3 page

  4. In the Specify Amazon S3 connection page, do the following steps:

    1. Specify the Access Key ID value.

    2. Specify the Secret Access Key value.

    3. Select Test connection to validate the settings, and then select Finish.

    4. The newly created connection is displayed. Select Next.

      Specify Amazon S3 account

  5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over. Select the folder or file, and then select Choose:

    Choose the input file or folder

  6. Specify the copy behavior by checking the Copy files recursively and Binary copy options. Select Next:

    Specify copy behavior

  7. In the Destination data store page, select + Create new connection, select Azure Data Lake Storage Gen2, and then select Continue:

    Destination data store page

  8. In the Specify Azure Data Lake Storage connection page, do the following steps:

    1. Select your Data Lake Storage Gen2 capable account from the Storage account name drop-down list.
    2. Select Finish to create the connection. Then select Next.

    Specify Azure Data Lake Storage Gen2 account

  9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and select Next. ADF creates the corresponding ADLS Gen2 file system and subfolders during the copy if they don't exist.

    Specify the output folder

  10. In the Settings page, select Next to use the default settings:

    Settings page

  11. In the Summary page, review the settings, and select Next:

    Summary page

  12. In the Deployment page, select Monitor to monitor the pipeline:

    Deployment page

  13. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view activity run details and to rerun the pipeline:

    Monitor pipeline runs

  14. To view the activity runs associated with the pipeline run, select the View Activity Runs link in the Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the list.

    Monitor activity runs

  15. To monitor the execution details for each copy activity, select the Details link (eyeglasses icon) under Actions in the activity monitoring view. You can monitor details like the volume of data copied from the source to the sink, data throughput, execution steps with corresponding durations, and the configurations used:

    Monitor activity run details

  16. Verify that the data is copied into your Data Lake Storage Gen2 account.
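Behind the scenes, the Copy Data tool authors JSON artifacts for you: two linked services, datasets, and a pipeline with a single copy activity. The following sketch shows roughly what a binary S3-to-ADLS Gen2 copy looks like as JSON, assuming the type names used by the ADF REST API for binary copies (AmazonS3, AzureBlobFS, BinarySource, BinarySink); all account names, keys, and URLs are placeholders, and the dataset definitions that the activity would reference are omitted for brevity.

```python
# Sketch of the JSON definitions the Copy Data tool generates (an
# approximation based on ADF REST API type names; placeholders throughout).

s3_linked_service = {
    "name": "AmazonS3LinkedService",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "accessKeyId": "<access-key-id>",
            "secretAccessKey": {"type": "SecureString", "value": "<secret-access-key>"},
        },
    },
}

adls_linked_service = {
    "name": "ADLSGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",  # type name for ADLS Gen2 in ADF
        "typeProperties": {
            "url": "https://<account>.dfs.core.windows.net",
            "accountKey": {"type": "SecureString", "value": "<account-key>"},
        },
    },
}

copy_activity = {
    "name": "CopyFromAmazonS3ToADLS",
    "type": "Copy",
    "typeProperties": {
        # "Binary copy" + "Copy files recursively" from step 6:
        "source": {
            "type": "BinarySource",
            "storeSettings": {"type": "AmazonS3ReadSettings", "recursive": True},
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": {"type": "AzureBlobFSWriteSettings"},
        },
    },
}

pipeline = {
    "name": "CopyFromAmazonS3ToADLS",
    "properties": {"activities": [copy_activity]},
}
```

Seeing the generated JSON this way can help when you later move from the wizard to authoring pipelines directly or deploying them from source control.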

Next steps