Load data from Office 365 by using Azure Data Factory

This article shows you how to use Data Factory to load data from Office 365 into Azure Blob storage. You can follow similar steps to copy data to Azure Data Lake Gen2. Refer to the Office 365 connector article for general information about copying data from Office 365.

Create a data factory

  1. On the left menu, select Create a resource > Data + Analytics > Data Factory:

    Select Data Factory in the New pane

  2. On the New data factory page, provide values for the fields that are shown in the following image:

    The New data factory page

    • Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory name LoadFromOffice365Demo is not available", enter a different name for the data factory. For example, you could use the name yournameLoadFromOffice365Demo. Then try creating the data factory again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
    • Subscription: Select the Azure subscription in which to create the data factory.
    • Resource Group: Select an existing resource group from the drop-down list, or select the Create new option and enter the name of a new resource group. To learn about resource groups, see Using resource groups to manage your Azure resources.
    • Version: Select V2.
    • Location: Select the location for the data factory. Only supported locations are displayed in the drop-down list. The data stores used by the data factory can be in other locations and regions. These data stores include Azure Storage, Azure SQL Database, and so on.
  3. Select Create.

  4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the following image:

    The Data Factory home page

  5. Select the Author & Monitor tile to launch the Data Integration Application in a separate tab.

Create a pipeline

  1. On the "Let's get started" page, select Create pipeline.

    Create pipeline

  2. In the General tab for the pipeline, enter "CopyPipeline" for the Name of the pipeline.

  3. In the Activities tool box, expand the Move & Transform category and drag the Copy activity from the tool box to the pipeline designer surface. Specify "CopyFromOffice365ToBlob" as the activity name, as sketched in the JSON after this list.
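
At this point the pipeline is just an empty shell around the named Copy activity. As a rough, illustrative sketch (not the exact output of the authoring UI), the underlying JSON looks something like the following, with inputs, outputs, and type properties still to be filled in by the steps that follow:

```json
{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromOffice365ToBlob",
                "type": "Copy",
                "inputs": [],
                "outputs": [],
                "typeProperties": {}
            }
        ]
    }
}
```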

Configure source

  1. Go to the pipeline > Source tab, and click + New to create a source dataset.

  2. In the New Dataset window, select Office 365, and then select Continue.

  3. You are now in the copy activity configuration tab. Click the Edit button next to the Office 365 dataset to continue the data configuration.

    Configure Office 365 dataset - General

  4. You see a new tab opened for the Office 365 dataset. In the General tab at the bottom of the Properties window, enter "SourceOffice365Dataset" for Name.

  5. Go to the Connection tab of the Properties window. Next to the Linked service text box, click + New.

  6. In the New Linked Service window, enter "Office365LinkedService" as the name, enter the service principal ID and service principal key, then test the connection and select Create to deploy the linked service (see the JSON sketch after this list).

    New Office 365 linked service

  7. After the linked service is created, you are back in the dataset settings. Next to Table, choose the down arrow to expand the list of available Office 365 datasets, and choose "BasicDataSet_v0.Message_v0" from the drop-down list:

    Configure Office 365 dataset - Table

  8. Now go back to the pipeline > Source tab to continue configuring additional properties for Office 365 data extraction. User scope and user scope filter are optional predicates that you can define to restrict the data you want to extract out of Office 365. See the Office 365 dataset properties section for how to configure these settings.

  9. You are required to choose one of the date filters and provide the start time and end time values.

  10. Click the Import Schema tab to import the schema for the Message dataset.

    Configure Office 365 dataset - Schema
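
For orientation, here is a minimal sketch of the JSON behind the source-side objects created in steps 4 through 7. The property names follow the documented Office 365 connector schema, the angle-bracket values are placeholders you supply, and the snippet is illustrative rather than the exact code the UI generates:

```json
{
    "name": "Office365LinkedService",
    "properties": {
        "type": "Office365",
        "typeProperties": {
            "office365TenantId": "<Office 365 tenant id>",
            "servicePrincipalTenantId": "<Azure AD tenant id>",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            }
        }
    }
}
```

The source dataset then points at the chosen table through this linked service:

```json
{
    "name": "SourceOffice365Dataset",
    "properties": {
        "type": "Office365Table",
        "linkedServiceName": {
            "referenceName": "Office365LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "BasicDataSet_v0.Message_v0"
        }
    }
}
```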

Configure sink

  1. Go to the pipeline > Sink tab, and select + New to create a sink dataset.

  2. In the New Dataset window, notice that only the supported destinations are shown when copying from Office 365. Select Azure Blob Storage, select Binary format, and then select Continue. In this tutorial, you copy Office 365 data into Azure Blob storage.

  3. Click the Edit button next to the Azure Blob Storage dataset to continue the data configuration.

  4. On the General tab of the Properties window, in Name, enter "OutputBlobDataset".

  5. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New.

  6. In the New Linked Service window, enter "AzureStorageLinkedService" as the name, select Service Principal from the drop-down list of authentication methods, fill in the Service Endpoint, Tenant, Service principal ID, and Service principal key, then select Save to deploy the linked service (a JSON sketch follows this list). Refer here for how to set up service principal authentication for Azure Blob Storage.

    New Blob linked service
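
As on the source side, here is a hedged sketch of the sink-side JSON, assuming the names used in this walkthrough. The container and folder path are placeholders of your choosing, not values the tutorial prescribes:

```json
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "serviceEndpoint": "https://<account name>.blob.core.windows.net/",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant id>"
        }
    }
}
```

The Binary-format sink dataset references this linked service and a blob location:

```json
{
    "name": "OutputBlobDataset",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "<container name>",
                "folderPath": "<folder path>"
            }
        }
    }
}
```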

Validate the pipeline

To validate the pipeline, select Validate from the toolbar.

You can also see the JSON code associated with the pipeline by clicking Code on the upper right.
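
With both datasets wired in, the Code view shows the complete pipeline. The following is an approximation, assuming the names used in this walkthrough and a date filter on CreatedDateTime; your filter column and time window will differ:

```json
{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromOffice365ToBlob",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "SourceOffice365Dataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "OutputBlobDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": {
                        "type": "Office365Source",
                        "dateFilterColumn": "CreatedDateTime",
                        "startTime": "2021-01-01T00:00:00Z",
                        "endTime": "2021-01-08T00:00:00Z"
                    },
                    "sink": {
                        "type": "BinarySink"
                    }
                }
            }
        ]
    }
}
```

The sink type shown assumes the Binary-format dataset created earlier. Office365Source also accepts optional properties (for example, userScopeFilterUri for the user-scope predicates mentioned in the source configuration), as described in the Office 365 connector article.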

Publish the pipeline

In the top toolbar, select Publish All. This action publishes the entities (datasets and pipelines) you created to Data Factory.

Publish changes

Trigger the pipeline manually

Select Add Trigger on the toolbar, and then select Trigger Now. On the Pipeline Run page, select Finish.

Monitor the pipeline

Go to the Monitor tab on the left. You see a pipeline run that was triggered by a manual trigger. You can use the links in the Actions column to view activity details and to rerun the pipeline.

Monitor pipeline

To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions column. In this example, there is only one activity, so you see only one entry in the list. For details about the copy operation, select the Details link (eyeglasses icon) in the Actions column.

Monitor activity

If this is the first time you are requesting data for this context (a combination of which data table is being accessed, which destination account the data is being loaded into, and which user identity is making the data access request), you will see the copy activity status as In Progress, and only when you click the Details link under Actions will you see the status as RequestingConsent. A member of the data access approver group needs to approve the request in Privileged Access Management before the data extraction can proceed.

Status as requesting consent: Activity execution details - request consent

Status as extracting data:

Activity execution details - extract data

Once consent is provided, data extraction continues, and after some time the pipeline run shows as Succeeded.

Monitor pipeline - succeeded

Now go to the destination Azure Blob storage and verify that Office 365 data has been extracted in Binary format.

Next steps

Advance to the following article to learn about Azure SQL Data Warehouse support: