Quickstart: Create a data factory by using the Azure Data Factory UI

APPLIES TO: ✔ Azure Data Factory ✘ Azure Synapse Analytics (Preview)

This quickstart describes how to use the Azure Data Factory UI to create and monitor a data factory. The pipeline that you create in this data factory copies data from one folder to another folder in Azure Blob storage. For a tutorial on how to transform data by using Azure Data Factory, see Tutorial: Transform data by using Spark.

Prerequisites

Azure subscription

If you don't have an Azure subscription, create a 1-RMB trial account before you begin.

Azure roles

To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the Contributor or Owner role, or an administrator of the Azure subscription. To view the permissions that you have in the subscription, go to the Azure portal, select your username in the upper-right corner, select the ... icon for more options, and then select My permissions. If you have access to multiple subscriptions, select the appropriate subscription.

To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers, and integration runtimes - the following requirements apply:

  • To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above.
  • To create and manage child resources with PowerShell or the SDK, the Contributor role at the resource level or above is sufficient.

For sample instructions about how to add a user to a role, see the Add roles article.
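If you prefer to verify or grant these permissions from a shell, the Az PowerShell module can do both. The following is a minimal sketch rather than part of the procedure; the sign-in name, the AzureChinaCloud environment (inferred from the portal endpoints this article uses), and the ADFTutorialResourceGroup resource group are placeholder assumptions.

# Sign in. This article targets the Azure China cloud; omit -Environment for global Azure.
Connect-AzAccount -Environment AzureChinaCloud

# List the role assignments for a user (replace the sign-in name with your own).
Get-AzRoleAssignment -SignInName "user@contoso.partner.onmschina.cn" |
    Select-Object RoleDefinitionName, Scope

# Grant the Data Factory Contributor role at resource group scope. This requires
# Owner or User Access Administrator rights on the resource group.
New-AzRoleAssignment -SignInName "user@contoso.partner.onmschina.cn" `
    -RoleDefinitionName "Data Factory Contributor" `
    -ResourceGroupName "ADFTutorialResourceGroup"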


Azure Storage account

You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage account to create one.
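If you'd rather create the account from a shell than the portal, a minimal Az PowerShell sketch follows. The resource group name, storage account name, and region are placeholders; storage account names must be globally unique and use 3-24 lowercase letters and digits.

# Create a resource group and a general-purpose v2 storage account to use as both
# the source and the destination data store.
New-AzResourceGroup -Name "ADFTutorialResourceGroup" -Location "China East 2"

New-AzStorageAccount -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "adftutorialstorage01" `
    -Location "China East 2" `
    -SkuName Standard_LRS `
    -Kind StorageV2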

Get the storage account name

You will need the name of your Azure Storage account for this quickstart. The following procedure provides steps to get the name of your storage account:

  1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
  2. From the Azure portal menu, select All services, and then select Storage > Storage accounts. You can also search for and select Storage accounts from any page.
  3. On the Storage accounts page, filter for your storage account (if needed), and then select your storage account.

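Alternatively, if you're signed in with Az PowerShell, a one-line sketch lists the account names visible to you:

# List storage accounts in the current subscription with their resource groups.
Get-AzStorageAccount | Select-Object StorageAccountName, ResourceGroupName, Location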

Create a blob container

In this section, you create a blob container named adftutorial in Azure Blob storage.

  1. From the storage account page, select Overview > Containers.

  2. On the toolbar of the <Account name> - Containers page, select Container.

  3. In the New container dialog box, enter adftutorial for the name, and then select OK. The <Account name> - Containers page is updated to include adftutorial in the list of containers.

    (Screenshot: the list of containers)
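The same container can be created from Az PowerShell. A sketch, assuming the placeholder account and resource group names used earlier in this article:

# Build a storage context from the account key, then create the adftutorial container.
$key = (Get-AzStorageAccountKey -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "adftutorialstorage01")[0].Value
$ctx = New-AzStorageContext -StorageAccountName "adftutorialstorage01" `
    -StorageAccountKey $key
New-AzStorageContainer -Name "adftutorial" -Context $ctx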

Add an input folder and file for the blob container

In this section, you create a folder named input in the container you just created, and then upload a sample file to the input folder. Before you begin, open a text editor such as Notepad, and create a file named emp.txt with the following content:

John, Doe
Jane, Doe

Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to the Azure portal and follow these steps:

  1. On the <Account name> - Containers page where you left off, select adftutorial from the updated list of containers.

    1. If you closed the window or went to another page, sign in to the Azure portal again.
    2. From the Azure portal menu, select All services, and then select Storage > Storage accounts. You can also search for and select Storage accounts from any page.
    3. Select your storage account, and then select Containers > adftutorial.
  2. On the toolbar of the adftutorial container page, select Upload.

  3. On the Upload blob page, select the Files box, and then browse to and select the emp.txt file.

  4. Expand the Advanced heading. The page now looks as follows:

    (Screenshot: the expanded Advanced section)

  5. In the Upload to folder box, enter input.

  6. Select the Upload button. You should see the emp.txt file and the status of the upload in the list.

  7. Select the Close icon (an X) to close the Upload blob page.

Keep the adftutorial container page open. You use it to verify the output at the end of this quickstart.
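If you prefer to script the file creation and upload steps above, here is a hedged Az PowerShell equivalent. It reuses the $ctx storage context from the container sketch earlier; note that Blob storage has no real folders, so the input folder is simply a prefix on the blob name.

# Create the local sample file.
New-Item -ItemType Directory -Force -Path "C:\ADFv2QuickStartPSH" | Out-Null
Set-Content -Path "C:\ADFv2QuickStartPSH\emp.txt" -Value @("John, Doe", "Jane, Doe")

# Upload it as input/emp.txt in the adftutorial container.
Set-AzStorageBlobContent -File "C:\ADFv2QuickStartPSH\emp.txt" `
    -Container "adftutorial" `
    -Blob "input/emp.txt" `
    -Context $ctx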

Create a data factory

  1. Launch the Microsoft Edge or Google Chrome web browser. Currently, the Data Factory UI is supported only in the Microsoft Edge and Google Chrome web browsers.

  2. Go to the Azure portal.

  3. From the Azure portal menu, select Create a resource.

    (Screenshot: Create a resource on the Azure portal menu)

  4. Select Analytics, and then select Data Factory.

    (Screenshot: Data Factory selection in the New pane)

  5. On the New data factory page, enter ADFTutorialDataFactory for Name.

    The name of the Azure data factory must be globally unique. If you see the following error, change the name of the data factory (for example, to <yourname>ADFTutorialDataFactory) and try creating it again. For naming rules for Data Factory artifacts, see the Data Factory - naming rules article.

    (Screenshot: error when a name isn't available)

  6. For Subscription, select the Azure subscription in which you want to create the data factory.

  7. For Resource Group, take one of the following steps:

    • Select Use existing, and select an existing resource group from the list.
    • Select Create new, and enter the name of a resource group.

    To learn about resource groups, see Using resource groups to manage your Azure resources.

  8. For Version, select V2.

  9. For Location, select the location for the data factory.

    The list shows only the locations that Data Factory supports, and where your Azure Data Factory metadata will be stored. The associated data stores (like Azure Storage and Azure SQL Database) and computes (like Azure HDInsight) that Data Factory uses can run in other regions.

  10. Select Create. After the creation is complete, select Go to resource to navigate to the Data factory page.

  11. Select the Author & Monitor tile to start the Azure Data Factory user interface (UI) application on a separate tab.

    (Screenshot: home page for the data factory, with the Author & Monitor tile)

    Note

    If you see that the web browser is stuck at "Authorizing", clear the Block third-party cookies and site data check box. Or keep it selected, create an exception for login.partner.microsoftonline.cn, and then try to open the app again.

  12. On the Let's get started page, switch to the Author tab in the left panel.

    (Screenshot: the Let's get started page)
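The portal steps above have a scripted counterpart in the Az.DataFactory module. A minimal sketch, with the same placeholder resource group and a name you must make globally unique:

# Create the V2 data factory and keep the returned object for later steps.
$df = Set-AzDataFactoryV2 -ResourceGroupName "ADFTutorialResourceGroup" `
    -Location "China East 2" `
    -Name "<yourname>ADFTutorialDataFactory"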

Create a linked service

In this procedure, you create a linked service to link your Azure Storage account to the data factory. The linked service has the connection information that the Data Factory service uses at runtime to connect to it.

  1. Select Connections, and then select the New button on the toolbar. (The Connections button is located at the bottom of the left column, under Factory Resources.)

  2. On the New Linked Service page, select Azure Blob Storage, and then select Continue.

  3. On the New Linked Service (Azure Blob Storage) page, complete the following steps:

    a. For Name, enter AzureStorageLinkedService.

    b. For Storage account name, select the name of your Azure Storage account.

    c. Select Test connection to confirm that the Data Factory service can connect to the storage account.

    d. Select Create to save the linked service.

    (Screenshot: the New linked service page)
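The UI steps above amount to deploying a linked service definition. A sketch using a JSON definition file and the $df object from the data factory sketch; the connection string placeholders must be replaced with your account name and key, and the EndpointSuffix shown is an assumption based on the Azure China endpoints this article uses.

@'
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.chinacloudapi.cn"
        }
    }
}
'@ | Set-Content -Path "C:\ADFv2QuickStartPSH\AzureStorageLinkedService.json"

Set-AzDataFactoryV2LinkedService -DataFactoryName $df.DataFactoryName `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "AzureStorageLinkedService" `
    -DefinitionFile "C:\ADFv2QuickStartPSH\AzureStorageLinkedService.json"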

Create datasets

In this procedure, you create two datasets: InputDataset and OutputDataset. These datasets are of type AzureBlob. They refer to the Azure Storage linked service that you created in the previous section.

The input dataset represents the source data in the input folder. In the input dataset definition, you specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contain the source data.

The output dataset represents the data that's copied to the destination. In the output dataset definition, you specify the blob container (adftutorial), the folder (output), and the file to which the data is copied. Each run of a pipeline has a unique ID associated with it. You can access this ID by using the system variable RunId. The name of the output file is dynamically evaluated based on the run ID of the pipeline.

In the linked service settings, you specified the Azure Storage account that contains the source data. In the source dataset settings, you specify where exactly the source data resides (blob container, folder, and file). In the sink dataset settings, you specify where the data is copied to (blob container, folder, and file).

  1. Select the + (plus) button, and then select Dataset.

    (Screenshot: menu for creating a dataset)

  2. On the New Dataset page, select Azure Blob Storage, and then select Continue.

  3. On the Select Format page, choose the format type of your data, and then select Continue. In this case, select Binary, because you copy the files as-is without parsing their content.

    (Screenshot: the Select Format page)

  4. On the Set Properties page, complete the following steps:

    a. Under Name, enter InputDataset.

    b. For Linked service, select AzureStorageLinkedService.

    c. For File path, select the Browse button.

    d. In the Choose a file or folder window, browse to the input folder in the adftutorial container, select the emp.txt file, and then select OK.

    e. Select OK.

    (Screenshot: setting properties for InputDataset)

  5. Repeat the steps to create the output dataset:

    a. Select the + (plus) button, and then select Dataset.

    b. On the New Dataset page, select Azure Blob Storage, and then select Continue.

    c. On the Select Format page, choose the format type of your data, and then select Continue.

    d. On the Set Properties page, specify OutputDataset for the name. Select AzureStorageLinkedService as the linked service.

    e. Under File path, enter adftutorial/output. If the output folder doesn't exist, the copy activity creates it at runtime.

    f. Select OK.

    (Screenshot: setting properties for OutputDataset)
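For reference, the input dataset built above corresponds roughly to the following Binary dataset JSON, which can also be deployed from Az PowerShell. This is a sketch; the output dataset is identical except that its location has "folderPath": "output" and no "fileName".

@'
{
    "name": "InputDataset",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "adftutorial",
                "folderPath": "input",
                "fileName": "emp.txt"
            }
        }
    }
}
'@ | Set-Content -Path "C:\ADFv2QuickStartPSH\InputDataset.json"

Set-AzDataFactoryV2Dataset -DataFactoryName $df.DataFactoryName `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "InputDataset" `
    -DefinitionFile "C:\ADFv2QuickStartPSH\InputDataset.json"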

Create a pipeline

In this procedure, you create and validate a pipeline with a copy activity that uses the input and output datasets. The copy activity copies data from the file you specified in the input dataset settings to the file you specified in the output dataset settings. If the input dataset specifies only a folder (not the file name), the copy activity copies all the files in the source folder to the destination.

  1. Select the + (plus) button, and then select Pipeline.

  2. On the General tab, specify CopyPipeline for Name.

  3. In the Activities toolbox, expand Move & Transform. Drag the Copy Data activity from the Activities toolbox to the pipeline designer surface. You can also search for activities in the Activities toolbox. Specify CopyFromBlobToBlob for Name.

    (Screenshot: creating a copy data activity)

  4. Switch to the Source tab in the copy activity settings, and select InputDataset for Source Dataset.

  5. Switch to the Sink tab in the copy activity settings, and select OutputDataset for Sink Dataset.

  6. Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings. Confirm that the pipeline has been validated successfully. To close the validation output, select the >> (right arrow) button.

    (Screenshot: validating the pipeline)
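The pipeline assembled in the designer above corresponds to JSON along these lines; a sketch of deploying it from Az PowerShell, assuming the two Binary datasets shown earlier:

@'
{
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToBlob",
                "type": "Copy",
                "inputs": [ { "referenceName": "InputDataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "OutputDataset", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "BinarySource" },
                    "sink": { "type": "BinarySink" }
                }
            }
        ]
    }
}
'@ | Set-Content -Path "C:\ADFv2QuickStartPSH\CopyPipeline.json"

Set-AzDataFactoryV2Pipeline -DataFactoryName $df.DataFactoryName `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "CopyPipeline" `
    -DefinitionFile "C:\ADFv2QuickStartPSH\CopyPipeline.json"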

Debug the pipeline

In this step, you debug the pipeline before deploying it to Data Factory.

  1. On the pipeline toolbar above the canvas, click Debug to trigger a test run.

  2. Confirm that you see the status of the pipeline run on the Output tab of the pipeline settings at the bottom.

    (Screenshot: pipeline run output)

  3. Confirm that you see an output file in the output folder of the adftutorial container. If the output folder doesn't exist, the Data Factory service automatically creates it.

Trigger the pipeline manually

In this procedure, you deploy entities (linked services, datasets, pipelines) to Azure Data Factory. Then you manually trigger a pipeline run.

  1. Before you trigger a pipeline, you must publish entities to Data Factory. To publish, select Publish all at the top.

    (Screenshot: Publish all)

  2. To trigger the pipeline manually, select Add Trigger on the pipeline toolbar, and then select Trigger Now. On the Pipeline run page, select Finish.
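The scripted equivalent of Trigger Now is a single cmdlet that returns the run ID; a sketch (the monitoring sketch in the next section reuses $runId):

# Start a pipeline run and capture its run ID.
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $df.DataFactoryName `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -PipelineName "CopyPipeline"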

Monitor the pipeline

  1. Switch to the Monitor tab on the left. Use the Refresh button to refresh the list.

    (Screenshot: tab for monitoring pipeline runs)

  2. Select the CopyPipeline link. You'll see the status of the copy activity run on this page.

  3. To view details about the copy operation, select the Details (eyeglasses icon) link. For details about the properties, see the Copy Activity overview.

    (Screenshot: copy operation details)

  4. Confirm that you see a new file in the output folder.

  5. You can switch back to the Pipeline runs view from the Activity runs view by selecting the All pipeline runs link.
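From Az PowerShell, the same monitoring can be done by polling the run; a sketch that reuses the $runId captured earlier:

# Poll the pipeline run until it leaves the InProgress state.
while ($true) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName "ADFTutorialResourceGroup" `
        -DataFactoryName $df.DataFactoryName `
        -PipelineRunId $runId
    if ($run.Status -ne "InProgress") { break }
    Start-Sleep -Seconds 15
}
$run

# List the activity runs for that pipeline run (a generous time window).
Get-AzDataFactoryV2ActivityRun -DataFactoryName $df.DataFactoryName `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -PipelineRunId $runId `
    -RunStartedAfter (Get-Date).AddMinutes(-30) `
    -RunStartedBefore (Get-Date).AddMinutes(30)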

Trigger the pipeline on a schedule

This procedure is optional in this tutorial. You can create a scheduler trigger to schedule the pipeline to run periodically (hourly, daily, and so on). In this procedure, you create a trigger that runs every minute until the end date and time that you specify.

  1. Switch to the Author tab.

  2. Go to your pipeline, select Add Trigger on the pipeline toolbar, and then select New/Edit.

  3. On the Add Triggers page, select Choose trigger, and then select New.

  4. On the New Trigger page, under End, select On Date, specify an end time a few minutes after the current time, and then select OK.

    A cost is associated with each pipeline run, so specify the end time only minutes apart from the start time. Ensure that the start and end times fall on the same day. However, make sure there's enough time for the pipeline to run between the publish time and the end time. The trigger takes effect only after you publish the solution to Data Factory, not when you save the trigger in the UI.

  5. On the New Trigger page, select the Activated check box, and then select OK.

    (Screenshot: New Trigger settings)

  6. Review the warning message, and select OK.

  7. Select Publish all to publish your changes to Data Factory.

  8. Switch to the Monitor tab on the left. Select Refresh to refresh the list. You see that the pipeline runs once every minute from the publish time to the end time.

    Notice the values in the TRIGGERED BY column. The manual trigger run was from the step (Trigger Now) that you did earlier.

  9. Switch to the Trigger runs view.

  10. Confirm that an output file is created in the output folder for every pipeline run until the specified end date and time.
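A schedule trigger can also be defined and started from Az PowerShell. The following sketch mirrors the every-minute trigger above; the trigger name and the start and end timestamps are placeholders you must replace, and as in the UI, the trigger only fires after it's started.

@'
{
    "name": "RunEveryMinute",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 1,
                "startTime": "2019-01-01T00:00:00Z",
                "endTime": "2019-01-01T00:10:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyPipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
'@ | Set-Content -Path "C:\ADFv2QuickStartPSH\RunEveryMinute.json"

Set-AzDataFactoryV2Trigger -DataFactoryName $df.DataFactoryName `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "RunEveryMinute" `
    -DefinitionFile "C:\ADFv2QuickStartPSH\RunEveryMinute.json"

Start-AzDataFactoryV2Trigger -DataFactoryName $df.DataFactoryName `
    -ResourceGroupName "ADFTutorialResourceGroup" `
    -Name "RunEveryMinute" -Force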

Next steps

The pipeline in this sample copies data from one location to another location in Azure Blob storage. To learn about using Data Factory in more scenarios, go through the tutorials.