Transform data in an Azure Virtual Network using Hive activity in Azure Data Factory with the Azure portal

Applies to: Azure Data Factory, Azure Synapse Analytics

In this tutorial, you use the Azure portal to create a Data Factory pipeline that transforms data using a Hive activity on an HDInsight cluster that is in an Azure Virtual Network (VNet). You perform the following steps in this tutorial:

  • Create a data factory.
  • Create a self-hosted integration runtime.
  • Create Azure Storage and Azure HDInsight linked services.
  • Create a pipeline with a Hive activity.
  • Trigger a pipeline run.
  • Monitor the pipeline run.
  • Verify the output.

If you don't have an Azure subscription, create a trial account before you begin.

Prerequisites

Note

This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell. To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.

  • Azure Storage account. You create a Hive script and upload it to Azure Storage. The output from the Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage account as the primary storage.

  • Azure Virtual Network. If you don't have an Azure virtual network, create one by following these instructions. In this sample, the HDInsight cluster is in an Azure Virtual Network. Here is a sample configuration of the Azure Virtual Network; a minimal scripted alternative follows this list.

    Create virtual network

  • HDInsight cluster. Create an HDInsight cluster and join it to the virtual network you created in the previous step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a sample configuration of HDInsight in a virtual network.

    HDInsight in a virtual network

  • Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.

  • A virtual machine. Create an Azure virtual machine (VM) and join it to the same virtual network that contains your HDInsight cluster. For details, see How to create virtual machines.
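
If you prefer to create the virtual network from PowerShell instead of the portal, the following is a minimal sketch. The resource group name, network name, region, and address ranges are illustrative assumptions; HDInsight imposes additional subnet requirements that are covered in the article linked above.

    # Minimal VNet sketch -- names, region, and address ranges are assumptions.
    $subnet = New-AzVirtualNetworkSubnetConfig -Name "default" -AddressPrefix "10.6.0.0/24"
    New-AzVirtualNetwork -Name "MyHDIVNet" -ResourceGroupName "ADFTutorialResourceGroup" `
        -Location "China East 2" -AddressPrefix "10.6.0.0/16" -Subnet $subnet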

Upload the Hive script to your Blob storage account

  1. Create a Hive SQL file named hivescript.hql with the following content:

    -- Recreate the output table, backed by the folder passed in via the Output parameter.
    DROP TABLE IF EXISTS HiveSampleOut;
    CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION '${hiveconf:Output}';

    -- Copy the selected columns from the built-in sample table into the output table.
    INSERT OVERWRITE TABLE HiveSampleOut
    SELECT
        clientid,
        market,
        devicemodel,
        state
    FROM hivesampletable;
    
  2. In your Azure Blob storage, create a container named adftutorial if it does not exist.

  3. In the adftutorial container, create a folder named hivescripts.

  4. Upload the hivescript.hql file to the hivescripts subfolder. You can also script the upload; see the sketch after this list.
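
A hedged PowerShell sketch of the upload, assuming placeholder account credentials:

    # Upload hivescript.hql to the adftutorial container -- account name/key are placeholders.
    $ctx = New-AzStorageContext -StorageAccountName "<StorageAccount>" -StorageAccountKey "<key>"
    New-AzStorageContainer -Name "adftutorial" -Context $ctx -ErrorAction SilentlyContinue
    Set-AzStorageBlobContent -File ".\hivescript.hql" -Container "adftutorial" `
        -Blob "hivescripts/hivescript.hql" -Context $ctx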

Create a data factory

  1. Launch the Microsoft Edge or Google Chrome web browser. Currently, the Data Factory UI is supported only in Microsoft Edge and Google Chrome web browsers.

  2. Log in to the Azure portal.

  3. Click New on the left menu, click Data + Analytics, and click Data Factory.

    New -> Data Factory

  4. In the New data factory page, enter ADFTutorialHiveFactory for the name.

    New data factory page

    The name of the Azure data factory must be globally unique. If you receive the following error, change the name of the data factory (for example, yournameMyAzureSsisDataFactory) and try creating again. See the Data Factory - Naming Rules article for naming rules for Data Factory artifacts.

    Data factory name "MyAzureSsisDataFactory" is not available

  5. Select the Azure subscription in which you want to create the data factory.

  6. For the Resource Group, do one of the following steps:

    • Select Use existing, and select an existing resource group from the drop-down list.

    • Select Create new, and enter the name of a resource group.

      To learn about resource groups, see Using resource groups to manage your Azure resources.

  7. Select the location for the data factory. Only locations that are supported for the creation of data factories are shown in the list.

  8. Select Pin to dashboard.

  9. Click Create.

  10. On the dashboard, you see the following tile with the status Deploying data factory.

    Deploying data factory tile

  11. After the creation is complete, you see the Data Factory page as shown in the image.

    Data factory home page

  12. Click Author & Monitor to launch the Data Factory user interface (UI) in a separate tab.

  13. On the get started page, switch to the Edit tab in the left panel as shown in the following image:

    Edit tab
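
The data factory can also be created from Azure PowerShell. A minimal sketch, assuming an illustrative resource group name and the China East 2 region:

    # Sign in to the Azure China cloud and create the factory -- names and region are assumptions.
    Connect-AzAccount -Environment AzureChinaCloud
    $rg = "ADFTutorialResourceGroup"
    $df = "ADFTutorialHiveFactory"      # must be globally unique
    New-AzResourceGroup -Name $rg -Location "China East 2"
    Set-AzDataFactoryV2 -ResourceGroupName $rg -Name $df -Location "China East 2"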

Create a self-hosted integration runtime

As the Hadoop cluster is inside a virtual network, you need to install a self-hosted integration runtime (IR) in the same virtual network. In this section, you create a new VM, join it to the same virtual network, and install the self-hosted IR on it. The self-hosted IR allows the Data Factory service to dispatch processing requests to a compute service such as HDInsight inside a virtual network. It also allows you to move data to/from data stores inside a virtual network to Azure. You use a self-hosted IR when the data store or compute is in an on-premises environment as well.

  1. In the Azure Data Factory UI, click Connections at the bottom of the window, switch to the Integration Runtimes tab, and click the + New button on the toolbar.

    New integration runtime menu

  2. In the Integration Runtime Setup window, select the Perform data movement and dispatch activities to external computes option, and click Next.

    Select perform data movement and dispatch activities option

  3. Select Private Network, and click Next.

    Select private network

  4. Enter MySelfHostedIR for Name, and click Next.

    Specify integration runtime name

  5. Copy the authentication key for the integration runtime by clicking the copy button, and save it. Keep the window open. You use this key later to register the IR installed in a virtual machine.

    Copy authentication key
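
Equivalently, the IR resource and its authentication keys can be created and retrieved from PowerShell. A sketch, reusing the illustrative names from the earlier snippet:

    # Create the self-hosted IR resource, then fetch its authentication keys.
    $rg = "ADFTutorialResourceGroup"
    $df = "ADFTutorialHiveFactory"
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $rg -DataFactoryName $df `
        -Name "MySelfHostedIR" -Type SelfHosted
    Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $rg -DataFactoryName $df `
        -Name "MySelfHostedIR"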

Install the IR on a virtual machine

  1. On the Azure VM, download the self-hosted integration runtime. Use the authentication key obtained in the previous step to manually register the self-hosted integration runtime.

    Register integration runtime

  2. You see the following message when the self-hosted integration runtime is registered successfully:

    Registered successfully

  3. Click Launch Configuration Manager. You see the following page when the node is connected to the cloud service:

    Node is connected
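
Registration can also be done from the command line with the dmgcmd tool that ships with the IR. A sketch, assuming the default installation path (the version folder varies by release):

    # Register this node using the key copied earlier -- install path is an assumption.
    $dmgcmd = "C:\Program Files\Microsoft Integration Runtime\5.0\Shared\dmgcmd.exe"
    & $dmgcmd -RegisterNewNode "<authentication key>"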

Self-hosted IR in the Azure Data Factory UI

  1. In the Azure Data Factory UI, you should see the name of the self-hosted IR node and its status.

    Existing self-hosted node

  2. Click Finish to close the Integration Runtime Setup window. You see the self-hosted IR in the list of integration runtimes.

    Self-hosted IR in the list

Create linked services

You author and deploy two linked services in this section:

  • An Azure Storage linked service that links an Azure Storage account to the data factory. This storage is the primary storage used by your HDInsight cluster. In this case, you use this Azure Storage account to store the Hive script and the output of the script.
  • An HDInsight linked service. Azure Data Factory submits the Hive script to this HDInsight cluster for execution.

Create the Azure Storage linked service

  1. Switch to the Linked Services tab, and click New.

    New linked service button

  2. In the New Linked Service window, select Azure Blob Storage, and click Continue.

    Select Azure Blob Storage

  3. In the New Linked Service window, do the following steps:

    1. Enter AzureStorageLinkedService for Name.

    2. Select MySelfHostedIR for Connect via integration runtime.

    3. Select your Azure storage account for Storage account name.

    4. To test the connection to the storage account, click Test connection.

    5. Click Save.

      Specify Azure Blob storage account
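
Under the hood, this produces a JSON linked-service definition. A hedged sketch of deploying the same linked service from PowerShell, with placeholder credentials:

    # Deploy the linked service from a JSON definition -- account name/key are placeholders.
    $rg = "ADFTutorialResourceGroup"
    $df = "ADFTutorialHiveFactory"
    $def = @{
        name = "AzureStorageLinkedService"
        properties = @{
            type = "AzureBlobStorage"
            typeProperties = @{
                connectionString = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.chinacloudapi.cn"
            }
            connectVia = @{ referenceName = "MySelfHostedIR"; type = "IntegrationRuntimeReference" }
        }
    }
    $def | ConvertTo-Json -Depth 10 | Set-Content -Path .\AzureStorageLinkedService.json
    Set-AzDataFactoryV2LinkedService -ResourceGroupName $rg -DataFactoryName $df `
        -Name "AzureStorageLinkedService" -DefinitionFile ".\AzureStorageLinkedService.json"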

Create the HDInsight linked service

  1. Click New again to create another linked service.

    New linked service button

  2. Switch to the Compute tab, select Azure HDInsight, and click Continue.

    Select Azure HDInsight

  3. In the New Linked Service window, do the following steps:

    1. Enter AzureHDInsightLinkedService for Name.

    2. Select Bring your own HDInsight.

    3. Select your HDInsight cluster for Hdi cluster.

    4. Enter the user name for the HDInsight cluster.

    5. Enter the password for the user.

      Azure HDInsight settings
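
The corresponding JSON definition can be deployed from PowerShell as well. A sketch, assuming placeholder cluster URI and credentials:

    # Deploy the HDInsight linked service -- cluster URI and credentials are placeholders.
    $rg = "ADFTutorialResourceGroup"
    $df = "ADFTutorialHiveFactory"
    $def = @{
        name = "AzureHDInsightLinkedService"
        properties = @{
            type = "HDInsight"
            typeProperties = @{
                clusterUri = "https://<clustername>.azurehdinsight.cn"
                userName   = "<cluster login user>"
                password   = @{ type = "SecureString"; value = "<password>" }
                linkedServiceName = @{ referenceName = "AzureStorageLinkedService"; type = "LinkedServiceReference" }
            }
            connectVia = @{ referenceName = "MySelfHostedIR"; type = "IntegrationRuntimeReference" }
        }
    }
    $def | ConvertTo-Json -Depth 10 | Set-Content -Path .\AzureHDInsightLinkedService.json
    Set-AzDataFactoryV2LinkedService -ResourceGroupName $rg -DataFactoryName $df `
        -Name "AzureHDInsightLinkedService" -DefinitionFile ".\AzureHDInsightLinkedService.json"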

This article assumes that you have access to the cluster over the internet; for example, that you can connect to the cluster at https://clustername.azurehdinsight.cn. This address uses the public gateway, which is not available if you have used network security groups (NSGs) or user-defined routes (UDRs) to restrict access from the internet. For Data Factory to be able to submit jobs to an HDInsight cluster in an Azure Virtual Network, you need to configure your Azure Virtual Network in such a way that the URL can be resolved to the private IP address of the gateway used by HDInsight.

  1. From the Azure portal, open the virtual network the HDInsight cluster is in. Open the network interface with a name starting with nic-gateway-0, and note down its private IP address, for example, 10.6.0.15.

  2. If your Azure Virtual Network has a DNS server, update the DNS record so the HDInsight cluster URL https://<clustername>.azurehdinsight.cn can be resolved to 10.6.0.15. If you don't have a DNS server in your Azure Virtual Network, you can temporarily work around this by editing the hosts file (C:\Windows\System32\drivers\etc\hosts) of all VMs that are registered as self-hosted integration runtime nodes, adding an entry similar to the following one:

    10.6.0.15 myHDIClusterName.azurehdinsight.cn
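
A sketch of scripting this workaround on an IR node (run from an elevated session); the resource group that holds the gateway NIC is an assumption you must fill in:

    # Look up the gateway NIC's private IP, then append a hosts entry on this VM.
    $nic = Get-AzNetworkInterface -ResourceGroupName "<resource group containing the HDInsight NICs>" |
        Where-Object { $_.Name -like "nic-gateway-0*" }
    $ip = $nic.IpConfigurations[0].PrivateIpAddress
    Add-Content -Path "C:\Windows\System32\drivers\etc\hosts" `
        -Value "$ip <clustername>.azurehdinsight.cn"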

Create a pipeline

In this step, you create a new pipeline with a Hive activity. The activity executes the Hive script to return data from a sample table and save it to a path you define.

Note the following points:

  • scriptPath points to the path to the Hive script on the Azure Storage account you used for AzureStorageLinkedService. The path is case-sensitive.
  • Output is an argument used in the Hive script. Use the format wasbs://<Container>@<StorageAccount>.blob.core.chinacloudapi.cn/outputfolder/ to point it to an existing folder on your Azure Storage. The path is case-sensitive.

  1. In the Data Factory UI, click + (plus) in the left pane, and click Pipeline.

    New pipeline menu

  2. In the Activities toolbox, expand HDInsight, and drag-drop the Hive activity to the pipeline designer surface.

    Drag-drop Hive activity

  3. In the properties window, switch to the HDI Cluster tab, and select AzureHDInsightLinkedService for HDInsight Linked Service.

    Select HDInsight linked service

  4. Switch to the Scripts tab, and do the following steps:

    1. Select AzureStorageLinkedService for Script Linked Service.

    2. For File Path, click Browse Storage.

      Browse storage

    3. In the Choose a file or folder window, navigate to the hivescripts folder of the adftutorial container, select hivescript.hql, and click Finish.

      Choose a file or folder

    4. Confirm that you see adftutorial/hivescripts/hivescript.hql for File Path.

      Script settings

    5. In the Script tab, expand the Advanced section.

    6. Click Auto-fill from script for Parameters.

    7. Enter the value for the Output parameter in the following format: wasbs://<Blob Container>@<StorageAccount>.blob.core.chinacloudapi.cn/outputfolder/. For example: wasbs://adftutorial@mystorageaccount.blob.core.chinacloudapi.cn/outputfolder/.

      Script parameters

  5. To publish artifacts to Data Factory, click Publish.

    Screenshot showing the option to publish to Data Factory.
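
For reference, the published pipeline corresponds to a JSON definition and can also be deployed from PowerShell. A hedged sketch; the pipeline and activity names and the storage account placeholder are assumptions:

    # Deploy the pipeline from a JSON definition -- the Output value is a placeholder.
    $rg = "ADFTutorialResourceGroup"
    $df = "ADFTutorialHiveFactory"
    $def = @{
        name = "MyHivePipeline"
        properties = @{
            activities = @(
                @{
                    name = "MyHiveActivity"
                    type = "HDInsightHive"
                    linkedServiceName = @{ referenceName = "AzureHDInsightLinkedService"; type = "LinkedServiceReference" }
                    typeProperties = @{
                        scriptPath = "adftutorial/hivescripts/hivescript.hql"
                        scriptLinkedService = @{ referenceName = "AzureStorageLinkedService"; type = "LinkedServiceReference" }
                        defines = @{ Output = "wasbs://adftutorial@<StorageAccount>.blob.core.chinacloudapi.cn/outputfolder/" }
                    }
                }
            )
        }
    }
    $def | ConvertTo-Json -Depth 10 | Set-Content -Path .\MyHivePipeline.json
    Set-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $df `
        -Name "MyHivePipeline" -DefinitionFile ".\MyHivePipeline.json"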

Trigger a pipeline run

  1. First, validate the pipeline by clicking the Validate button on the toolbar. Close the Pipeline Validation Output window by clicking the right arrow (>>).

    Validate pipeline

  2. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now.

    Trigger now
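
A run can also be started from PowerShell. Invoke-AzDataFactoryV2Pipeline returns a run ID you can use for monitoring; the pipeline name here is the assumed name from the earlier sketch:

    # Start a run of the pipeline and capture the run ID.
    $rg = "ADFTutorialResourceGroup"
    $df = "ADFTutorialHiveFactory"
    $runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $df `
        -PipelineName "MyHivePipeline"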

Monitor the pipeline run

  1. Switch to the Monitor tab on the left. You see a pipeline run in the Pipeline Runs list.

    Monitor pipeline runs

  2. To refresh the list, click Refresh.

  3. To view activity runs associated with the pipeline run, click View activity runs in the Action column. Other action links are for stopping/rerunning the pipeline.

    View activity runs

  4. You see only one activity run, since there is only one activity in the pipeline, of type HDInsightHive. To switch back to the previous view, click the Pipelines link at the top.

    Activity runs

  5. Confirm that you see an output file in the outputfolder of the adftutorial container.

    Output file
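
The same checks can be scripted. A sketch, assuming the $runId captured in the previous snippet and placeholder storage credentials:

    # Check the pipeline and activity run status, then list the output blobs.
    $rg = "ADFTutorialResourceGroup"
    $df = "ADFTutorialHiveFactory"
    Get-AzDataFactoryV2PipelineRun -ResourceGroupName $rg -DataFactoryName $df `
        -PipelineRunId $runId
    Get-AzDataFactoryV2ActivityRun -ResourceGroupName $rg -DataFactoryName $df `
        -PipelineRunId $runId -RunStartedAfter (Get-Date).AddHours(-1) -RunStartedBefore (Get-Date)

    # Verify the output file exists in the adftutorial container.
    $ctx = New-AzStorageContext -StorageAccountName "<StorageAccount>" -StorageAccountKey "<key>"
    Get-AzStorageBlob -Container "adftutorial" -Prefix "outputfolder/" -Context $ctx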

Next steps

You performed the following steps in this tutorial:

  • Create a data factory.
  • Create a self-hosted integration runtime.
  • Create Azure Storage and Azure HDInsight linked services.
  • Create a pipeline with a Hive activity.
  • Trigger a pipeline run.
  • Monitor the pipeline run.
  • Verify the output.

Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure: