Use Apache Oozie with Apache Hadoop to define and run a workflow in HDInsight

Learn how to use Apache Oozie to define a workflow and run it on HDInsight. To learn about the Oozie coordinator, see Use time-based Apache Oozie Coordinator with HDInsight.

Apache Oozie is a workflow/coordination system that manages Hadoop jobs. It is integrated with the Hadoop stack, and it supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. It can also be used to schedule jobs that are specific to a system, like Java programs or shell scripts.

The workflow you implement by following the instructions in this tutorial contains two actions:

Workflow diagram

  1. A Hive action runs a HiveQL script to count the occurrences of each log-level type in an Apache Log4j log file. Each line in a log4j file contains a [LOG LEVEL] field that shows the type and the severity, for example:

     2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851
     2012-02-03 18:35:34 SampleClass4 [FATAL] system problem at id 1991281254
     2012-02-03 18:35:34 SampleClass3 [DEBUG] detail for id 1304807656
     ...
    

    The Hive script output is similar to:

     [DEBUG] 434
     [ERROR] 3
     [FATAL] 1
     [INFO]  96
     [TRACE] 816
     [WARN]  4
    

    For more information about Hive, see Use Apache Hive with HDInsight.

  2. A Sqoop action exports the HiveQL output to a table in an Azure SQL database. For more information about Sqoop, see Use Apache Sqoop with HDInsight.
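The aggregation that the Hive action performs can be sketched outside Hadoop. The following Python snippet (an illustration only, not part of the tutorial's scripts) extracts the [LOG LEVEL] field from lines in the log4j format shown above and tallies each level:

```python
import re
from collections import Counter

# Sample lines in the log4j format shown above (abbreviated for illustration).
log_lines = [
    "2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851",
    "2012-02-03 18:35:34 SampleClass4 [FATAL] system problem at id 1991281254",
    "2012-02-03 18:35:34 SampleClass3 [DEBUG] detail for id 1304807656",
    "2012-02-03 18:35:34 SampleClass2 [INFO] detail for id 1304807656",
]

# Pull the bracketed log level out of each line and count occurrences;
# this mirrors what the HiveQL GROUP BY computes over the full data set.
counts = Counter(
    match.group(0)
    for line in log_lines
    if (match := re.search(r"\[[A-Z]+\]", line))
)
print(counts)
```

The Hive action computes the same aggregate over the entire data set in parallel, which is the reason to run it as a Hadoop job rather than a local script.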

Note

For supported Oozie versions on HDInsight clusters, see What's new in the Apache Hadoop cluster versions provided by HDInsight?

Prerequisites

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.

Before you begin this tutorial, you must have the following item:

  • A workstation with Azure PowerShell.

Important

Azure PowerShell support for managing HDInsight resources through Azure Service Manager is deprecated and was removed on January 1, 2017. The steps in this document use the new HDInsight cmdlets that work with Azure Resource Manager.

Follow the steps in Install and configure Azure PowerShell to install the latest version of Azure PowerShell. If you have scripts that need to be modified to use the new cmdlets that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.

Oozie workflow definitions are written in hPDL (an XML process definition language). The default workflow file name is workflow.xml. The following is the workflow file used in this tutorial.

    <workflow-app name="useooziewf" xmlns="uri:oozie:workflow:0.2">
        <start to = "RunHiveScript"/>

        <action name="RunHiveScript">
            <hive xmlns="uri:oozie:hive-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.job.queue.name</name>
                        <value>${queueName}</value>
                    </property>
                </configuration>
                <script>${hiveScript}</script>
                <param>hiveTableName=${hiveTableName}</param>
                <param>hiveDataFolder=${hiveDataFolder}</param>
                <param>hiveOutputFolder=${hiveOutputFolder}</param>
            </hive>
            <ok to="RunSqoopExport"/>
            <error to="fail"/>
        </action>

        <action name="RunSqoopExport">
            <sqoop xmlns="uri:oozie:sqoop-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.compress.map.output</name>
                        <value>true</value>
                    </property>
                </configuration>
            <arg>export</arg>
            <arg>--connect</arg>
            <arg>${sqlDatabaseConnectionString}</arg>
            <arg>--table</arg>
            <arg>${sqlDatabaseTableName}</arg>
            <arg>--export-dir</arg>
            <arg>${hiveOutputFolder}</arg>
            <arg>-m</arg>
            <arg>1</arg>
            <arg>--input-fields-terminated-by</arg>
            <arg>"\001"</arg>
            </sqoop>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}] </message>
        </kill>

        <end name="end"/>
    </workflow-app>

There are two actions defined in the workflow. The start-to action is RunHiveScript. If that action runs successfully, the next action is RunSqoopExport.
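The control flow defined in workflow.xml can be sketched as a small transition table; the node names below are taken from the workflow file, and walking the success branch shows the happy path:

```python
# Transition table for the workflow in workflow.xml: each node maps to
# (next node on success, next node on error).
transitions = {
    "start": ("RunHiveScript", None),
    "RunHiveScript": ("RunSqoopExport", "fail"),
    "RunSqoopExport": ("end", "fail"),
}

# Walk the success branch from start until a terminal node is reached.
node, path = "start", []
while node not in ("end", "fail"):
    path.append(node)
    node = transitions[node][0]
print(path + [node])
```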

The RunHiveScript action uses several variables. You pass the values when you submit the Oozie job from your workstation by using Azure PowerShell.

Workflow variables:

  • ${jobTracker}: Specifies the URL of the Hadoop job tracker. Use jobtrackerhost:9010 in HDInsight versions 3.0 and 2.1.
  • ${nameNode}: Specifies the URL of the Hadoop name node. Use the default file system address, for example, wasb://<containerName>@<storageAccountName>.blob.core.chinacloudapi.cn.
  • ${queueName}: Specifies the queue name that the job is submitted to. Use default.

Hive action variables:

  • ${hiveDataFolder}: Specifies the source directory for the Hive CREATE TABLE command.
  • ${hiveOutputFolder}: Specifies the output folder for the INSERT OVERWRITE statement.
  • ${hiveTableName}: Specifies the name of the Hive table that references the log4j data files.

Sqoop action variables:

  • ${sqlDatabaseConnectionString}: Specifies the Azure SQL database connection string.
  • ${sqlDatabaseTableName}: Specifies the Azure SQL database table that the data is exported to.
  • ${hiveOutputFolder}: Specifies the output folder of the Hive INSERT OVERWRITE statement. This is the same folder used for the Sqoop export (--export-dir).

For more information about Oozie workflows and workflow actions, see the Apache Oozie 4.0 documentation (for HDInsight version 3.0) or the Apache Oozie 3.3.2 documentation (for HDInsight version 2.1).

The Hive action in the workflow calls a HiveQL script file. This script file contains three HiveQL statements:

    DROP TABLE ${hiveTableName};
    CREATE EXTERNAL TABLE ${hiveTableName}(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '${hiveDataFolder}';
    INSERT OVERWRITE DIRECTORY '${hiveOutputFolder}' SELECT t4 AS sev, COUNT(*) AS cnt FROM ${hiveTableName} WHERE t4 LIKE '[%' GROUP BY t4;

  1. The DROP TABLE statement deletes the log4j Hive table if it exists.
  2. The CREATE TABLE statement creates an external log4j Hive table that points to the location of the log4j log file. The field delimiter is a space (' '); the default line delimiter is "\n". An external table is used so the data file is not removed from its original location, in case you want to run the Oozie workflow multiple times.
  3. The INSERT OVERWRITE statement counts the occurrences of each log-level type from the log4j Hive table and saves the output to a blob in Azure Storage.
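One detail worth noting: by default, Hive's INSERT OVERWRITE DIRECTORY output separates fields with the Ctrl-A character (\001), which is why the Sqoop action in workflow.xml passes --input-fields-terminated-by "\001". A minimal Python sketch of one such output row:

```python
# A row like those Hive's INSERT OVERWRITE DIRECTORY writes: fields are
# separated by Ctrl-A (\x01). The values mirror one row of the sample output.
row = "[DEBUG]\x01434"

# Sqoop splits on the same delimiter when exporting the directory to SQL.
level, total = row.split("\x01")
print(level, int(total))
```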

Three variables are used in the script:

  • ${hiveTableName}
  • ${hiveDataFolder}
  • ${hiveOutputFolder}

The workflow definition file (workflow.xml in this tutorial) passes these values to the HiveQL script at run time.
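Oozie's substitution of ${name} placeholders can be illustrated with Python's string.Template, which happens to use the same syntax. The fragment and values below are illustrative, not taken verbatim from the tutorial files:

```python
from string import Template

# A fragment of HiveQL with Oozie-style ${name} placeholders.
hql = "CREATE EXTERNAL TABLE ${hiveTableName} ... LOCATION '${hiveDataFolder}';"

# Values like these are supplied through workflow.xml's <param> elements.
params = {
    "hiveTableName": "log4jlogs",
    "hiveDataFolder": "wasb:///tutorials/useoozie/data",
}

# Resolve every placeholder before the script would run.
resolved = Template(hql).substitute(params)
print(resolved)
```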

Both the workflow file and the HiveQL file are stored in a blob container. The PowerShell script you use later in this tutorial copies both files to the default storage account.

Submit Oozie jobs by using PowerShell

Azure PowerShell currently doesn't provide any cmdlets for defining Oozie jobs. You can use the Invoke-RestMethod cmdlet to call the Oozie web services. The Oozie web services API is an HTTP REST JSON API. For more information about the Oozie web services API, see the Apache Oozie 4.0 documentation (for HDInsight version 3.0) or the Apache Oozie 3.3.2 documentation (for HDInsight version 2.1).
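As a sketch of what Invoke-RestMethod does against the Oozie endpoint, the following Python snippet builds (but deliberately does not send) the same admin status request; the cluster name and credentials are placeholders:

```python
import base64
import urllib.request

# Placeholder values; substitute your own cluster name and HTTP credentials.
cluster_name = "myhdicluster"
http_user, http_password = "admin", "password"

# The Oozie admin status endpoint exposed through the HDInsight gateway.
url = f"https://{cluster_name}.azurehdinsight.cn:443/oozie/v2/admin/status"

# The HDInsight gateway uses HTTP basic authentication.
token = base64.b64encode(f"{http_user}:{http_password}".encode()).decode()
request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

# urllib.request.urlopen(request) would return JSON such as {"systemMode": "NORMAL"};
# the request is only constructed here so the example runs without a cluster.
print(request.full_url)
```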

The PowerShell script in this section performs the following steps:

  1. Connect to Azure.

  2. Create an Azure resource group. For more information, see Use Azure PowerShell with Azure Resource Manager.

  3. Create an Azure SQL Database server, an Azure SQL database, and a table. These items are used by the Sqoop action in the workflow.

    The table name is log4jLogsCount.

  4. Create an HDInsight cluster that is used to run Oozie jobs.

    To examine the cluster, you can use the Azure portal or Azure PowerShell.

  5. Copy the Oozie workflow file and the HiveQL script file to the default file system.

    Both files are stored in a public blob container.

    • Copy the HiveQL script (useooziewf.hql) to Azure Storage (wasb:///tutorials/useoozie/useooziewf.hql).
    • Copy workflow.xml to wasb:///tutorials/useoozie/workflow.xml.
    • Copy the data file (/example/data/sample.log) to wasb:///tutorials/useoozie/data/sample.log.
  6. Submit an Oozie job.

    To examine the Oozie job results, use Visual Studio or other tools to connect to the Azure SQL database.

Here is the script. You can run the script from Windows PowerShell ISE. You only need to configure the first seven variables.

    #region - provide the following values

    $subscriptionID = "<Enter your Azure subscription ID>"

    # SQL Database server login credentials used for creating and connecting
    $sqlDatabaseLogin = "<Enter SQL Database Login Name>"
    $sqlDatabasePassword = "<Enter SQL Database Login Password>"

    # HDInsight cluster HTTP user credential used for creating and connecting
    $httpUserName = "admin"  # The default name is "admin"
    $httpPassword = "<Enter HDInsight Cluster HTTP User Password>"

    # Used for creating Azure service names
    $nameToken = "<Enter an Alias>"
    $namePrefix = $nameToken.ToLower() + (Get-Date -Format "MMdd")
    #endregion

    #region - variables

    # Resource group variables
    $resourceGroupName = $namePrefix + "rg"
    $location = "China East" # used by all Azure services defined in this tutorial

    # SQL database variables
    $sqlDatabaseServerName = $namePrefix + "sqldbserver"
    $sqlDatabaseName = $namePrefix + "sqldb"
    $sqlDatabaseConnectionString = "Data Source=$sqlDatabaseServerName.database.chinacloudapi.cn;Initial Catalog=$sqlDatabaseName;User ID=$sqlDatabaseLogin;Password=$sqlDatabasePassword;Encrypt=true;Trusted_Connection=false;"
    $sqlDatabaseMaxSizeGB = 10

    # Used for retrieving external IP address and creating firewall rules
    $ipAddressRestService = "https://bot.whatismyipaddress.com"
    $fireWallRuleName = "UseSqoop"

    # HDInsight variables
    $hdinsightClusterName = $namePrefix + "hdi"
    $defaultStorageAccountName = $namePrefix + "store"
    $defaultBlobContainerName = $hdinsightClusterName
    #endregion

    # Treat all errors as terminating
    $ErrorActionPreference = "Stop"

    #region - Connect to Azure subscription
    Write-Host "`nConnecting to your Azure subscription ..." -ForegroundColor Green
    try{Get-AzContext}
    catch{
        Connect-AzAccount -EnvironmentName AzureChinaCloud
        Select-AzSubscription -SubscriptionId $subscriptionID
    }
    #endregion

    #region - Create Azure resource group
    Write-Host "`nCreating an Azure resource group ..." -ForegroundColor Green
    try{
        Get-AzResourceGroup -Name $resourceGroupName
    }
    catch{
        New-AzResourceGroup -Name $resourceGroupName -Location $location
    }
    #endregion

    #region - Create Azure SQL database server
    Write-Host "`nCreating an Azure SQL Database server ..." -ForegroundColor Green
    try{
        Get-AzSqlServer -ServerName $sqlDatabaseServerName -ResourceGroupName $resourceGroupName}
    catch{
        Write-Host "`nCreating SQL Database server ..."  -ForegroundColor Green

        $sqlDatabasePW = ConvertTo-SecureString -String $sqlDatabasePassword -AsPlainText -Force
        $sqlLoginCredentials = New-Object System.Management.Automation.PSCredential($sqlDatabaseLogin,$sqlDatabasePW)

        $sqlDatabaseServerName = (New-AzSqlServer `
                                    -ResourceGroupName $resourceGroupName `
                                    -ServerName $sqlDatabaseServerName `
                                    -SqlAdministratorCredentials $sqlLoginCredentials `
                                    -Location $location).ServerName
        Write-Host "`tThe new SQL database server name is $sqlDatabaseServerName." -ForegroundColor Cyan

        Write-Host "`nCreating firewall rule, $fireWallRuleName ..." -ForegroundColor Green
        $workstationIPAddress = Invoke-RestMethod $ipAddressRestService
        New-AzSqlServerFirewallRule `
            -ResourceGroupName $resourceGroupName `
            -ServerName $sqlDatabaseServerName `
            -FirewallRuleName "$fireWallRuleName-workstation" `
            -StartIpAddress $workstationIPAddress `
            -EndIpAddress $workstationIPAddress

        #To allow other Azure services to access the server add a firewall rule and set both the StartIpAddress and EndIpAddress to 0.0.0.0. 
        #Note that this allows Azure traffic from any Azure subscription to access the server.
        New-AzSqlServerFirewallRule `
            -ResourceGroupName $resourceGroupName `
            -ServerName $sqlDatabaseServerName `
            -FirewallRuleName "$fireWallRuleName-Azureservices" `
            -StartIpAddress "0.0.0.0" `
            -EndIpAddress "0.0.0.0"
    }
    #endregion

    #region - Create and validate Azure SQL database
    Write-Host "`nCreating SQL Database, $sqlDatabaseName ..."  -ForegroundColor Green

    try {
        Get-AzSqlDatabase `
            -ResourceGroupName $resourceGroupName `
            -ServerName $sqlDatabaseServerName `
            -DatabaseName $sqlDatabaseName
    }
    catch {
        New-AzSqlDatabase `
            -ResourceGroupName $resourceGroupName `
            -ServerName $sqlDatabaseServerName `
            -DatabaseName $sqlDatabaseName `
            -Edition "Standard" `
            -RequestedServiceObjectiveName "S1"
    }
    #endregion

    #region - Create SQL database tables
    Write-Host "Creating the log4jlogs table  ..." -ForegroundColor Green

    $sqlDatabaseTableName = "log4jLogsCount"
    $cmdCreateLog4jCountTable = " CREATE TABLE [dbo].[$sqlDatabaseTableName](
            [Level] [nvarchar](10) NOT NULL,
            [Total] float,
        CONSTRAINT [PK_$sqlDatabaseTableName] PRIMARY KEY CLUSTERED
        (
        [Level] ASC
        )
        )"

    $conn = New-Object System.Data.SqlClient.SqlConnection
    $conn.ConnectionString = $sqlDatabaseConnectionString
    $conn.Open()

    # Create the log4jlogs table and index
    $cmd = New-Object System.Data.SqlClient.SqlCommand
    $cmd.Connection = $conn
    $cmd.CommandText = $cmdCreateLog4jCountTable
    $cmd.ExecuteNonQuery()

    $conn.close()
    #endregion

    #region - Create HDInsight cluster

    Write-Host "Creating the HDInsight cluster and the dependent services ..." -ForegroundColor Green

    # Create the default storage account
    New-AzStorageAccount `
        -ResourceGroupName $resourceGroupName `
        -Name $defaultStorageAccountName `
        -Location $location `
        -Type Standard_LRS

    # Create the default Blob container
    $defaultStorageAccountKey = (Get-AzStorageAccountKey `
                                    -ResourceGroupName $resourceGroupName `
                                    -Name $defaultStorageAccountName)[0].Value
    $defaultStorageAccountContext = New-AzStorageContext `
                                        -StorageAccountName $defaultStorageAccountName `
                                        -StorageAccountKey $defaultStorageAccountKey 
    New-AzStorageContainer `
        -Name $defaultBlobContainerName `
        -Context $defaultStorageAccountContext 

    # Create the HDInsight cluster
    $pw = ConvertTo-SecureString -String $httpPassword -AsPlainText -Force
    $httpCredential = New-Object System.Management.Automation.PSCredential($httpUserName,$pw)

    New-AzHDInsightCluster `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $hdinsightClusterName `
        -Location $location `
        -ClusterType Hadoop `
        -OSType Windows `
        -ClusterSizeInNodes 2 `
        -HttpCredential $httpCredential `
        -DefaultStorageAccountName "$defaultStorageAccountName.blob.core.chinacloudapi.cn" `
        -DefaultStorageAccountKey $defaultStorageAccountKey `
        -DefaultStorageContainer $defaultBlobContainerName 

    # Validate the cluster
    Get-AzHDInsightCluster -ClusterName $hdinsightClusterName
    #endregion

    #region - copy Oozie workflow and HiveQL files

    Write-Host "Copy workflow definition and HiveQL script file ..." -ForegroundColor Green

    # Both files are stored in a public Blob
    $publicBlobContext = New-AzStorageContext -StorageAccountName "hditutorialdata" -Anonymous

    # WASB folder for storing the Oozie tutorial files.
    $destFolder = "tutorials/useoozie"  # Do NOT use the long path here

    Start-CopyAzureStorageBlob `
        -Context $publicBlobContext `
        -SrcContainer "useoozie" `
        -SrcBlob "useooziewf.hql"  `
        -DestContext $defaultStorageAccountContext `
        -DestContainer $defaultBlobContainerName `
        -DestBlob "$destFolder/useooziewf.hql" `
        -Force

    Start-CopyAzureStorageBlob `
        -Context $publicBlobContext `
        -SrcContainer "useoozie" `
        -SrcBlob "workflow.xml"  `
        -DestContext $defaultStorageAccountContext `
        -DestContainer $defaultBlobContainerName `
        -DestBlob "$destFolder/workflow.xml" `
        -Force

    #validate the copy
    Get-AzStorageBlob `
        -Context $defaultStorageAccountContext `
        -Container $defaultBlobContainerName `
        -Blob $destFolder/workflow.xml

    Get-AzStorageBlob `
        -Context $defaultStorageAccountContext `
        -Container $defaultBlobContainerName `
        -Blob $destFolder/useooziewf.hql

    #endregion

    #region - copy the sample.log file

    Write-Host "Make a copy of the sample.log file ... " -ForegroundColor Green

    Start-CopyAzureStorageBlob `
        -Context $defaultStorageAccountContext `
        -SrcContainer $defaultBlobContainerName `
        -SrcBlob "example/data/sample.log"  `
        -DestContext $defaultStorageAccountContext `
        -DestContainer $defaultBlobContainerName `
        -destBlob "$destFolder/data/sample.log" 

    #validate the copy
    Get-AzStorageBlob `
        -Context $defaultStorageAccountContext `
        -Container $defaultBlobContainerName `
        -Blob $destFolder/data/sample.log

    #endregion

    #region - submit Oozie job

    $storageUri="wasb://$defaultBlobContainerName@$defaultStorageAccountName.blob.core.chinacloudapi.cn"

    $oozieJobName = $namePrefix + "OozieJob"

    #Oozie WF variables
    $oozieWFPath="$storageUri/tutorials/useoozie"  # The default workflow file name is workflow.xml, so you don't need to specify the file name.
    $waitTimeBetweenOozieJobStatusCheck=10

    #Hive action variables
    $hiveScript = "$storageUri/tutorials/useoozie/useooziewf.hql"
    $hiveTableName = "log4jlogs"
    $hiveDataFolder = "$storageUri/tutorials/useoozie/data"
    $hiveOutputFolder = "$storageUri/tutorials/useoozie/output"

    #Sqoop action variables
    $sqlDatabaseConnectionString = "jdbc:sqlserver://$sqlDatabaseServerName.database.chinacloudapi.cn;user=$sqlDatabaseLogin@$sqlDatabaseServerName;password=$sqlDatabasePassword;database=$sqlDatabaseName"

    $OoziePayload =  @"
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>

    <property>
        <name>nameNode</name>
        <value>$storageUri</value>
    </property>

    <property>
        <name>jobTracker</name>
        <value>jobtrackerhost:9010</value>
    </property>

    <property>
        <name>queueName</name>
        <value>default</value>
    </property>

    <property>
        <name>oozie.use.system.libpath</name>
        <value>true</value>
    </property>

    <property>
        <name>hiveScript</name>
        <value>$hiveScript</value>
    </property>

    <property>
        <name>hiveTableName</name>
        <value>$hiveTableName</value>
    </property>

    <property>
        <name>hiveDataFolder</name>
        <value>$hiveDataFolder</value>
    </property>

    <property>
        <name>hiveOutputFolder</name>
        <value>$hiveOutputFolder</value>
    </property>

    <property>
        <name>sqlDatabaseConnectionString</name>
        <value>&quot;$sqlDatabaseConnectionString&quot;</value>
    </property>

    <property>
        <name>sqlDatabaseTableName</name>
        <value>$sqlDatabaseTableName</value>
    </property>

    <property>
        <name>user.name</name>
        <value>$httpUserName</value>
    </property>

    <property>
        <name>oozie.wf.application.path</name>
        <value>$oozieWFPath</value>
    </property>

    </configuration>
    "@

    Write-Host "Checking Oozie server status..." -ForegroundColor Green
    $clusterUriStatus = "https://$hdinsightClusterName.azurehdinsight.cn:443/oozie/v2/admin/status"
    $response = Invoke-RestMethod -Method Get -Uri $clusterUriStatus -Credential $httpCredential -OutVariable $OozieServerStatus

    $jsonResponse = ConvertFrom-Json (ConvertTo-Json -InputObject $response)
    $oozieServerStatus = $jsonResponse[0].("systemMode")
    Write-Host "Oozie server status is $oozieServerStatus."

    # create Oozie job
    Write-Host "Sending the following Payload to the cluster:" -ForegroundColor Green
    Write-Host "`n--------`n$OoziePayload`n--------"
    $clusterUriCreateJob = "https://$hdinsightClusterName.azurehdinsight.cn:443/oozie/v2/jobs"
    $response = Invoke-RestMethod -Method Post -Uri $clusterUriCreateJob -Credential $httpCredential -Body $OoziePayload -ContentType "application/xml" -OutVariable $OozieJobName #-debug

    $jsonResponse = ConvertFrom-Json (ConvertTo-Json -InputObject $response)
    $oozieJobId = $jsonResponse[0].("id")
    Write-Host "Oozie job id is $oozieJobId..."

    # start Oozie job
    Write-Host "Starting the Oozie job $oozieJobId..." -ForegroundColor Green
    $clusterUriStartJob = "https://$hdinsightClusterName.azurehdinsight.cn:443/oozie/v2/job/" + $oozieJobId + "?action=start"
    $response = Invoke-RestMethod -Method Put -Uri $clusterUriStartJob -Credential $httpCredential | Format-Table -HideTableHeaders #-debug

    # get job status
    Write-Host "Sleeping for $waitTimeBetweenOozieJobStatusCheck seconds until the job metadata is populated in the Oozie metastore..." -ForegroundColor Green
    Start-Sleep -Seconds $waitTimeBetweenOozieJobStatusCheck

    Write-Host "Getting job status and waiting for the job to complete..." -ForegroundColor Green
    $clusterUriGetJobStatus = "https://$hdinsightClusterName.azurehdinsight.cn:443/oozie/v2/job/" + $oozieJobId + "?show=info"
    $response = Invoke-RestMethod -Method Get -Uri $clusterUriGetJobStatus -Credential $httpCredential
    $jsonResponse = ConvertFrom-Json (ConvertTo-Json -InputObject $response)
    $JobStatus = $jsonResponse[0].("status")

    while($JobStatus -notmatch "SUCCEEDED|KILLED")
    {
        Write-Host "$(Get-Date -format 'G'): $oozieJobId is in $JobStatus state...waiting $waitTimeBetweenOozieJobStatusCheck seconds for the job to complete..."
        Start-Sleep -Seconds $waitTimeBetweenOozieJobStatusCheck
        $response = Invoke-RestMethod -Method Get -Uri $clusterUriGetJobStatus -Credential $httpCredential
        $jsonResponse = ConvertFrom-Json (ConvertTo-Json -InputObject $response)
        $JobStatus = $jsonResponse[0].("status")
        $jobStatus
    }

    Write-Host "$(Get-Date -format 'G'): $oozieJobId is in $JobStatus state!" -ForegroundColor Green

    #endregion

Re-run the tutorial

To re-run the workflow, you must delete the following items:

  • The Hive script output file
  • The data in the log4jLogsCount table

Here is a sample PowerShell script that you can use:

    $resourceGroupName = "<AzureResourceGroupName>"

    $defaultStorageAccountName = "<AzureStorageAccountName>"
    $defaultBlobContainerName = "<ContainerName>"

    #SQL database variables
    $sqlDatabaseServerName = "<SQLDatabaseServerName>"
    $sqlDatabaseLogin = "<SQLDatabaseLoginName>"
    $sqlDatabasePassword = "<SQLDatabaseLoginPassword>"
    $sqlDatabaseName = "<SQLDatabaseName>"
    $sqlDatabaseTableName = "log4jLogsCount"

    Write-host "Delete the Hive script output file ..." -ForegroundColor Green
    $defaultStorageAccountKey = (Get-AzStorageAccountKey `
                                -ResourceGroupName $resourceGroupName `
                                -Name $defaultStorageAccountName)[0].Value
    $destContext = New-AzStorageContext -StorageAccountName $defaultStorageAccountName -StorageAccountKey $defaultStorageAccountKey
    Remove-AzStorageBlob -Context $destContext -Blob "tutorials/useoozie/output/000000_0" -Container $defaultBlobContainerName

    Write-host "Delete all the records from the log4jLogsCount table ..." -ForegroundColor Green
    $conn = New-Object System.Data.SqlClient.SqlConnection
    $conn.ConnectionString = "Data Source=$sqlDatabaseServerName.database.chinacloudapi.cn;Initial Catalog=$sqlDatabaseName;User ID=$sqlDatabaseLogin;Password=$sqlDatabasePassword;Encrypt=true;Trusted_Connection=false;"
    $conn.open()
    $cmd = New-Object System.Data.SqlClient.SqlCommand
    $cmd.connection = $conn
    $cmd.commandtext = "delete from $sqlDatabaseTableName"
    $cmd.executenonquery()

    $conn.close()

Next steps

In this tutorial, you learned how to define an Apache Oozie workflow and how to run an Oozie job by using PowerShell. To learn more, see the following articles: