Use Apache Oozie with Apache Hadoop to define and run a workflow on Linux-based Azure HDInsight

Learn how to use Apache Oozie with Apache Hadoop on Azure HDInsight. Oozie is a workflow and coordination system that manages Hadoop jobs. Oozie is integrated with the Hadoop stack, and it supports the following jobs:

  • Apache Hadoop MapReduce
  • Apache Pig
  • Apache Hive
  • Apache Sqoop

You can also use Oozie to schedule jobs that are specific to a system, like Java programs or shell scripts.

Prerequisites

Example workflow

The workflow used in this document contains two actions. Actions are definitions for tasks, such as running Hive, Sqoop, MapReduce, or other processes:

Workflow diagram

  1. A Hive action runs a HiveQL script to extract records from the hivesampletable that's included with HDInsight. Each row of data describes a visit from a specific mobile device. The record format appears like the following text:

     8       18:54:20        en-US   Android Samsung SCH-i500        California     United States    13.9204007      0       0
     23      19:19:44        en-US   Android HTC     Incredible      Pennsylvania   United States    NULL    0       0
     23      19:19:46        en-US   Android HTC     Incredible      Pennsylvania   United States    1.4757422       0       1
    

    The Hive script used in this document counts the total visits for each platform, such as Android or iPhone, and stores the counts in a new Hive table.

    For more information about Hive, see Use Apache Hive with HDInsight.

  2. A Sqoop action exports the contents of the new Hive table to a table created in Azure SQL Database. For more information about Sqoop, see Use Apache Sqoop with HDInsight.

Note

For supported Oozie versions on HDInsight clusters, see What's new in the Hadoop cluster versions provided by HDInsight.

Create the working directory

Oozie expects you to store all the resources required for a job in the same directory. This example uses wasbs:///tutorials/useoozie. To create this directory, complete the following steps:

  1. Edit the code below to replace sshuser with the SSH user name for the cluster, and replace clustername with the name of the cluster. Then enter the code to connect to the HDInsight cluster by using SSH.

    ssh sshuser@clustername-ssh.azurehdinsight.cn
    
  2. To create the directory, use the following command:

    hdfs dfs -mkdir -p /tutorials/useoozie/data
    

    Note

    The -p parameter causes the creation of all directories in the path. The data directory is used to hold the data used by the useooziewf.hql script.

  3. Edit the code below to replace sshuser with your SSH user name. To make sure that Oozie can impersonate your user account, use the following command:

    sudo adduser sshuser users
    

    Note

    You can ignore errors that indicate the user is already a member of the users group.

Add a database driver

Because this workflow uses Sqoop to export data to the SQL database, you must provide a copy of the JDBC driver used to interact with the SQL database. To copy the JDBC driver to the working directory, use the following command from the SSH session:

hdfs dfs -put /usr/share/java/sqljdbc_7.0/enu/mssql-jdbc*.jar /tutorials/useoozie/

Important

Verify the actual path of the JDBC driver under /usr/share/java/; the version in the file name may differ.

If your workflow uses other resources, such as a jar that contains a MapReduce application, you need to add those resources as well.

Define the Hive query

Use the following steps to create a Hive query language (HiveQL) script that defines a query. You will use the query in an Oozie workflow later in this document.

  1. From the SSH connection, use the following command to create a file named useooziewf.hql:

    nano useooziewf.hql
    
  2. After the GNU nano editor opens, use the following query as the contents of the file:

    DROP TABLE ${hiveTableName};
    CREATE EXTERNAL TABLE ${hiveTableName}(deviceplatform string, count string) ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '${hiveDataFolder}';
    INSERT OVERWRITE TABLE ${hiveTableName} SELECT deviceplatform, COUNT(*) as count FROM hivesampletable GROUP BY deviceplatform;
    

    Two variables are used in the script:

    • ${hiveTableName}: Contains the name of the table to be created.

    • ${hiveDataFolder}: Contains the location to store the data files for the table.

      The workflow definition file, workflow.xml in this article, passes these values to this HiveQL script at runtime.

  3. To save the file, select Ctrl+X, enter Y, and then select Enter.

  4. Use the following command to copy useooziewf.hql to wasbs:///tutorials/useoozie/useooziewf.hql:

    hdfs dfs -put useooziewf.hql /tutorials/useoozie/useooziewf.hql
    

    This command stores the useooziewf.hql file in the HDFS-compatible storage for the cluster.
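The ${hiveTableName} and ${hiveDataFolder} placeholders in the script are filled in by Oozie at runtime from the <param> values in the workflow definition. The substitution behaves much like the following Python sketch, which is a simplified illustration of the idea, not Oozie's actual implementation:

```python
from string import Template

# A fragment of the HiveQL script with Oozie-style ${...} placeholders.
script = Template(
    "DROP TABLE ${hiveTableName};\n"
    "CREATE EXTERNAL TABLE ${hiveTableName}(deviceplatform string, count string)\n"
    "STORED AS TEXTFILE LOCATION '${hiveDataFolder}';"
)

# Values the workflow's <param> elements would supply at runtime.
params = {
    "hiveTableName": "mobilecount",
    "hiveDataFolder": "wasb://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn/tutorials/useoozie/data",
}

rendered = script.substitute(params)
print(rendered)
```

Every placeholder must be supplied; a missing parameter fails the substitution, which mirrors how an Oozie Hive action fails when a required <param> is absent.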

Define the workflow

Oozie workflow definitions are written in Hadoop Process Definition Language (hPDL), which is an XML process definition language. Use the following steps to define the workflow:

  1. Use the following statement to create and edit a new file:

    nano workflow.xml
    
  2. After the nano editor opens, enter the following XML as the file contents:

    <workflow-app name="useooziewf" xmlns="uri:oozie:workflow:0.2">
        <start to = "RunHiveScript"/>
        <action name="RunHiveScript">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
            </configuration>
            <script>${hiveScript}</script>
            <param>hiveTableName=${hiveTableName}</param>
            <param>hiveDataFolder=${hiveDataFolder}</param>
        </hive>
        <ok to="RunSqoopExport"/>
        <error to="fail"/>
        </action>
        <action name="RunSqoopExport">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
            <property>
                <name>mapred.compress.map.output</name>
                <value>true</value>
            </property>
            </configuration>
            <arg>export</arg>
            <arg>--connect</arg>
            <arg>${sqlDatabaseConnectionString}</arg>
            <arg>--table</arg>
            <arg>${sqlDatabaseTableName}</arg>
            <arg>--export-dir</arg>
            <arg>${hiveDataFolder}</arg>
            <arg>-m</arg>
            <arg>1</arg>
            <arg>--input-fields-terminated-by</arg>
            <arg>"\t"</arg>
            <archive>mssql-jdbc-7.0.0.jre8.jar</archive>
            </sqoop>
        <ok to="end"/>
        <error to="fail"/>
        </action>
        <kill name="fail">
        <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}] </message>
        </kill>
        <end name="end"/>
    </workflow-app>
    

    There are two actions defined in the workflow:

    • RunHiveScript: This action is the start action and runs the useooziewf.hql Hive script.

    • RunSqoopExport: This action uses Sqoop to export the data created by the Hive script to a SQL database. This action only runs if the RunHiveScript action is successful.

      The workflow has several entries, such as ${jobTracker}. You will replace these entries with the values you use in the job definition. You will create the job definition later in this document.

      Also note the <archive>mssql-jdbc-7.0.0.jre8.jar</archive> entry in the Sqoop section. This entry instructs Oozie to make this archive available for Sqoop when this action runs.

  3. To save the file, select Ctrl+X, enter Y, and then select Enter.

  4. Use the following command to copy the workflow.xml file to /tutorials/useoozie/workflow.xml:

    hdfs dfs -put workflow.xml /tutorials/useoozie/workflow.xml
    

Create a table

Note

There are many ways to connect to SQL Database to create a table. The following steps use FreeTDS from the HDInsight cluster.

  1. Use the following command to install FreeTDS on the HDInsight cluster:

    sudo apt-get --assume-yes install freetds-dev freetds-bin
    
  2. Edit the code below to replace <serverName> with your Azure SQL server name, and <sqlLogin> with the Azure SQL server login. Enter the command to connect to the prerequisite SQL database. Enter the password at the prompt.

    TDSVER=8.0 tsql -H <serverName>.database.chinacloudapi.cn -U <sqlLogin> -p 1433 -D oozietest
    

    You receive output like the following text:

     locale is "en_US.UTF-8"
     locale charset is "UTF-8"
     using default charset "UTF-8"
     Default database being set to oozietest
     1>
    
  3. At the 1> prompt, enter the following lines:

    CREATE TABLE [dbo].[mobiledata](
    [deviceplatform] [nvarchar](50),
    [count] [bigint])
    GO
    CREATE CLUSTERED INDEX mobiledata_clustered_index on mobiledata(deviceplatform)
    GO
    

    When the GO statement is entered, the previous statements are evaluated. These statements create a table, named mobiledata, that's used by the workflow.

    To verify that the table has been created, use the following commands:

    SELECT * FROM information_schema.tables
    GO
    

    You see output like the following text:

    TABLE_CATALOG   TABLE_SCHEMA    TABLE_NAME      TABLE_TYPE
    oozietest       dbo     mobiledata      BASE TABLE
    
  4. To exit the tsql utility, enter exit at the 1> prompt.

Create the job definition

The job definition describes where to find the workflow.xml file. It also describes where to find other files used by the workflow, such as useooziewf.hql. In addition, it defines the values for properties used within the workflow and the associated files.

  1. To get the full address of the default storage, use the following command. This address is used in the configuration file you create in the next step.

    sed -n '/<name>fs.default/,/<\/value>/p' /etc/hadoop/conf/core-site.xml
    

    This command returns information like the following XML:

    <name>fs.defaultFS</name>
    <value>wasb://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn</value>
    

    Note

    If the HDInsight cluster uses Azure Storage as the default storage, the <value> element contents begin with wasbs://. If Azure Data Lake Storage Gen2 is used, it begins with abfs://.

    Save the contents of the <value> element, as it's used in the next steps.

  2. Edit the XML below as follows:

    Placeholder value                                                Replaced value
    wasbs://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn  Value received from step 1.
    admin                                                            Your login name for the HDInsight cluster, if not admin.
    serverName                                                       Azure SQL database server name.
    sqlLogin                                                         Azure SQL database server login.
    sqlPassword                                                      Azure SQL database server login password.
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
    
        <property>
        <name>nameNode</name>
        <value>wasb://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn</value>
        </property>
    
        <property>
        <name>jobTracker</name>
        <value>headnodehost:8050</value>
        </property>
    
        <property>
        <name>queueName</name>
        <value>default</value>
        </property>
    
        <property>
        <name>oozie.use.system.libpath</name>
        <value>true</value>
        </property>
    
        <property>
        <name>hiveScript</name>
        <value>wasb://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn/tutorials/useoozie/useooziewf.hql</value>
        </property>
    
        <property>
        <name>hiveTableName</name>
        <value>mobilecount</value>
        </property>
    
        <property>
        <name>hiveDataFolder</name>
        <value>wasb://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn/tutorials/useoozie/data</value>
        </property>
    
        <property>
        <name>sqlDatabaseConnectionString</name>
        <value>"jdbc:sqlserver://serverName.database.chinacloudapi.cn;user=sqlLogin;password=sqlPassword;database=oozietest"</value>
        </property>
    
        <property>
        <name>sqlDatabaseTableName</name>
        <value>mobiledata</value>
        </property>
    
        <property>
        <name>user.name</name>
        <value>admin</value>
        </property>
    
        <property>
        <name>oozie.wf.application.path</name>
        <value>wasb://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn/tutorials/useoozie</value>
        </property>
    </configuration>
    

    Most of the information in this file is used to populate the values used in the workflow.xml or ooziewf.hql files, such as ${nameNode}. If the path is a wasbs path, you must use the full path. Do not shorten it to just wasbs:///. The oozie.wf.application.path entry defines where to find the workflow.xml file. This file contains the workflow that is run by this job.

  3. To create the Oozie job definition configuration, use the following command:

    nano job.xml
    
  4. After the nano editor opens, paste the edited XML as the contents of the file.

  5. To save the file, select Ctrl+X, enter Y, and then select Enter.
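As an alternative to the sed command in step 1, the default storage address can be pulled out of core-site.xml with any XML parser. The following is a hedged Python sketch; the embedded sample XML stands in for /etc/hadoop/conf/core-site.xml, and the default_fs helper name is illustrative:

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for /etc/hadoop/conf/core-site.xml on the cluster.
core_site = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>wasb://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn</value>
  </property>
</configuration>"""

def default_fs(xml_text):
    """Return the value of the fs.defaultFS property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    return None

print(default_fs(core_site))
```

On a cluster you would read the real file with `ET.parse("/etc/hadoop/conf/core-site.xml").getroot()` instead of the sample string.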

Submit and manage the job

The following steps use the Oozie command to submit and manage Oozie workflows on the cluster. The Oozie command is a friendly interface over the Oozie REST API.

Important

When you use the Oozie command, you must use the FQDN of the HDInsight head node. This FQDN is only accessible from the cluster or, if the cluster is on an Azure virtual network, from other machines on the same network.

  1. To obtain the URL of the Oozie service, use the following command:

    sed -n '/<name>oozie.base.url/,/<\/value>/p' /etc/oozie/conf/oozie-site.xml
    

    This command returns information like the following XML:

    <name>oozie.base.url</name>
    <value>http://ACTIVE-HEADNODE-NAME.UNIQUEID.cx.internal.chinacloudapp.cn:11000/oozie</value>
    

    The http://ACTIVE-HEADNODE-NAME.UNIQUEID.cx.internal.chinacloudapp.cn:11000/oozie portion is the URL to use with the Oozie command.

  2. Edit the code to replace the URL with the one you received earlier. To create an environment variable for the URL, use the following command, so you don't have to enter it for every command:

    export OOZIE_URL=http://HOSTNAME:11000/oozie
    
  3. To submit the job, use the following command:

    oozie job -config job.xml -submit
    

    This command loads the job information from job.xml and submits it to Oozie, but doesn't run the job.

    After the command finishes, it should return the ID of the job, for example, 0000005-150622124850154-oozie-oozi-W. This ID is used to manage the job.

  4. Edit the code below to replace <JOBID> with the ID returned in the previous step. To view the status of the job, use the following command:

    oozie job -info <JOBID>
    

    This command returns information like the following text:

    Job ID : 0000005-150622124850154-oozie-oozi-W
    ------------------------------------------------------------------------------------------------------------------------------------
    Workflow Name : useooziewf
    App Path      : wasb:///tutorials/useoozie
    Status        : PREP
    Run           : 0
    User          : USERNAME
    Group         : -
    Created       : 2015-06-22 15:06 GMT
    Started       : -
    Last Modified : 2015-06-22 15:06 GMT
    Ended         : -
    CoordAction ID: -
    ------------------------------------------------------------------------------------------------------------------------------------
    

    This job has a status of PREP. This status indicates that the job was created but not started.

  5. Edit the code below to replace <JOBID> with the ID returned previously. To start the job, use the following command:

    oozie job -start <JOBID>
    

    If you check the status after this command, the job is in a running state, and information is returned for the actions within the job. The job takes a few minutes to complete.

  6. Edit the code below to replace <serverName> with your Azure SQL server name, and <sqlLogin> with the Azure SQL server login. After the task finishes successfully, you can verify that the data was generated and exported to the SQL database table by using the following command. Enter the password at the prompt.

    TDSVER=8.0 tsql -H <serverName>.database.chinacloudapi.cn -U <sqlLogin> -p 1433 -D oozietest
    

    At the 1> prompt, enter the following query:

    SELECT * FROM mobiledata
    GO
    

    The information returned is like the following text:

     deviceplatform  count
     Android 31591
     iPhone OS       22731
     proprietary development 3
     RIM OS  3464
     Unknown 213
     Windows Phone   1791
     (6 rows affected)
    

For more information on the Oozie command, see Apache Oozie command-line tool.

Oozie REST API

With the Oozie REST API, you can build your own tools that work with Oozie. The following is HDInsight-specific information about using the Oozie REST API:

  • URI: You can access the REST API from outside the cluster at https://CLUSTERNAME.azurehdinsight.cn/oozie.

  • Authentication: To authenticate, use the cluster HTTP account (admin) and password with the API. For example:

    curl -u admin:PASSWORD https://CLUSTERNAME.azurehdinsight.cn/oozie/versions
    

For more information on how to use the Oozie REST API, see Apache Oozie Web Services API.
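For example, a tool that polls job status over the REST API only needs the base URL, basic authentication, and a little JSON handling. The sketch below parses the kind of response the jobs endpoint returns; the response body is a trimmed, hypothetical example, so treat the exact fields as assumptions to verify against the Web Services API documentation:

```python
import json

# Trimmed, hypothetical response from GET https://CLUSTERNAME.azurehdinsight.cn/oozie/v2/jobs
response_body = """{
  "total": 2,
  "workflows": [
    {"id": "0000005-150622124850154-oozie-oozi-W", "appName": "useooziewf", "status": "SUCCEEDED"},
    {"id": "0000006-150622124850154-oozie-oozi-W", "appName": "useooziewf", "status": "RUNNING"}
  ]
}"""

def jobs_with_status(body, status):
    """Return the IDs of workflow jobs that have the given status."""
    return [wf["id"] for wf in json.loads(body)["workflows"] if wf["status"] == status]

print(jobs_with_status(response_body, "RUNNING"))
```

In a real tool, the body would come from an HTTP GET against the URI above with the cluster HTTP account supplied as basic authentication, as in the curl example.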

Oozie web UI

The Oozie web UI provides a web-based view into the status of Oozie jobs on the cluster. With the web UI, you can view the following information:

  • Job status
  • Job definition
  • Configuration
  • A graph of the actions in the job
  • Logs for the job

You can also view the details for the actions within a job.

To access the Oozie web UI, complete the following steps:

  1. Create an SSH tunnel to the HDInsight cluster. For more information, see Use SSH Tunneling with HDInsight.

  2. After you create a tunnel, open the Ambari web UI in your web browser by using the URI http://headnodehost:8080.

  3. From the left side of the page, select Oozie > Quick Links > Oozie Web UI.

    Menu image

  4. The Oozie web UI defaults to displaying the running workflow jobs. To see all the workflow jobs, select All Jobs.

    All jobs displayed

  5. To view more information about a job, select the job.

    Job info

  6. From the Job Info tab, you can see the basic job information and the individual actions within the job. You can use the tabs at the top to view the Job Definition and Job Configuration, access the Job Log, or view a directed acyclic graph (DAG) of the job under Job DAG.

    • Job Log: Select the Get Logs button to get all logs for the job, or use the Enter Search Filter field to filter the logs.

      Job log

    • Job DAG: The DAG is a graphical overview of the data paths taken through the workflow.

      Job DAG

  7. If you select one of the actions from the Job Info tab, it brings up information for the action. For example, select the RunSqoopExport action.

    Action info

  8. You can see details for the action, such as a link to the Console URL. Use this link to view job tracker information for the job.

Schedule jobs

You can use the coordinator to specify a start time, an end time, and the occurrence frequency for jobs. To define a schedule for the workflow, complete the following steps:

  1. Use the following command to create a file named coordinator.xml:

    nano coordinator.xml
    

    Use the following XML as the contents of the file:

    <coordinator-app name="my_coord_app" frequency="${coordFrequency}" start="${coordStart}" end="${coordEnd}" timezone="${coordTimezone}" xmlns="uri:oozie:coordinator:0.4">
        <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
        </action>
    </coordinator-app>
    

    Note

    The ${...} variables are replaced by values in the job definition at runtime. The variables are:

    • ${coordFrequency}: The time between running instances of the job.
    • ${coordStart}: The job start time.
    • ${coordEnd}: The job end time.
    • ${coordTimezone}: Coordinator jobs are processed in a fixed time zone with no daylight saving time, typically represented by using UTC. This time zone is referred to as the Oozie processing time zone.
    • ${workflowPath}: The path to the workflow.xml file.
  2. To save the file, select Ctrl+X, enter Y, and then select Enter.

  3. To copy the file to the working directory for this job, use the following command:

    hadoop fs -put coordinator.xml /tutorials/useoozie/coordinator.xml
    
  4. To modify the job.xml file you created earlier, use the following command:

    nano job.xml
    

    Make the following changes:

    • To instruct Oozie to run the coordinator file instead of the workflow, change <name>oozie.wf.application.path</name> to <name>oozie.coord.application.path</name>.

    • To set the workflowPath variable used by the coordinator, add the following XML:

      <property>
          <name>workflowPath</name>
          <value>wasbs://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn/tutorials/useoozie</value>
      </property>
      

      Replace the wasbs://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn text with the value used in the other entries in the job.xml file.

    • To define the start, end, and frequency for the coordinator, add the following XML:

      <property>
          <name>coordStart</name>
          <value>2018-05-10T12:00Z</value>
      </property>
      
      <property>
          <name>coordEnd</name>
          <value>2018-05-12T12:00Z</value>
      </property>
      
      <property>
          <name>coordFrequency</name>
          <value>1440</value>
      </property>
      
      <property>
          <name>coordTimezone</name>
          <value>UTC</value>
      </property>
      

      These values set the start time to 12:00 PM on May 10, 2018, and the end time to May 12, 2018. The interval for running this job is set to daily. The frequency is in minutes, so 24 hours x 60 minutes = 1440 minutes. Finally, the time zone is set to UTC.

  5. To save the file, select Ctrl+X, enter Y, and then select Enter.

  6. To submit and start the job, use the following command:

    oozie job -config job.xml -run
    
  7. If you go to the Oozie web UI and select the Coordinator Jobs tab, you see information like in the following image:

    Coordinator Jobs tab

    The Next Materialization entry contains the next time that the job runs.

  8. Like the earlier workflow job, selecting the job entry in the web UI displays information about the job:

    Coordinator job info

    Note

    This image only shows successful runs of the job, not the individual actions within the scheduled workflow. To see the individual actions, select one of the Action entries.

    Coordinator action info
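The coordFrequency value set in step 4 is plain minute arithmetic; the conversion can be sketched as follows (the helper name is illustrative, not part of Oozie):

```python
def frequency_minutes(days=0, hours=0, minutes=0):
    """Convert a human-readable interval into the minute count Oozie expects."""
    return days * 24 * 60 + hours * 60 + minutes

# The daily schedule used in this article: 24 hours x 60 minutes.
print(frequency_minutes(days=1))   # 1440
```

A weekly schedule would be frequency_minutes(days=7), or 10080 minutes.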

Troubleshooting

With the Oozie UI, you can view Oozie logs. The Oozie UI also contains links to the JobTracker logs for the MapReduce tasks that were started by the workflow. The pattern for troubleshooting should be:

  1. View the job in the Oozie web UI.

  2. If there is an error or failure for a specific action, select the action to see if the Error Message field provides more information about the failure.

    1. If available, use the URL from the action to view more details, such as the JobTracker logs, for the action.

The following are specific errors you might encounter and how to resolve them.

JA009: Cannot initialize cluster

Symptoms: The job status changes to SUSPENDED. Details for the job show the RunHiveScript status as START_MANUAL. Selecting the action displays the following error message:

JA009: Cannot initialize Cluster. Please check your configuration for map

Cause: The Azure Blob storage addresses used in the job.xml file don't contain the storage container or storage account name. The Blob storage address format must be wasbs://containername@storageaccountname.blob.core.chinacloudapi.cn.

Resolution: Change the Blob storage addresses that the job uses.
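One way to catch malformed addresses before submitting a job is to check them against the expected container@account pattern. The following Python sketch uses a regex that approximates the format described above; it is an illustration, not an official validator:

```python
import re

# Approximate pattern: wasb or wasbs scheme, container@account, blob endpoint host.
WASB_PATTERN = re.compile(
    r"^wasbs?://[a-z0-9][a-z0-9-]*@[a-z0-9]+\.blob\.core\.(windows\.net|chinacloudapi\.cn)(/.*)?$"
)

def is_valid_wasb(address):
    """Return True if the address includes both container and account names."""
    return WASB_PATTERN.match(address) is not None

print(is_valid_wasb("wasbs://mycontainer@mystorageaccount.blob.core.chinacloudapi.cn"))  # True
print(is_valid_wasb("wasbs:///tutorials/useoozie"))  # False: container and account missing
```

The shortened wasbs:/// form is rejected, which is exactly the shape of address that triggers the JA009 error in job.xml.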

JA002: Oozie is not allowed to impersonate <USER>

Symptoms: The job status changes to SUSPENDED. Details for the job show the RunHiveScript status as START_MANUAL. If you select the action, it shows the following error message:

JA002: User: oozie is not allowed to impersonate <USER>

Cause: The current permission settings don't allow Oozie to impersonate the specified user account.

Resolution: Oozie can impersonate users in the users group. Use the groups USERNAME command to see the groups that the user account is a member of. If the user is not a member of the users group, use the following command to add the user to the group:

sudo adduser USERNAME users

Note

It can take several minutes before HDInsight recognizes that the user has been added to the group.

Launcher ERROR (Sqoop)

Symptoms: The job status changes to KILLED. Details for the job show the RunSqoopExport status as ERROR. If you select the action, it shows the following error message:

Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]

Cause: Sqoop is unable to load the database driver required to access the database.

Resolution: When you use Sqoop from an Oozie job, you must include the database driver with the other resources the job uses, such as the workflow.xml file. Also, reference the archive that contains the database driver from the <sqoop>...</sqoop> section of the workflow.xml file.

For example, for the job in this document, you would use the following steps:

  1. Copy the mssql-jdbc-7.0.0.jre8.jar file to the /tutorials/useoozie directory:

    hdfs dfs -put /usr/share/java/sqljdbc_7.0/enu/mssql-jdbc-7.0.0.jre8.jar /tutorials/useoozie/mssql-jdbc-7.0.0.jre8.jar
    
  2. Modify the workflow.xml file to add the following XML on a new line above </sqoop>:

    <archive>mssql-jdbc-7.0.0.jre8.jar</archive>
    

Next steps

In this article, you learned how to define an Oozie workflow and how to run an Oozie job. To learn more about how to work with HDInsight, see the following articles: