Run Apache Oozie in Azure HDInsight clusters with Enterprise Security Package

Apache Oozie is a workflow and coordination system that manages Apache Hadoop jobs. Oozie is integrated with the Hadoop stack, and it supports the following jobs:

  • Apache MapReduce
  • Apache Pig
  • Apache Hive
  • Apache Sqoop

You can also use Oozie to schedule jobs that are specific to a system, like Java programs or shell scripts.

Prerequisite

An Azure HDInsight Hadoop cluster with Enterprise Security Package (ESP). See Configure HDInsight clusters with ESP.

Note

For detailed instructions on how to use Oozie on non-ESP clusters, see Use Apache Oozie workflows in Linux-based Azure HDInsight.

Connect to an ESP cluster

For more information on Secure Shell (SSH), see Connect to HDInsight (Hadoop) using SSH.

  1. Connect to the HDInsight cluster by using SSH:

    ssh [DomainUserName]@<clustername>-ssh.azurehdinsight.cn
    
  2. To verify successful Kerberos authentication, use the klist command. If authentication hasn't succeeded, use kinit to start Kerberos authentication.
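
    For example, you can check for a valid ticket with klist and, if none is present, request one with kinit; the principal shown is a placeholder for your domain account:

    # Show any existing Kerberos tickets for the current session
    klist
    # Request a ticket for the domain user (placeholder principal)
    kinit [DomainUserName@Domain.com]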

  3. Sign in to the HDInsight gateway to register the OAuth token required to access Azure Data Lake Storage:

    curl -I -u [DomainUserName@Domain.com]:[DomainUserPassword] https://<clustername>.azurehdinsight.cn
    

    A status response code of 200 OK indicates successful registration. If you receive an unauthorized response, such as 401, check the username and password.

Define the workflow

Oozie workflow definitions are written in Apache Hadoop Process Definition Language (hPDL), an XML process definition language. Take the following steps to define the workflow:

  1. Set up a domain user's workspace:

    hdfs dfs -mkdir /user/<DomainUser>
    cd /home/<DomainUserPath>
    cp /usr/hdp/<ClusterVersion>/oozie/doc/oozie-examples.tar.gz .
    tar -xvf oozie-examples.tar.gz
    hdfs dfs -put examples /user/<DomainUser>/
    

    Replace DomainUser with the domain user name. Replace DomainUserPath with the home directory path for the domain user. Replace ClusterVersion with your cluster data platform version.

  2. Use the following statement to create and edit a new file:

    nano workflow.xml
    
  3. After the nano editor opens, enter the following XML as the file contents:

    <?xml version="1.0" encoding="UTF-8"?>
    <workflow-app xmlns="uri:oozie:workflow:0.4" name="map-reduce-wf">
       <credentials>
          <credential name="metastore_token" type="hcat">
             <property>
                <name>hcat.metastore.uri</name>
                <value>thrift://<active-headnode-name>-<clustername>.<Domain>.com:9083</value>
             </property>
             <property>
                <name>hcat.metastore.principal</name>
                <value>hive/_HOST@<Domain>.COM</value>
             </property>
          </credential>
          <credential name="hs2-creds" type="hive2">
             <property>
                <name>hive2.server.principal</name>
                <value>${jdbcPrincipal}</value>
             </property>
             <property>
                <name>hive2.jdbc.url</name>
                <value>${jdbcURL}</value>
             </property>
          </credential>
       </credentials>
       <start to="mr-test" />
       <action name="mr-test">
          <map-reduce>
             <job-tracker>${jobTracker}</job-tracker>
             <name-node>${nameNode}</name-node>
             <prepare>
                <delete path="${nameNode}/user/${wf:user()}/examples/output-data/mrresult" />
             </prepare>
             <configuration>
                <property>
                   <name>mapred.job.queue.name</name>
                   <value>${queueName}</value>
                </property>
                <property>
                   <name>mapred.mapper.class</name>
                   <value>org.apache.oozie.example.SampleMapper</value>
                </property>
                <property>
                   <name>mapred.reducer.class</name>
                   <value>org.apache.oozie.example.SampleReducer</value>
                </property>
                <property>
                   <name>mapred.map.tasks</name>
                   <value>1</value>
                </property>
                <property>
                   <name>mapred.input.dir</name>
                   <value>/user/${wf:user()}/${examplesRoot}/input-data/text</value>
                </property>
                <property>
                   <name>mapred.output.dir</name>
                   <value>/user/${wf:user()}/${examplesRoot}/output-data/mrresult</value>
                </property>
             </configuration>
          </map-reduce>
          <ok to="myHive2" />
          <error to="fail" />
       </action>
       <action name="myHive2" cred="hs2-creds">
          <hive2 xmlns="uri:oozie:hive2-action:0.2">
             <job-tracker>${jobTracker}</job-tracker>
             <name-node>${nameNode}</name-node>
             <jdbc-url>${jdbcURL}</jdbc-url>
             <script>${hiveScript2}</script>
             <param>hiveOutputDirectory2=${hiveOutputDirectory2}</param>
          </hive2>
          <ok to="myHive" />
          <error to="fail" />
       </action>
       <action name="myHive" cred="metastore_token">
          <hive xmlns="uri:oozie:hive-action:0.2">
             <job-tracker>${jobTracker}</job-tracker>
             <name-node>${nameNode}</name-node>
             <configuration>
                <property>
                   <name>mapred.job.queue.name</name>
                   <value>${queueName}</value>
                </property>
             </configuration>
             <script>${hiveScript1}</script>
             <param>hiveOutputDirectory1=${hiveOutputDirectory1}</param>
          </hive>
          <ok to="end" />
          <error to="fail" />
       </action>
       <kill name="fail">
          <message>Oozie job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
       </kill>
       <end name="end" />
    </workflow-app>
    
  4. Replace clustername with the name of the cluster.

  5. To save the file, select Ctrl+X, enter Y, and then select Enter.

    The workflow is divided into two parts:

    • Credential. This section takes in the credentials that are used for authenticating Oozie actions:

      This example uses authentication for Hive actions. To learn more, see Action Authentication.

      The credential service allows Oozie actions to impersonate the user for accessing Hadoop services.

    • Action. This section has three actions: map-reduce, Hive server 2, and Hive server 1:

      • The map-reduce action runs an example from an Oozie package for map-reduce that outputs the aggregated word count.

      • The Hive server 2 and Hive server 1 actions run a query on a sample Hive table provided with HDInsight.

      The Hive actions use the credentials defined in the credentials section for authentication by using the keyword cred in the action element.

  6. Use the following command to copy the workflow.xml file to /user/<domainuser>/examples/apps/map-reduce/workflow.xml:

    hdfs dfs -put workflow.xml /user/<domainuser>/examples/apps/map-reduce/workflow.xml
    
  7. Replace domainuser with your username for the domain.

Define the properties file for the Oozie job

  1. Use the following statement to create and edit a new file for job properties:

    nano job.properties
    
  2. After the nano editor opens, use the following content for the file:

    nameNode=adl://home
    jobTracker=headnodehost:8050
    queueName=default
    examplesRoot=examples
    oozie.wf.application.path=${nameNode}/user/[domainuser]/examples/apps/map-reduce/workflow.xml
    hiveScript1=${nameNode}/user/${user.name}/countrowshive1.hql
    hiveScript2=${nameNode}/user/${user.name}/countrowshive2.hql
    oozie.use.system.libpath=true
    user.name=[domainuser]
    jdbcPrincipal=hive/<active-headnode-name>.<Domain>.com@<Domain>.COM
    jdbcURL=[jdbcurlvalue]
    hiveOutputDirectory1=${nameNode}/user/${user.name}/hiveresult1
    hiveOutputDirectory2=${nameNode}/user/${user.name}/hiveresult2
    
    • If you're using Azure Blob Storage, change adl://home to wasb://home. If you're using Azure Data Lake Storage Gen2, change it to abfs://home.
    • Replace domainuser with your username for the domain.
    • Replace ClusterShortName with the short name for the cluster. For example, if the cluster name is sechadoopcontoso.azurehdinsight.cn, the clustershortname is the first six characters of the cluster name: sechad.
    • Replace jdbcurlvalue with the JDBC URL from the Hive configuration. An example is jdbc:hive2://headnodehost:10001/;transportMode=http.
    • To save the file, select Ctrl+X, enter Y, and then select Enter.

    This properties file needs to be present locally when you run Oozie jobs.

Create custom Hive scripts for Oozie jobs

You can create the two Hive scripts for Hive server 1 and Hive server 2 as shown in the following sections.

Hive server 1 file

  1. Create and edit a file for Hive server 1 actions:

    nano countrowshive1.hql
    
  2. Create the script:

    INSERT OVERWRITE DIRECTORY '${hiveOutputDirectory1}'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    select devicemake from hivesampletable limit 2;
    
  3. Save the file to the Apache Hadoop Distributed File System (HDFS):

    hdfs dfs -put countrowshive1.hql countrowshive1.hql
    

Hive server 2 file

  1. Create and edit a file for Hive server 2 actions:

    nano countrowshive2.hql
    
  2. Create the script:

    INSERT OVERWRITE DIRECTORY '${hiveOutputDirectory2}'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    select devicemodel from hivesampletable limit 2;
    
  3. Save the file to HDFS:

    hdfs dfs -put countrowshive2.hql countrowshive2.hql
    

Submit Oozie jobs

Submitting Oozie jobs for ESP clusters is like submitting Oozie jobs in non-ESP clusters.

For more information, see Use Apache Oozie with Apache Hadoop to define and run a workflow on Linux-based Azure HDInsight.
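
As a minimal sketch, submitting and then checking the workflow from an SSH session on the cluster might look like the following; the Oozie server URL assumes the default endpoint on port 11000, and <jobid> is the ID returned by the submit command:

    # Submit the workflow defined by the local job.properties file (default Oozie endpoint assumed)
    oozie job -oozie http://localhost:11000/oozie -config job.properties -run

    # Check the status of the submitted workflow by using the job ID returned by the previous command
    oozie job -oozie http://localhost:11000/oozie -info <jobid>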

Results from an Oozie job submission

Oozie jobs are run for the user, so both Apache Hadoop YARN and Apache Ranger audit logs show the jobs being run as the impersonated user. The command-line interface output of an Oozie job looks like the following code:

Job ID : 0000015-180626011240801-oozie-oozi-W
------------------------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : adl://home/user/alicetest/examples/apps/map-reduce/wf.xml
Status        : SUCCEEDED
Run           : 0
User          : alicetest
Group         : -
Created       : 2018-06-26 19:25 GMT
Started       : 2018-06-26 19:25 GMT
Last Modified : 2018-06-26 19:30 GMT
Ended         : 2018-06-26 19:30 GMT
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------
ID                      Status  Ext ID          ExtStatus   ErrCode
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@:start:    OK  -           OK      -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@mr-test    OK  job_1529975666160_0051  SUCCEEDED   -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@myHive2    OK  job_1529975666160_0053  SUCCEEDED   -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@myHive OK  job_1529975666160_0055  SUCCEEDED   -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@end    OK  -           OK      -
-----------------------------------------------------------------------------------------------

The Ranger audit logs for Hive server 2 actions show Oozie running the action for the user. The Ranger and YARN views are visible only to the cluster admin.
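
To confirm that the Hive actions wrote their results, you can also list and read the output directories defined in job.properties; the paths below assume the hiveresult1 and hiveresult2 values shown earlier, with domainuser as a placeholder:

    # List one of the Hive output directories and print the contents of both
    hdfs dfs -ls /user/<domainuser>/hiveresult1
    hdfs dfs -cat /user/<domainuser>/hiveresult1/*
    hdfs dfs -cat /user/<domainuser>/hiveresult2/*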

Configure user authorization in Oozie

Oozie by itself has a user authorization configuration that can block users from stopping or deleting other users' jobs. To enable this configuration, set oozie.service.AuthorizationService.security.enabled to true.

For more information, see Apache Oozie Installation and Configuration.
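
As a quick check, you can inspect the current value on the Oozie server node; the configuration path below assumes a typical HDP layout and might differ on your cluster:

    # Print the authorization setting from the active Oozie configuration (path is an assumption)
    grep -A 1 "oozie.service.AuthorizationService.security.enabled" /etc/oozie/conf/oozie-site.xml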

For components like Hive server 1 where the Ranger plug-in isn't available or supported, only coarse-grained HDFS authorization is possible. Fine-grained authorization is available only through Ranger plug-ins.

Get the Oozie web UI

The Oozie web UI provides a web-based view of the status of Oozie jobs on the cluster. To get to the web UI, take the following steps on ESP clusters:

  1. Add an edge node and enable SSH Kerberos authentication.

  2. Follow the Oozie web UI steps to enable SSH tunneling to the edge node and access the web UI, as sketched below.
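
A minimal tunneling sketch, assuming the edge node's SSH endpoint (shown as a placeholder) and a SOCKS proxy on local port 9876; configure your browser to use that proxy and then browse to the Oozie web UI on the cluster:

    # Open a dynamic (SOCKS) tunnel on local port 9876 through the edge node; the endpoint is a placeholder
    ssh -C -N -D 9876 [DomainUserName]@<edge-node-ssh-endpoint>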

Next steps