Run Apache Oozie in Azure HDInsight clusters with Enterprise Security Package
Apache Oozie is a workflow and coordination system that manages Apache Hadoop jobs. Oozie is integrated with the Hadoop stack, and it supports the following jobs:
- Apache MapReduce
- Apache Pig
- Apache Hive
- Apache Sqoop
You can also use Oozie to schedule jobs that are specific to a system, like Java programs or shell scripts.
Prerequisite
An Azure HDInsight Hadoop cluster with Enterprise Security Package (ESP). See Configure HDInsight clusters with ESP.
Note
For detailed instructions on how to use Oozie on non-ESP clusters, see Use Apache Oozie workflows in Linux-based Azure HDInsight.
Connect to an ESP cluster
For more information on Secure Shell (SSH), see Connect to HDInsight (Hadoop) using SSH.
Connect to the HDInsight cluster by using SSH:
ssh [DomainUserName]@<clustername>-ssh.azurehdinsight.cn
To verify successful Kerberos authentication, use the klist command. If you don't have a valid ticket, use kinit to start Kerberos authentication (see the example at the end of this section).

Sign in to the HDInsight gateway to register the OAuth token required to access Azure Data Lake Storage:
curl -I -u [DomainUserName@Domain.com]:[DomainUserPassword] https://<clustername>.azurehdinsight.cn
A 200 OK status response code indicates successful registration. If you receive an unauthorized response, such as 401, check the username and password.
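For example, checking for a ticket and then authenticating might look like the following. The principal is a placeholder; substitute your own domain account:

# List current Kerberos tickets; an empty or expired list means you need to authenticate
klist

# Obtain a ticket-granting ticket for your domain account (placeholder principal)
kinit DomainUser@DOMAIN.COM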
Define the workflow
Oozie workflow definitions are written in Apache Hadoop Process Definition Language (hPDL). hPDL is an XML process definition language. Take the following steps to define the workflow:
Set up a domain user's workspace:
hdfs dfs -mkdir /user/<DomainUser>
cd /home/<DomainUserPath>
cp /usr/hdp/<ClusterVersion>/oozie/doc/oozie-examples.tar.gz .
tar -xvf oozie-examples.tar.gz
hdfs dfs -put examples /user/<DomainUser>/
Replace DomainUser with the domain user name. Replace DomainUserPath with the home directory path for the domain user. Replace ClusterVersion with your cluster data platform version.

Use the following statement to create and edit a new file:
nano workflow.xml
After the nano editor opens, enter the following XML as the file contents:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="map-reduce-wf">
   <credentials>
      <credential name="metastore_token" type="hcat">
         <property>
            <name>hcat.metastore.uri</name>
            <value>thrift://<active-headnode-name>-<clustername>.<Domain>.com:9083</value>
         </property>
         <property>
            <name>hcat.metastore.principal</name>
            <value>hive/_HOST@<Domain>.COM</value>
         </property>
      </credential>
      <credential name="hs2-creds" type="hive2">
         <property>
            <name>hive2.server.principal</name>
            <value>${jdbcPrincipal}</value>
         </property>
         <property>
            <name>hive2.jdbc.url</name>
            <value>${jdbcURL}</value>
         </property>
      </credential>
   </credentials>
   <start to="mr-test" />
   <action name="mr-test">
      <map-reduce>
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <prepare>
            <delete path="${nameNode}/user/${wf:user()}/examples/output-data/mrresult" />
         </prepare>
         <configuration>
            <property>
               <name>mapred.job.queue.name</name>
               <value>${queueName}</value>
            </property>
            <property>
               <name>mapred.mapper.class</name>
               <value>org.apache.oozie.example.SampleMapper</value>
            </property>
            <property>
               <name>mapred.reducer.class</name>
               <value>org.apache.oozie.example.SampleReducer</value>
            </property>
            <property>
               <name>mapred.map.tasks</name>
               <value>1</value>
            </property>
            <property>
               <name>mapred.input.dir</name>
               <value>/user/${wf:user()}/${examplesRoot}/input-data/text</value>
            </property>
            <property>
               <name>mapred.output.dir</name>
               <value>/user/${wf:user()}/${examplesRoot}/output-data/mrresult</value>
            </property>
         </configuration>
      </map-reduce>
      <ok to="myHive2" />
      <error to="fail" />
   </action>
   <action name="myHive2" cred="hs2-creds">
      <hive2 xmlns="uri:oozie:hive2-action:0.2">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <jdbc-url>${jdbcURL}</jdbc-url>
         <script>${hiveScript2}</script>
         <param>hiveOutputDirectory2=${hiveOutputDirectory2}</param>
      </hive2>
      <ok to="myHive" />
      <error to="fail" />
   </action>
   <action name="myHive" cred="metastore_token">
      <hive xmlns="uri:oozie:hive-action:0.2">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <configuration>
            <property>
               <name>mapred.job.queue.name</name>
               <value>${queueName}</value>
            </property>
         </configuration>
         <script>${hiveScript1}</script>
         <param>hiveOutputDirectory1=${hiveOutputDirectory1}</param>
      </hive>
      <ok to="end" />
      <error to="fail" />
   </action>
   <kill name="fail">
      <message>Oozie job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
   </kill>
   <end name="end" />
</workflow-app>
Replace clustername with the name of the cluster.

To save the file, select Ctrl+X, enter Y, and then select Enter.
The workflow is divided into two parts:
- Credential. This section takes in the credentials that are used for authenticating Oozie actions:
  - This example uses authentication for Hive actions. To learn more, see Action Authentication.
  - The credential service allows Oozie actions to impersonate the user for accessing Hadoop services.
- Action. This section has three actions: map-reduce, Hive server 2, and Hive server 1:
  - The map-reduce action runs an example from an Oozie package for map-reduce that outputs the aggregated word count.
  - The Hive server 2 and Hive server 1 actions run a query on a sample Hive table provided with HDInsight.
  - The Hive actions use the credentials defined in the credentials section for authentication by using the keyword cred in the action element.
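The relationship is easiest to see in isolation. The following fragment (placeholders only, not a complete workflow) shows a credential definition and an action that references it through the cred attribute:

<credentials>
   <credential name="hs2-creds" type="hive2">
      <!-- hive2.server.principal and hive2.jdbc.url properties go here -->
   </credential>
</credentials>
...
<action name="myHive2" cred="hs2-creds">
   <!-- This action authenticates to Hive server 2 with the hs2-creds credential -->
</action>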
Use the following command to copy the workflow.xml file to /user/<domainuser>/examples/apps/map-reduce/workflow.xml:

hdfs dfs -put workflow.xml /user/<domainuser>/examples/apps/map-reduce/workflow.xml

Replace domainuser with your username for the domain.
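As an optional check, you can list the target directory to confirm that the file is in place. The path assumes the same domainuser placeholder:

hdfs dfs -ls /user/<domainuser>/examples/apps/map-reduce/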
Define the properties file for the Oozie job
Use the following statement to create and edit a new file for job properties:
nano job.properties
After the nano editor opens, use the following text as the contents of the file:
nameNode=adl://home
jobTracker=headnodehost:8050
queueName=default
examplesRoot=examples
oozie.wf.application.path=${nameNode}/user/[domainuser]/examples/apps/map-reduce/workflow.xml
hiveScript1=${nameNode}/user/${user.name}/countrowshive1.hql
hiveScript2=${nameNode}/user/${user.name}/countrowshive2.hql
oozie.use.system.libpath=true
user.name=[domainuser]
jdbcPrincipal=hive/<active-headnode-name>.<Domain>.com@<Domain>.COM
jdbcURL=[jdbcurlvalue]
hiveOutputDirectory1=${nameNode}/user/${user.name}/hiveresult1
hiveOutputDirectory2=${nameNode}/user/${user.name}/hiveresult2
- Use the adl://home URI for the nameNode property if you have Azure Data Lake Storage Gen1 as your primary cluster storage. If you're using Azure Blob Storage, then change to wasb://home. If you're using Azure Data Lake Storage Gen2, then change to abfs://home.
- Replace domainuser with your username for the domain.
- Replace ClusterShortName with the short name for the cluster. For example, if the cluster name is https://sechadoopcontoso.azurehdinsight.cn, the clustershortname is the first six characters of the cluster: sechad.
- Replace jdbcurlvalue with the JDBC URL from the Hive configuration. An example is jdbc:hive2://headnodehost:10001/;transportMode=http.
- To save the file, select Ctrl+X, enter Y, and then select Enter.

This properties file needs to be present locally when running Oozie jobs.
Create custom Hive scripts for Oozie jobs
You can create the two Hive scripts for Hive server 1 and Hive server 2 as shown in the following sections.
Hive server 1 file
Create and edit a file for Hive server 1 actions:
nano countrowshive1.hql
Create the script:
INSERT OVERWRITE DIRECTORY '${hiveOutputDirectory1}'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select devicemake from hivesampletable limit 2;
Save the file to Apache Hadoop Distributed File System (HDFS):
hdfs dfs -put countrowshive1.hql countrowshive1.hql
Hive server 2 file
Create and edit a file for Hive server 2 actions:
nano countrowshive2.hql
Create the script:
INSERT OVERWRITE DIRECTORY '${hiveOutputDirectory2}'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select devicemodel from hivesampletable limit 2;
Save the file to HDFS:
hdfs dfs -put countrowshive2.hql countrowshive2.hql
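As a quick check, you can list both scripts from your HDFS home directory (the relative paths assume the uploads shown previously):

hdfs dfs -ls countrowshive1.hql countrowshive2.hql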
Submit Oozie jobs
Submitting Oozie jobs for ESP clusters is like submitting Oozie jobs in non-ESP clusters.
For more information, see Use Apache Oozie with Apache Hadoop to define and run a workflow on Linux-based Azure HDInsight.
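For reference, a typical submission from an SSH session on the cluster looks like the following sketch. It assumes the Oozie server listens on the default port 11000 on the head node and that job.properties is in the current directory:

# Submit and start the workflow; the command prints a job ID
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

# Check the workflow status by using the job ID from the previous command
oozie job -oozie http://localhost:11000/oozie -info <JobID>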
Results from an Oozie job submission
Oozie jobs run on behalf of the submitting user, so both Apache Hadoop YARN and Apache Ranger audit logs show the jobs being run as the impersonated user. The command-line interface output of an Oozie job looks like the following code:
Job ID : 0000015-180626011240801-oozie-oozi-W
------------------------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path : adl://home/user/alicetest/examples/apps/map-reduce/wf.xml
Status : SUCCEEDED
Run : 0
User : alicetest
Group : -
Created : 2018-06-26 19:25 GMT
Started : 2018-06-26 19:25 GMT
Last Modified : 2018-06-26 19:30 GMT
Ended : 2018-06-26 19:30 GMT
CoordAction ID: -
Actions
------------------------------------------------------------------------------------------------
ID Status Ext ID ExtStatus ErrCode
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@:start: OK - OK -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@mr-test OK job_1529975666160_0051 SUCCEEDED -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@myHive2 OK job_1529975666160_0053 SUCCEEDED -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@myHive OK job_1529975666160_0055 SUCCEEDED -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@end OK - OK -
-----------------------------------------------------------------------------------------------
The Ranger audit logs for Hive server 2 actions show Oozie running the action for the user. The Ranger and YARN views are visible only to the cluster admin.
Configure user authorization in Oozie
Oozie has its own user authorization configuration that can block users from stopping or deleting other users' jobs. To enable this configuration, set oozie.service.AuthorizationService.security.enabled to true.
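The setting is an oozie-site property. On HDInsight, you would typically add it through Ambari (Oozie > Configs > Custom oozie-site) rather than editing the file directly; as a sketch, the property looks like this:

<property>
   <name>oozie.service.AuthorizationService.security.enabled</name>
   <value>true</value>
</property>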
For more information, see Apache Oozie Installation and Configuration.
For components like Hive server 1 where the Ranger plug-in isn't available or supported, only coarse-grained HDFS authorization is possible. Fine-grained authorization is available only through Ranger plug-ins.
Get the Oozie web UI
The Oozie web UI provides a web-based view into the status of Oozie jobs on the cluster. To get the web UI, take the following steps in ESP clusters:
Add an edge node and enable SSH Kerberos authentication.
Follow the Oozie web UI steps to enable SSH tunneling to the edge node and access the web UI.
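For example, one common approach is a dynamic port-forwarding (SOCKS) tunnel. The host name below is a placeholder for your edge node's SSH endpoint, and 9876 is an arbitrary local port:

# Open a SOCKS tunnel on local port 9876 to the edge node (placeholder host name)
ssh -C -D 9876 -N <DomainUserName>@<edgenode-ssh-endpoint>

After the tunnel is up, configure your browser to use the SOCKS proxy on localhost:9876 and browse to the Oozie web UI, which the Oozie server exposes on port 11000 on the active head node.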