Installing and Running Apache Oozie (3.2.x and possibly later)

Having heard about Apache Oozie for a while now, I decided to give it a try this weekend.  Oozie, an Apache incubator project, addresses the problem of Hadoop workflow management and scheduling.

The current stable version of Oozie, 3.2.0, is available for download as a tarball from any of the many Apache mirrors. Unfortunately, a Google search turns up no fewer than 10 different pages with some sort of installation instructions.

The one you will most likely land on is the Apache Incubator Quick Start page. Unfortunately, that guide has a lot of gaps, which made it very hard to follow and consumed much of my Sunday. Here’s what worked for me:


My Environment

  • Single-node Hadoop 0.20.205.0 cluster
  • Java 1.6
  • Maven 3.0.3 

Oozie only works with specific versions of Hadoop, 0.20.20x being among them. Also note that all software is installed and run as user sianand (me); this is, after all, my dev box. The instructions below should work on most flavors of Unix.


Installation Steps

Step 1 : Download and unpack an oozie distro (e.g. oozie-3.2.0-incubating.tar.gz)

I unpacked it under /Users/sianand/Apps. For the purpose of this tutorial, I will refer to /Users/sianand/Apps/oozie-3.2.0-incubating as $OOZIE_HOME.


Step 2 : Download ExtJS (e.g. ext-2.2.zip). Do not unzip it right away.

At this point, the instructions in the Quick Start Guide (a.k.a. QSG) jump straight to calling bin/setup.sh, which will not work. The QSG also specifies the path to setup.sh incorrectly.


Step 3 : Build an Oozie Distro

Under $OOZIE_HOME, you will find a Maven pom.xml file. Calling “mvn clean package” did not really lead me anywhere useful and “mvn tomcat:deploy” failed for some of the sub-modules.

Instead, I found this page helpful.

Run the following:

> cd $OOZIE_HOME
> ./bin/mkdistro.sh -DskipTests

If successful, you will see the following print out:

Oozie distro created, DATE[2012.07.02-01:38:05GMT] VC-REV[unavailable], available at [/Users/sianand/Apps/oozie-3.2.0-incubating/distro/target]
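
The distro path printed above shows up in almost every command that follows, so it can help to capture it once. The snippet below is a minimal sketch, assuming the same layout as my box; OOZIE_DISTRO is a name I made up for convenience, not a variable that Oozie itself reads.

```shell
# OOZIE_HOME as defined in Step 1; override it if you unpacked elsewhere.
OOZIE_HOME="${OOZIE_HOME:-$HOME/Apps/oozie-3.2.0-incubating}"

# OOZIE_DISTRO is a hypothetical helper name, not something Oozie looks for.
OOZIE_DISTRO="$OOZIE_HOME/distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating"
export OOZIE_HOME OOZIE_DISTRO

# Show where the setup script used in the later steps lives.
echo "$OOZIE_DISTRO/bin/oozie-setup.sh"
```

With this in place, a long path such as ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/bin/oozie-start.sh can be written as $OOZIE_DISTRO/bin/oozie-start.sh.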

Step 4 : Building with ExtJS library for Oozie Web Console

  • Copy ext-2.2.zip over to $OOZIE_HOME/webapp/src/main/webapp/
  • Unzip it


Step 5 : Run oozie-setup.sh to create an oozie.war file

> cd $OOZIE_HOME 

> ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/bin/oozie-setup.sh -hadoop 0.20.200 /Users/sianand/Apps/hadoop-0.20.205.0 -extjs ~/Apps/ext-2.2.zip

If successful, you will see the following print out:

New Oozie WAR file with added ‘Hadoop JARs, ExtJS library’ at /Users/sianand/Apps/oozie-3.2.0-incubating/distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/oozie-server/webapps/oozie.war

INFO: Oozie is ready to be started

Step 6 : Copy the newly minted oozie.war file to your Tomcat deployment directory

> cd $OOZIE_HOME
> cp ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/oozie-server/webapps/oozie.war  ./webapp/src/main/webapp/oozie.war 

Step 7 : Make a configuration change before starting the Oozie Tomcat App

If the derby database schemas have not been set up when you try to start the Tomcat App, Oozie Tomcat App startup will fail! You can get around this the following way:

> cd $OOZIE_HOME
> vi ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/conf/oozie-site.xml

Change the property below to true as shown. It is originally false, which causes startup to fail.

<property>
    <name>oozie.service.JPAService.create.db.schema</name>
    <value>true</value>
    <description>
        Creates Oozie DB.
        If set to true, it creates the DB schema if it does not exist. If the DB schema exists is a NOP.
        If set to false, it does not create the DB schema. If the DB schema does not exist it fails start up.
    </description>
</property>
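
To double-check the edit before starting the server, something like the following can print the value back. This is only a sketch: db_schema_flag is a hypothetical helper of mine, and it assumes the <value> element sits on its own line after the <name> element, as in the snippet above.

```shell
# db_schema_flag is a hypothetical helper: it prints the <value> that follows
# the create.db.schema <name> element in the given oozie-site.xml.
db_schema_flag() {
  awk '
    /<name>oozie\.service\.JPAService\.create\.db\.schema<\/name>/ { found = 1; next }
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")   # strip the tags, keep the text
      print
      exit
    }
  ' "$1"
}

# Assumes OOZIE_HOME is set as in Step 1; nothing is printed if the file is missing.
site="${OOZIE_HOME:-.}/distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/conf/oozie-site.xml"
if [ -f "$site" ]; then
  echo "create.db.schema = $(db_schema_flag "$site")"
fi
```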


Step 8 : Start the Oozie Server

> cd $OOZIE_HOME
> ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/bin/oozie-start.sh

If successful, the Oozie web console will be running at http://localhost:11000/oozie

Running the Examples


Running the examples was not without its challenges.

Step 1 : Add Oozie bin to the path

This step adds the directory containing the “oozie” executable to the system path. It will help you with the rest of the steps.

> export OOZIE_EXE_BIN=/Users/sianand/Apps/oozie-3.2.0-incubating/client/target/oozie-client-3.2.0-incubating-client/oozie-client-3.2.0-incubating/bin
> export PATH=$OOZIE_EXE_BIN:$PATH
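
A quick way to confirm that the client is on the path and can talk to the server is to ask for the server status. The sketch below assumes the server is running at the default URL from the installation steps; OOZIE_URL is the environment variable the oozie client falls back on when -oozie is not given.

```shell
# Default server URL from the installation steps; change it if yours differs.
export OOZIE_URL="http://localhost:11000/oozie"

if command -v oozie >/dev/null 2>&1; then
  # Prints the server's system mode when the server is reachable.
  oozie admin -oozie "$OOZIE_URL" -status || echo "Oozie server not reachable at $OOZIE_URL"
else
  echo "oozie is not on the PATH; re-check OOZIE_EXE_BIN"
fi
```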

Step 2 : Configure Hadoop’s Proxy User Settings

You need to set up Hadoop’s proxy user settings. This is a stumbling point for many people (myself included), as is apparent from the frequent posts on newsgroups and blogs. I found these two helpful.

In my case, since I am running both Hadoop and Oozie as “sianand” under the group “staff” on a single machine, I did the following:

> vi $HADOOP_HOME/conf/core-site.xml

I added the following configuration and restarted Hadoop.

<property>
     <name>hadoop.proxyuser.sianand.hosts</name>
     <value>localhost</value>
</property>

<property>
     <name>hadoop.proxyuser.sianand.groups</name>
     <value>staff</value>
</property>
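
If your user and group differ from mine, the same two properties can be generated rather than typed by hand. This is a sketch only: proxyuser_xml is a hypothetical helper, and it assumes, as on my box, that Oozie runs as your own login user on localhost.

```shell
# proxyuser_xml is a hypothetical helper that prints the two proxy user
# properties for a given user ($1) and group ($2).
proxyuser_xml() {
  cat <<EOF
<property>
     <name>hadoop.proxyuser.$1.hosts</name>
     <value>localhost</value>
</property>
<property>
     <name>hadoop.proxyuser.$1.groups</name>
     <value>$2</value>
</property>
EOF
}

# id -un and id -gn report the current user and its primary group.
proxyuser_xml "$(id -un)" "$(id -gn)"
```

Paste the output inside the <configuration> element of core-site.xml and restart Hadoop.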


Step 3 : Set the NN and JT ports in your Oozie Example job.properties file

In the Oozie example job.properties file, the JobTracker and NameNode ports are set to values that you are likely not using. Look up your JobTracker port in $HADOOP_HOME/conf/mapred-site.xml and your NameNode port in $HADOOP_HOME/conf/core-site.xml, then set both in the Oozie job.properties file.

In my case, my NameNode is running on port 9000 and my JobTracker on port 9001, the defaults for my Hadoop distribution.
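
The port change can also be scripted. The sketch below assumes the example job.properties uses the keys nameNode and jobTracker with host:port values, which is how I remember the shipped examples; with_ports is a hypothetical helper, and it writes to standard output instead of editing the file in place.

```shell
# with_ports is a hypothetical helper: it prints the job.properties file ($1)
# with the NameNode port set to $2 and the JobTracker port set to $3.
with_ports() {
  sed -e "s|^\(nameNode=hdfs://[^:]*\):[0-9]*|\1:$2|" \
      -e "s|^\(jobTracker=[^:]*\):[0-9]*|\1:$3|" "$1"
}

# Demo on a throwaway file; the starting ports here are placeholders.
sample="$(mktemp)"
printf 'nameNode=hdfs://localhost:8020\njobTracker=localhost:8021\n' > "$sample"
with_ports "$sample" 9000 9001
rm -f "$sample"
```

On a real file, redirect the output to a new file, check it, and then move it over the original.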

Step 4 : Follow the Steps

At this point, you should be ready to run the steps outlined at the link below.

http://incubator.apache.org/oozie/docs/3.1.3/docs/DG_Examples.html#Command_Line_Examples
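
For orientation, submitting the map-reduce example from the command line looks roughly like the sketch below. It assumes the examples tarball was unpacked into ./examples in the current directory and that the server is at the default URL; the linked page remains the reference for the full sequence, including copying the examples to HDFS first.

```shell
# Fall back to the default server URL if OOZIE_URL is not already set.
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"

# Assumed location of the map-reduce example's properties file.
JOB_PROPS="examples/apps/map-reduce/job.properties"

if command -v oozie >/dev/null 2>&1 && [ -f "$JOB_PROPS" ]; then
  # Submits and starts the workflow job.
  oozie job -oozie "$OOZIE_URL" -config "$JOB_PROPS" -run
else
  echo "skipping: oozie client or $JOB_PROPS not found"
fi
```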

It helped me to tail the oozie.log while debugging my problems:

> cd $OOZIE_HOME
> tail -f ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/logs/oozie.log

Step N : Stopping Oozie

> cd $OOZIE_HOME
> ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/bin/oozie-stop.sh

Closing Thoughts

At LinkedIn, we use Azkaban, another tool to schedule and manage Hadoop Workflows. Azkaban is much easier to set up and use, while offering less powerful semantics around scheduling and dependencies. I can’t help but feel there is a lot of room for improvement in the middle. 

-s


About Me
A blog describing my work in building websites that hundreds of millions of people visit. I'm Chief Architect at ClipMine, an innovative video mining and search company. I previously held technical and leadership roles at LinkedIn, Netflix, Etsy, eBay & Siebel Systems. In addition to the nerdy stuff, I've included some stunning photography for your pure enjoyment!