Installing and Running Apache Oozie (3.2.x and possibly later)
Having heard about Apache Oozie for a while now, I decided to give it a try this weekend. Oozie, an Apache incubator project, addresses the problem of Hadoop workflow management and scheduling.
The current stable version of Oozie, 3.2.0, is available for download as a tarball from one of the many Apache mirrors. Unfortunately, when it comes to installation instructions, Google turns up no fewer than 10 different pages, each with its own take.
The one you will most likely land on is the Apache Incubator Quick Start page. Unfortunately, this Quick Start guide has a lot of gaps and, frankly, is not up to snuff. The gaps made it very hard to follow and consumed much of my Sunday. Here’s what worked for me:
- Single-node Hadoop 0.20.205.0 cluster
- Java 1.6
- Maven 3.0.3
Oozie only works with specific versions of Hadoop, 0.20.20x being among them. Another point to note is that all software is installed and run as user sianand (me). This is after all my dev box. Instructions below work on flavors of Unix.
Step 1 : Download and unpack an oozie distro (e.g. oozie-3.2.0-incubating.tar.gz)
I unpacked it under /Users/sianand/Apps. For the purpose of this tutorial, I will refer to /Users/sianand/Apps/oozie-3.2.0-incubating as $OOZIE_HOME.
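Since the later steps repeat some very long paths, it may help to set them as shell variables once. This is a minimal sketch assuming the unpack location above; `OOZIE_DISTRO` is a name I am introducing for convenience, not something Oozie defines.

```shell
# Location where the tarball was unpacked (adjust to your own path).
export OOZIE_HOME=/Users/sianand/Apps/oozie-3.2.0-incubating

# Hypothetical convenience variable (not defined by Oozie): the directory
# that the distro build in Step 3 produces under $OOZIE_HOME.
export OOZIE_DISTRO="$OOZIE_HOME/distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating"

# Show where the setup, start and stop scripts will live after the build.
echo "$OOZIE_DISTRO/bin"
```

With these set, a path such as the one to oozie-setup.sh shortens to $OOZIE_DISTRO/bin/oozie-setup.sh.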
Step 2 : Download ExtJS (e.g. extjs-2.2.zip). Do not unzip it right away.
At this point, the instructions in the Quick Start Guide (a.k.a. QSG) jump to calling bin/setup.sh, which will not work. Also, the path to setup.sh is incorrectly specified in the QSG.
Step 3 : Build an Oozie Distro
Under $OOZIE_HOME, you will find a Maven pom.xml file. Calling “mvn clean package” did not really lead me anywhere useful and “mvn tomcat:deploy” failed for some of the sub-modules.
Instead, I found this page helpful.
Run the following:
> cd $OOZIE_HOME
> bin/mkdistro.sh -DskipTests
If successful, you will see the following print out:
Oozie distro created, DATE[2012.07.02-01:38:05GMT] VC-REV[unavailable], available at [/Users/sianand/Apps/oozie-3.2.0-incubating/distro/target]
Step 4 : Building with ExtJS library for Oozie Web Console
- Copy the extjs-2.2.zip over to $OOZIE_HOME/webapp/src/main/webapp/
- Unzip it
Step 5 : Run oozie-setup.sh to create an oozie.war file
> cd $OOZIE_HOME
> ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/bin/oozie-setup.sh -hadoop 0.20.200 /Users/sianand/Apps/hadoop-0.20.205.0 -extjs ~/Apps/ext-2.2.zip
New Oozie WAR file with added ‘Hadoop JARs, ExtJS library’ at /Users/sianand/Apps/oozie-3.2.0-incubating/distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/oozie-server/webapps/oozie.war
INFO: Oozie is ready to be started
Step 6 : Copy the newly minted oozie.war file to your Tomcat deployment directory
> cd $OOZIE_HOME
> cp ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/oozie-server/webapps/oozie.war ./webapp/src/main/webapp/oozie.war
Step 7 : Make a configuration change before starting the Oozie Tomcat App
If the derby database schemas have not been set up when you try to start the Tomcat App, Oozie Tomcat App startup will fail! You can get around this the following way:
> vi ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/conf/oozie-site.xml
Change the property below to true as shown. It is originally false, which causes startup to fail.
<property>
  <name>oozie.service.JPAService.create.db.schema</name>
  <value>true</value>
  <description>
    Creates Oozie DB.
    If set to true, it creates the DB schema if it does not exist. If the DB schema exists is a NOP.
    If set to false, it does not create the DB schema. If the DB schema does not exist it fails start up.
  </description>
</property>
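If you would rather script this change than make it in vi, something like the following should do it. It is only a sketch: the function name is made up, and it assumes oozie-site.xml keeps each `<name>` and `<value>` element on its own line with the value written exactly as `<value>false</value>`.

```shell
# Hypothetical helper (not part of Oozie): flip
# oozie.service.JPAService.create.db.schema from false to true.
# Assumes <name> and <value> sit on separate lines inside the <property>.
enable_schema_creation() {
  site_file="$1"
  tmp_file="$site_file.tmp"
  # Restrict the substitution to the lines between the matching <name>
  # and the </property> that closes it, so other properties are untouched.
  sed '/<name>oozie\.service\.JPAService\.create\.db\.schema<\/name>/,/<\/property>/ s|<value>false</value>|<value>true</value>|' \
    "$site_file" > "$tmp_file" && mv "$tmp_file" "$site_file"
}

# Example call (path as used in this step):
# enable_schema_creation ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/conf/oozie-site.xml
```

The sketch writes to a temporary file and moves it into place rather than using sed's in-place flag, because that flag takes different arguments on Mac OS X and Linux.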
Step 8 : Start the Oozie Server
> cd $OOZIE_HOME
> ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/bin/oozie-start.sh
Running the Examples
Running the examples was not without its challenges.
> export OOZIE_EXE_BIN=/Users/sianand/Apps/oozie-3.2.0-incubating/client/target/oozie-client-3.2.0-incubating-client/oozie-client-3.2.0-incubating/bin
> export PATH=$OOZIE_EXE_BIN:$PATH
For my case, since I am running both Hadoop and Oozie as “sianand” on my single machine and under the group “staff”, I did the following:
> vi $HADOOP_HOME/conf/core-site.xml
I added the following configuration and restarted Hadoop.
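The settings involved are Hadoop's proxy-user properties, which allow the user running the Oozie server to submit jobs on behalf of other users. Here is a minimal sketch that prints them for pasting inside the `<configuration>` element of core-site.xml. The user and group come from my setup above; `localhost` as the permitted host and the function name are assumptions for illustration.

```shell
# Values from my single-machine setup; change them for yours.
OOZIE_USER=sianand     # user the Oozie server runs as
OOZIE_HOST=localhost   # assumption: Oozie and Hadoop share one machine
OOZIE_GROUP=staff      # group whose members may be impersonated

# Hypothetical helper: print the two proxy-user properties so they can be
# pasted inside <configuration> in $HADOOP_HOME/conf/core-site.xml.
print_proxyuser_config() {
  cat <<EOF
<property>
  <name>hadoop.proxyuser.${OOZIE_USER}.hosts</name>
  <value>${OOZIE_HOST}</value>
</property>
<property>
  <name>hadoop.proxyuser.${OOZIE_USER}.groups</name>
  <value>${OOZIE_GROUP}</value>
</property>
EOF
}

print_proxyuser_config
```

Hadoop only picks these up after a restart.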
Step 4 : Set the NN and JT ports in your Oozie Example job.properties file
In the Oozie example job.properties file, the JobTracker and NameNode ports are set to values that you are likely not using. Look up your Job Tracker port in $HADOOP_HOME/conf/mapred-site.xml and your Name Node port in $HADOOP_HOME/conf/core-site.xml, then set both in the Oozie job.properties file.
For my case, my Name Node is running on 9000 and my Job Tracker is running on 9001, the defaults for my Hadoop distribution.
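The port change can also be scripted. The sketch below assumes the example job.properties uses the keys `nameNode` and `jobTracker` and that both services run on `localhost`; the function name is hypothetical.

```shell
# Hypothetical helper: point an Oozie example job.properties at the
# NameNode (9000) and JobTracker (9001) ports used on my machine.
# Assumes the file contains lines starting with nameNode= and jobTracker=.
set_example_ports() {
  props_file="$1"
  tmp_file="$props_file.tmp"
  sed -e 's|^nameNode=.*|nameNode=hdfs://localhost:9000|' \
      -e 's|^jobTracker=.*|jobTracker=localhost:9001|' \
      "$props_file" > "$tmp_file" && mv "$tmp_file" "$props_file"
}

# Example call (the path is a placeholder for whichever example you run):
# set_example_ports /path/to/examples/apps/map-reduce/job.properties
```

Any other lines in the file are left untouched.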
Step 5 : Follow the Steps
At this point, you should be good to go to run the steps outlined in the link below.
> cd $OOZIE_HOME
> tail -f ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/logs/oozie.log
> cd $OOZIE_HOME
> ./distro/target/oozie-3.2.0-incubating-distro/oozie-3.2.0-incubating/bin/oozie-stop.sh
At LinkedIn, we use Azkaban, another tool to schedule and manage Hadoop Workflows. Azkaban is much easier to set up and use, while offering less powerful semantics around scheduling and dependencies. I can’t help but feel there is a lot of room for improvement in the middle.