Thursday, May 31, 2012

Quick Hadoop 2-node Cluster Setup for Oracle DBAs


OS Version: Red Hat Enterprise Linux Server release 5.6 
Hadoop Version: 1.0.1
Hosts: host1, host2


Step 1: Install Java on the Hosts

yum install java-1.6.0-openjdk-devel.x86_64

Check the Java version:
java -version

Step 2: Create OS Group and User on hosts

Add Group:
/usr/sbin/groupadd hdfs

Add User:
/usr/sbin/useradd -G hdfs -d /home/hadoop hadoop

Change user password:
passwd hadoop

Step 3: Set up passwordless SSH between the hadoop users on both hosts

Log in as the hadoop user on both hosts.

hadoop@host1:
ssh-keygen -t rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Copy $HOME/.ssh/authorized_keys to host2:/tmp/

hadoop@host2:
ssh-keygen -t rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
cat /tmp/authorized_keys >> $HOME/.ssh/authorized_keys
Now copy $HOME/.ssh/authorized_keys back to host1:/tmp/
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/*

hadoop@host1:
cp /tmp/authorized_keys $HOME/.ssh/authorized_keys
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/*

Verify that both hosts can reach each other without a password prompt:

hadoop@host1:
ssh host1
ssh host2
hadoop@host2:
ssh host1
ssh host2
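On systems that ship ssh-copy-id (part of the openssh-clients package), the manual key copying above can be condensed. A sketch, assuming the same hadoop user and host names:

```shell
# Run on each host as the hadoop user, after ssh-keygen -t rsa.
# ssh-copy-id appends the local public key to the remote
# authorized_keys and fixes permissions on the remote .ssh dir.
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@host1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@host2
```

Each invocation prompts for the hadoop password once; after that, ssh between the hosts is passwordless.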

Passwordless SSH is now set up between the hadoop users on both hosts.

Step 4: Download and install hadoop

Download location: http://www.apache.org/dyn/closer.cgi/hadoop/common/
Download the release of your choice.
Download hadoop-1.0.1.tar.gz, extract it, and rename the directory:

tar -xzf hadoop-1.0.1.tar.gz
mv hadoop-1.0.1 hadoop

Hadoop location: /home/hadoop/hadoop

Step 5: Configure Hadoop


5.a: Update JAVA_HOME in /home/hadoop/hadoop/conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64

5.b: Add the following entries to /home/hadoop/hadoop/conf/core-site.xml, inside the <configuration> element. On a multi-node cluster, use the master host's name or IP address in place of localhost, on both hosts:



<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
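For reference, these properties sit inside a <configuration> element in the actual file. A minimal sketch that writes the complete file (the target directory defaults to the current directory here; in this setup it would be /home/hadoop/hadoop/conf):

```shell
# Write a complete core-site.xml with the two properties above.
# CONF_DIR defaults to "." so the sketch can be tried anywhere;
# point HADOOP_CONF_DIR at the real conf directory in practice.
CONF_DIR=${HADOOP_CONF_DIR:-.}
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF
```

The other conf files (mapred-site.xml, hdfs-site.xml) follow the same shape.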



5.c: Add the following to /home/hadoop/hadoop/conf/mapred-site.xml, inside the <configuration> element; again, on a multi-node cluster use the master host's name or IP address in place of localhost, on both hosts:


<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>



5.d: Add the following to /home/hadoop/hadoop/conf/hdfs-site.xml, inside the <configuration> element:



<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>


5.e: Update /home/hadoop/hadoop/conf/masters with the name of the master node (note that in Hadoop 1.x this file actually lists the hosts that run the SecondaryNameNode):

[hadoop@host1 conf]$ more masters
host1

5.f: Update /home/hadoop/hadoop/conf/slaves with the names of the slave nodes:

[hadoop@host1 conf]$ more slaves
host1
host2

5.g: Format name node


[hadoop@host1 ~]$ hadoop namenode -format
12/05/09 09:42:03 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = host1/66.228.160.91
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1243785; compiled by 'hortonfo' on Tue Feb 14 08:15:38 UTC 2012
************************************************************/
12/05/09 09:42:03 INFO util.GSet: VM type       = 64-bit
12/05/09 09:42:03 INFO util.GSet: 2% max memory = 17.77875 MB
12/05/09 09:42:03 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/05/09 09:42:03 INFO util.GSet: recommended=2097152, actual=2097152
12/05/09 09:42:03 INFO namenode.FSNamesystem: fsOwner=hadoop
12/05/09 09:42:03 INFO namenode.FSNamesystem: supergroup=supergroup
12/05/09 09:42:03 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/05/09 09:42:03 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/05/09 09:42:03 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/05/09 09:42:03 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/05/09 09:42:03 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/05/09 09:42:03 INFO common.Storage: Storage directory /home/hadoop/tmp/dfs/name has been successfully formatted.
12/05/09 09:42:03 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at host1/66.228.160.91
************************************************************/


Step 6: Startup hadoop


[hadoop@host1 bin]$ ./start-all.sh 
starting namenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-namenode-host1.out
host2: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-host2.out
host1: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-host1.out
host1: starting secondarynamenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-host1.out
host1: Exception in thread "main" java.io.IOException: Cannot lock storage /home/hadoop/temp/dfs/namesecondary. The directory is already locked.
host1: at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
host1: at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:615)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:175)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:129)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:567)
starting jobtracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-host1.out
host2: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-host2.out
host1: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-host1.out
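Note the SecondaryNameNode exception in the output above: "Cannot lock storage ... already locked" usually means a stale checkpoint directory left behind by an earlier run (also note the path in the error is /home/hadoop/temp, while core-site.xml points hadoop.tmp.dir at /home/hadoop/tmp). One way to recover, assuming the directory named in the error holds only checkpoint data you can discard:

```shell
# Stop all daemons, remove the stale secondary checkpoint
# directory (path taken from the exception text -- verify it
# before deleting), then restart.
./stop-all.sh
rm -rf /home/hadoop/temp/dfs/namesecondary
./start-all.sh
```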


Checks:
[hadoop@host1 bin]$ jps
20358 JobTracker
20041 DataNode
20528 TaskTracker
19869 NameNode
27905 Jps
26450 oc4j.jar
20229 SecondaryNameNode

[hadoop@host2 ~]$ jps
12522 Jps
1737 TaskTracker
1572 DataNode

jps - Java Virtual Machine Process Status Tool; it lists the running Java VMs.

Step 7: Running a basic Map Reduce Test


I am using the sample test from: http://wiki.apache.org/hadoop/WordCount

7.a: Create the input directory in HDFS:

[hadoop@host1 ~]$ hadoop dfs -mkdir input

(Do not pre-create the output directory: the job creates it itself and fails if it already exists. Remove it with hadoop dfs -rmr output before re-running.)

Hadoop shell reference: http://hadoop.apache.org/common/docs/r0.17.1/hdfs_shell.html

7.b: Copy the WordCount.java source from the page above to /home/hadoop/WordCount.java and compile it.

Create the directory wordcount for the compiled classes:
mkdir -p /home/hadoop/wordcount

[hadoop@host1 ~]$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.1.jar -d wordcount WordCount.java
[hadoop@host1 ~]$ jar -cvf /home/hadoop/wordcount.jar -C wordcount/ .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
adding: org/myorg/WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)

This will create the jar file to be used in the test.

7.c: Create input files:

file1:
Hello World Bye World
file2:
Hello Hadoop Goodbye Hadoop
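The two files can be created locally with echo, matching the contents above:

```shell
# Create the two sample input files in the local home directory.
echo "Hello World Bye World" > file1
echo "Hello Hadoop Goodbye Hadoop" > file2
```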

7.d: Copy the files over to Hadoop:

[hadoop@host1 ~]$ hadoop dfs -copyFromLocal file1 input
[hadoop@host1 ~]$ hadoop dfs -copyFromLocal file2 input
[hadoop@host1 ~]$ hadoop dfs -ls input
Found 2 items
-rw-r--r--   1 hadoop supergroup         23 2012-05-09 09:44 /user/hadoop/input/file1
-rw-r--r--   1 hadoop supergroup         28 2012-05-09 09:44 /user/hadoop/input/file2

7.e: Run the Map Reduce Java program:


[hadoop@host1 ~]$ hadoop jar /home/hadoop/wordcount.jar org.myorg.WordCount /user/hadoop/input /user/hadoop/output
12/05/30 16:03:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/05/30 16:03:44 INFO mapred.FileInputFormat: Total input paths to process : 2
12/05/30 16:03:44 INFO mapred.JobClient: Running job: job_201205091119_0004
12/05/30 16:03:45 INFO mapred.JobClient:  map 0% reduce 0%
12/05/30 16:03:58 INFO mapred.JobClient:  map 100% reduce 0%
12/05/30 16:04:10 INFO mapred.JobClient:  map 100% reduce 100%
12/05/30 16:04:15 INFO mapred.JobClient: Job complete: job_201205091119_0004
12/05/30 16:04:15 INFO mapred.JobClient: Counters: 30
12/05/30 16:04:15 INFO mapred.JobClient:   Job Counters 
12/05/30 16:04:16 INFO mapred.JobClient:     Launched reduce tasks=1
12/05/30 16:04:16 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=19996
12/05/30 16:04:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/05/30 16:04:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/05/30 16:04:16 INFO mapred.JobClient:     Launched map tasks=3
12/05/30 16:04:16 INFO mapred.JobClient:     Data-local map tasks=3
12/05/30 16:04:16 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10031
12/05/30 16:04:16 INFO mapred.JobClient:   File Input Format Counters 
12/05/30 16:04:16 INFO mapred.JobClient:     Bytes Read=55
12/05/30 16:04:16 INFO mapred.JobClient:   File Output Format Counters 
12/05/30 16:04:16 INFO mapred.JobClient:     Bytes Written=41
12/05/30 16:04:16 INFO mapred.JobClient:   FileSystemCounters
12/05/30 16:04:16 INFO mapred.JobClient:     FILE_BYTES_READ=79
12/05/30 16:04:16 INFO mapred.JobClient:     HDFS_BYTES_READ=412
12/05/30 16:04:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=86679
12/05/30 16:04:16 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=41
12/05/30 16:04:16 INFO mapred.JobClient:   Map-Reduce Framework
12/05/30 16:04:16 INFO mapred.JobClient:     Map output materialized bytes=91
12/05/30 16:04:16 INFO mapred.JobClient:     Map input records=2
12/05/30 16:04:16 INFO mapred.JobClient:     Reduce shuffle bytes=51
12/05/30 16:04:16 INFO mapred.JobClient:     Spilled Records=12
12/05/30 16:04:16 INFO mapred.JobClient:     Map output bytes=82
12/05/30 16:04:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=443023360
12/05/30 16:04:16 INFO mapred.JobClient:     CPU time spent (ms)=1440
12/05/30 16:04:16 INFO mapred.JobClient:     Map input bytes=51
12/05/30 16:04:16 INFO mapred.JobClient:     SPLIT_RAW_BYTES=357
12/05/30 16:04:16 INFO mapred.JobClient:     Combine input records=8
12/05/30 16:04:16 INFO mapred.JobClient:     Reduce input records=6
12/05/30 16:04:16 INFO mapred.JobClient:     Reduce input groups=5
12/05/30 16:04:16 INFO mapred.JobClient:     Combine output records=6
12/05/30 16:04:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=576110592
12/05/30 16:04:16 INFO mapred.JobClient:     Reduce output records=5
12/05/30 16:04:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2767060992
12/05/30 16:04:16 INFO mapred.JobClient:     Map output records=8


7.f: Check for the output in HDFS:

[hadoop@host1 ~]$ hadoop dfs -ls output
Found 3 items
-rw-r--r--   2 hadoop supergroup          0 2012-05-09 11:22 /user/hadoop/output/_SUCCESS
drwxr-xr-x   - hadoop supergroup          0 2012-05-09 11:22 /user/hadoop/output/_logs
-rw-r--r--   2 hadoop supergroup         41 2012-05-09 11:22 /user/hadoop/output/part-00000

7.g: Check the output:

[hadoop@host1 ~]$ hadoop dfs -cat /user/hadoop/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
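As a sanity check, the same counts can be reproduced locally with standard shell tools (the input files are recreated here so the sketch is self-contained):

```shell
# Recreate the sample inputs, then split on whitespace, sort, and
# count duplicates -- a shell equivalent of the WordCount job's
# map and reduce phases.
echo "Hello World Bye World" > file1
echo "Hello Hadoop Goodbye Hadoop" > file2
cat file1 file2 | tr -s ' ' '\n' | sort | uniq -c | awk '{print $2, $1}'
```

The output should match the part-00000 listing above.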

This completes the quick setup of a two-node Hadoop cluster.

While setting this up and learning, I referred to the following links. My thanks and regards for all the help:

For single node setup: http://hadoop.apache.org/common/docs/current/single_node_setup.html#Local
For cluster setup: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://hortonworks.com/blog/set-up-apache-hadoop-in-minutes-with-rpms/

Comments:

Radu Laurentiu Ialovoi said...

Very good material. I was able to setup starting from here.

One comment:

In the configuration files (core-site.xml and mapred-site.xml) the addresses should not be localhost.

Here:
hdfs://localhost:54310
and here:
localhost:54311

replace localhost with your host1 ip address. In both hosts.

Regards,
Yalos
