OS Version: Red Hat Enterprise Linux Server release 5.6
Hadoop Version: 1.0.1
Hosts: host1, host2
Step 1: Install Java on the Hosts
yum install java-1.6.0-openjdk-devel.x86_64
Check the Java version:
java -version
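The same package has to be installed on both hosts. A minimal sketch of doing the host2 install remotely from host1, assuming root SSH access to host2 and the same OpenJDK package name there:
ssh root@host2 'yum -y install java-1.6.0-openjdk-devel.x86_64'
ssh root@host2 'java -version'   # confirm the same JVM on host2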
Step 2: Create OS Group and User on hosts
Add Group:
/usr/sbin/groupadd hdfs
Add User:
/usr/sbin/useradd -G hdfs -d /home/hadoop hadoop
Change user password:
passwd hadoop
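Run the same groupadd, useradd, and passwd commands on host2 as well, so the hadoop account exists identically on both hosts. A quick sanity check on each host (the numeric uid/gid values will differ from system to system):
id hadoop
# expected to show something like: uid=...(hadoop) gid=...(hadoop) groups=...(hdfs)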
Step 3: Set up passwordless SSH between the hadoop users on both hosts
Log in as the hadoop user on both hosts.
hadoop@host1:
ssh-keygen -t rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Copy the file $HOME/.ssh/authorized_keys to /tmp/ on host2.
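One way to do the copy is scp; it will prompt for the hadoop password on host2, since the keys are not trusted yet:
scp $HOME/.ssh/authorized_keys hadoop@host2:/tmp/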
hadoop@host2:
ssh-keygen -t rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
cat /tmp/authorized_keys >> $HOME/.ssh/authorized_keys
Now copy $HOME/.ssh/authorized_keys back to /tmp on host1.
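Again, scp works for the copy back to host1:
scp $HOME/.ssh/authorized_keys hadoop@host1:/tmp/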
chmod 700 $HOME/.ssh
cd $HOME/.ssh
chmod 600 *
hadoop@host1:
cp /tmp/authorized_keys $HOME/.ssh/authorized_keys
chmod 700 $HOME/.ssh
cd $HOME/.ssh
chmod 600 *
ssh host1
ssh host2
hadoop@host2:
ssh host1
ssh host2
This completes the passwordless SSH setup for the hadoop user on both hosts.
Step 4: Download and install hadoop
Download location: http://www.apache.org/dyn/closer.cgi/hadoop/common/
Download the release of your choice.
Download hadoop-1.0.1.tar.gz, then gunzip and untar it.
mv hadoop-1.0.1 hadoop
Hadoop location: /home/hadoop/hadoop
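A minimal sketch of this step as shell commands, run as the hadoop user. The archive URL is an assumption (any mirror from the download page above works), and adding the bin directory to PATH is optional but lets the later steps call hadoop without a full path:
cd /home/hadoop
wget http://archive.apache.org/dist/hadoop/common/hadoop-1.0.1/hadoop-1.0.1.tar.gz
tar -xzf hadoop-1.0.1.tar.gz       # gunzip and untar in one step
mv hadoop-1.0.1 hadoop             # Hadoop now lives in /home/hadoop/hadoop
echo 'export PATH=$PATH:/home/hadoop/hadoop/bin' >> ~/.bash_profile
source ~/.bash_profile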
Step 5: Configure Hadoop
5.a: Update JAVA_HOME in /home/hadoop/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64
5.b: Add the following entries to /home/hadoop/hadoop/conf/core-site.xml. Since this is a two-node cluster, fs.default.name should point at the master host (host1) rather than localhost, otherwise the DataNode on host2 cannot reach the NameNode.
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://host1:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
5.c: Add the following to /home/hadoop/hadoop/conf/mapred-site.xml, again using the master hostname so the TaskTracker on host2 can reach the JobTracker:
<property>
<name>mapred.job.tracker</name>
<value>host1:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
5.d: Add the following to /home/hadoop/hadoop/conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
5.e: Update /home/hadoop/hadoop/conf/masters with the name of the master node
[hadoop@host1 conf]$ more masters
host1
5.f: Update /home/hadoop/hadoop/conf/slaves with the names of the slave nodes
[hadoop@host1 conf]$ more slaves
host1
host2
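host2 needs the same installation and configuration before anything is started. A sketch, assuming identical paths on both hosts: copy the whole hadoop directory (conf included) to host2 and create the hadoop.tmp.dir on both nodes:
mkdir -p /home/hadoop/tmp
scp -r /home/hadoop/hadoop hadoop@host2:/home/hadoop/
ssh hadoop@host2 'mkdir -p /home/hadoop/tmp'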
5.g: Format the NameNode
[hadoop@host1 ~]$ hadoop namenode -format
12/05/09 09:42:03 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = host1/66.228.160.91
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1243785; compiled by 'hortonfo' on Tue Feb 14 08:15:38 UTC 2012
************************************************************/
12/05/09 09:42:03 INFO util.GSet: VM type = 64-bit
12/05/09 09:42:03 INFO util.GSet: 2% max memory = 17.77875 MB
12/05/09 09:42:03 INFO util.GSet: capacity = 2^21 = 2097152 entries
12/05/09 09:42:03 INFO util.GSet: recommended=2097152, actual=2097152
12/05/09 09:42:03 INFO namenode.FSNamesystem: fsOwner=hadoop
12/05/09 09:42:03 INFO namenode.FSNamesystem: supergroup=supergroup
12/05/09 09:42:03 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/05/09 09:42:03 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/05/09 09:42:03 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/05/09 09:42:03 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/05/09 09:42:03 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/05/09 09:42:03 INFO common.Storage: Storage directory /home/hadoop/tmp/dfs/name has been successfully formatted.
12/05/09 09:42:03 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at host1/66.228.160.91
************************************************************/
Step 6: Start Hadoop
[hadoop@host1 bin]$ ./start-all.sh
starting namenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-namenode-host1.out
host2: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-host2.out
host1: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-host1.out
host1: starting secondarynamenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-host1.out
host1: Exception in thread "main" java.io.IOException: Cannot lock storage /home/hadoop/temp/dfs/namesecondary. The directory is already locked.
host1: at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
host1: at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:615)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:175)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:129)
host1: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:567)
starting jobtracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-host1.out
host2: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-host2.out
host1: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-host1.out
Note: the SecondaryNameNode exception above ("Cannot lock storage ... already locked") usually means a stale lock from a previous run; stopping Hadoop, removing the in_use.lock file (or the namesecondary directory itself) under the checkpoint directory, and running start-all.sh again normally clears it.
Checks:
[hadoop@host1 bin]$ jps
20358 JobTracker
20041 DataNode
20528 TaskTracker
19869 NameNode
27905 Jps
26450 oc4j.jar
20229 SecondaryNameNode
[hadoop@host2 ~]$ jps
12522 Jps
1737 TaskTracker
1572 DataNode
jps - Java Virtual Machine Process Status Tool; it lists the Java VMs that are running.
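Another quick health check is to ask the NameNode for a datanode report; both hosts should be listed as live nodes. The NameNode and JobTracker web UIs (http://host1:50070 and http://host1:50030 by default) show the same information in a browser:
[hadoop@host1 ~]$ hadoop dfsadmin -report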
Step 7: Running a Basic MapReduce Test
I am using the sample test from: http://wiki.apache.org/hadoop/WordCount
7.a: Create the input directory in HDFS:
[hadoop@host1 ~]$ hadoop dfs -mkdir input
There is no need to pre-create the output directory; the WordCount job creates it itself and will fail if it already exists.
Hadoop shell reference: http://hadoop.apache.org/common/docs/r0.17.1/hdfs_shell.html
7.b: Copy the source code from the wiki page to /home/hadoop/WordCount.java and compile it
Create the directory wordcount for the compiled classes:
mkdir -p /home/hadoop/wordcount
[hadoop@host1 ~]$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.1.jar -d wordcount WordCount.java
[hadoop@host1 ~]$ jar -cvf /home/hadoop/wordcount.jar -C wordcount/ .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
adding: org/myorg/WordCount.class(in = 1546) (out= 749)(deflated 51%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
This will create the jar file to be used in the test.
7.c: Create input files:
file1:
Hello World Bye World
file2:
Hello Hadoop Goodbye Hadoop
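The two files can be created locally with a couple of echo commands (contents exactly as above):
echo "Hello World Bye World" > file1
echo "Hello Hadoop Goodbye Hadoop" > file2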
7.d: Copy the files over to Hadoop:
[hadoop@host1 ~]$ hadoop dfs -copyFromLocal file1 input
[hadoop@host1 ~]$ hadoop dfs -copyFromLocal file2 input
[hadoop@host1 ~]$ hadoop dfs -ls input
Found 2 items
-rw-r--r-- 1 hadoop supergroup 23 2012-05-09 09:44 /user/hadoop/input/file1
-rw-r--r-- 1 hadoop supergroup 28 2012-05-09 09:44 /user/hadoop/input/file2
7.e: Run the MapReduce program:
[hadoop@host1 ~]$ hadoop jar /home/hadoop/wordcount.jar org.myorg.WordCount /user/hadoop/input /user/hadoop/output
12/05/30 16:03:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/05/30 16:03:44 INFO mapred.FileInputFormat: Total input paths to process : 2
12/05/30 16:03:44 INFO mapred.JobClient: Running job: job_201205091119_0004
12/05/30 16:03:45 INFO mapred.JobClient: map 0% reduce 0%
12/05/30 16:03:58 INFO mapred.JobClient: map 100% reduce 0%
12/05/30 16:04:10 INFO mapred.JobClient: map 100% reduce 100%
12/05/30 16:04:15 INFO mapred.JobClient: Job complete: job_201205091119_0004
12/05/30 16:04:15 INFO mapred.JobClient: Counters: 30
12/05/30 16:04:15 INFO mapred.JobClient: Job Counters
12/05/30 16:04:16 INFO mapred.JobClient: Launched reduce tasks=1
12/05/30 16:04:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=19996
12/05/30 16:04:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/05/30 16:04:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/05/30 16:04:16 INFO mapred.JobClient: Launched map tasks=3
12/05/30 16:04:16 INFO mapred.JobClient: Data-local map tasks=3
12/05/30 16:04:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10031
12/05/30 16:04:16 INFO mapred.JobClient: File Input Format Counters
12/05/30 16:04:16 INFO mapred.JobClient: Bytes Read=55
12/05/30 16:04:16 INFO mapred.JobClient: File Output Format Counters
12/05/30 16:04:16 INFO mapred.JobClient: Bytes Written=41
12/05/30 16:04:16 INFO mapred.JobClient: FileSystemCounters
12/05/30 16:04:16 INFO mapred.JobClient: FILE_BYTES_READ=79
12/05/30 16:04:16 INFO mapred.JobClient: HDFS_BYTES_READ=412
12/05/30 16:04:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=86679
12/05/30 16:04:16 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=41
12/05/30 16:04:16 INFO mapred.JobClient: Map-Reduce Framework
12/05/30 16:04:16 INFO mapred.JobClient: Map output materialized bytes=91
12/05/30 16:04:16 INFO mapred.JobClient: Map input records=2
12/05/30 16:04:16 INFO mapred.JobClient: Reduce shuffle bytes=51
12/05/30 16:04:16 INFO mapred.JobClient: Spilled Records=12
12/05/30 16:04:16 INFO mapred.JobClient: Map output bytes=82
12/05/30 16:04:16 INFO mapred.JobClient: Total committed heap usage (bytes)=443023360
12/05/30 16:04:16 INFO mapred.JobClient: CPU time spent (ms)=1440
12/05/30 16:04:16 INFO mapred.JobClient: Map input bytes=51
12/05/30 16:04:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=357
12/05/30 16:04:16 INFO mapred.JobClient: Combine input records=8
12/05/30 16:04:16 INFO mapred.JobClient: Reduce input records=6
12/05/30 16:04:16 INFO mapred.JobClient: Reduce input groups=5
12/05/30 16:04:16 INFO mapred.JobClient: Combine output records=6
12/05/30 16:04:16 INFO mapred.JobClient: Physical memory (bytes) snapshot=576110592
12/05/30 16:04:16 INFO mapred.JobClient: Reduce output records=5
12/05/30 16:04:16 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2767060992
12/05/30 16:04:16 INFO mapred.JobClient: Map output records=8
7.f: Check for the output in hadoop:
[hadoop@host1 ~]$ hadoop dfs -ls output
Found 3 items
-rw-r--r-- 2 hadoop supergroup 0 2012-05-09 11:22 /user/hadoop/output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2012-05-09 11:22 /user/hadoop/output/_logs
-rw-r--r-- 2 hadoop supergroup 41 2012-05-09 11:22 /user/hadoop/output/part-00000
7.g: Check the output:
[hadoop@host1 ~]$ hadoop dfs -cat /user/hadoop/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
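To pull the result back to the local filesystem, -getmerge (or -copyToLocal) can be used; the local target path here is just an example:
[hadoop@host1 ~]$ hadoop dfs -getmerge output /tmp/wordcount-result.txt
[hadoop@host1 ~]$ cat /tmp/wordcount-result.txt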
This completes the quick setup of a two-node Hadoop cluster.
While setting this up and learning, I referred to the following links. My thanks and regards for all the help:
For single node setup: http://hadoop.apache.org/common/docs/current/single_node_setup.html#Local
For cluster setup: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://hortonworks.com/blog/set-up-apache-hadoop-in-minutes-with-rpms/