Saturday, August 24, 2013

Installing CDH4

Download Java from the Oracle site

Install it.

It will typically install to
/usr/java/

Now use alternatives to point java to the Oracle JDK:
alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_45/bin/java 1600

Update the user's ~/.bash_profile to set JAVA_HOME:
export JAVA_HOME=/usr/java/jdk1.6.0_45

Do this on all the nodes where Hadoop runs.
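A quick way to verify the setup on each node (assuming the same JDK path as above):

source ~/.bash_profile
echo $JAVA_HOME               # should print /usr/java/jdk1.6.0_45
$JAVA_HOME/bin/java -version  # should report the Oracle JDK
alternatives --display java   # confirms which java the alternatives link points to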

Add the Cloudera repo
sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
 
Install Jobtracker
sudo yum install hadoop-0.20-mapreduce-jobtracker
  
This installs files to the following directories:
 
logs -  /var/log/hadoop-0.20-mapreduce/
libs - /usr/lib/hadoop/ , /usr/lib/hadoop-0.20-mapreduce/
service daemon -  /etc/rc.d/init.d/hadoop-0.20-mapreduce-jobtracker
configs - /etc/hadoop , /etc/default/hadoop

Install Namenode
sudo yum install hadoop-hdfs-namenode
 
service daemon - /etc/rc.d/init.d/hadoop-hdfs-namenode

script - /usr/lib/hadoop-hdfs/sbin/refresh-namenodes.sh

Install Secondarynamenode
Ideally this should be on a machine different from the master.
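The install command follows the same pattern as the other daemons; in the CDH4 packaging the package should be named hadoop-hdfs-secondarynamenode:

sudo yum install hadoop-hdfs-secondarynamenode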

On the slave nodes, install:
sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode

daemon service - /etc/rc.d/init.d/hadoop-0.20-mapreduce-tasktracker
libraries - /usr/lib/hadoop-hdfs/ , /usr/lib/hadoop-0.20-mapreduce/ , /usr/lib/hadoop-0.20-mapreduce/contrib/

others:
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop-config.sh
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemons.sh
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemon.sh


After this, follow the instructions at
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_4.html#../CDH4-Installation-Guide/cdh4ig_topic_11_2.html?scroll=topic_11_2_2_unique_1
step by step.
 

First configure HDFS on the system, then configure the jobtracker/tasktracker daemons.

One point to note is that the namenode and datanode directories need to be owned by hdfs:hdfs, and the mapred local directory should be owned by mapred:hadoop.
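As a sketch, assuming the example datanode layout used later in this post plus a /data/1/dfs/nn namenode directory and a /data/1/mapred/local mapred local directory (those two paths are just illustrations, adjust to whatever you configured):

sudo chown -R hdfs:hdfs /data/1/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn
sudo chown -R mapred:hadoop /data/1/mapred/local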



When configuring the directories for the namenodes and datanodes, there are a few points to take care of:

  • Namenode directories store the metadata and edit logs
  • Datanode directories store the block pools

You can also specify multiple entries for the namenode's metadata directory; the second one can ideally be a high-performance NFS mount, so that if the local disk fails the metadata still survives on the NFS drive.
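A sketch of that in hdfs-site.xml, assuming /data/1/dfs/nn is the local disk and /mnt/nfs/dfs/nn is the NFS mount (both paths are only examples):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/1/dfs/nn,file:///mnt/nfs/dfs/nn</value>
</property>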


You can configure multiple volumes for the datanode, e.g.:
 dfs.datanode.data.dir
 /data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn

So if /data/1/dfs/dn fails, the datanode can carry on with the remaining volumes.
You can also set the tolerance for failed volumes with the
dfs.datanode.failed.volumes.tolerated parameter in hdfs-site.xml.
This is the number of volumes that may fail before the datanode stops offering service; by default any single volume failure shuts the datanode down. Hence in the above example we can set it to 2, so that it returns an error if and only if it fails to write to all of
/data/1/dfs/dn, /data/2/dfs/dn and /data/3/dfs/dn.
 
If multiple volumes are not configured, dfs.datanode.failed.volumes.tolerated should ideally stay at 0 and need not be set at all.
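Putting the two datanode settings together, a sketch for hdfs-site.xml with the three example volumes above:

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn</value>
</property>
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>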


scp the configuration to various host nodes.
sudo scp -r /etc/hadoop/conf.test_cluster/* testuser@test.com:/etc/hadoop/conf.test_cluster/
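On each node, the copied directory then has to be made the active Hadoop configuration. Per the Cloudera install guide linked above, this is done with alternatives; a sketch (the priority value 50 is just an example):

sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.test_cluster 50
sudo alternatives --set hadoop-conf /etc/hadoop/conf.test_cluster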

Starting all services in Cloudera

The first time you use Hadoop, format HDFS. The command is:
 
sudo -u hdfs hadoop namenode -format
We run the format as the hdfs user, which is important.


for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
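Once HDFS is up, the MapReduce daemons can be started with the same pattern, since their service names start with hadoop-0.20-mapreduce (as noted further down):

for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done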

Some FAQs and common issues

Whenever you get the error below while trying to start a DN on a slave machine:
>  java.io.IOException: Incompatible clusterIDs in /home/hadoop/dfs/data: namenode clusterID
= ****; datanode clusterID = ****
>
> It is because after you set up your cluster, you, for whatever reason, decided to reformat
your NN. Your DNs on slaves still bear reference to the old NN. To resolve this simply delete
and recreate data folder on that machine in local Linux FS, namely /home/hadoop/dfs/data.
Restarting that DN's daemon on that machine will recreate data/ folder's content and resolve
the problem.
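On the affected slave that would look roughly like this; the data directory here is the one from the error message above, so substitute your own dfs.datanode.data.dir value:

sudo service hadoop-hdfs-datanode stop
sudo rm -rf /home/hadoop/dfs/data/*    # wipe the stale block data so the DN re-registers with the new clusterID
sudo service hadoop-hdfs-datanode start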


In CDH4 every Hadoop daemon is configured as a service.
The HDFS daemon service names start with hadoop-hdfs,
while the MapReduce daemon service names start with hadoop-0.20-mapreduce.


Often the firewall can block access between the nodes. We may have to add rules to open the required ports to overcome this.



Daemon                 | Default Port | Configuration Parameter
HDFS Namenode          | 50070        | dfs.http.address
Datanodes              | 50075        | dfs.datanode.http.address
Secondarynamenode      | 50090        | dfs.secondary.http.address
Backup/Checkpoint node | 50105        | dfs.backup.http.address
MR Jobtracker          | 50030        | mapred.job.tracker.http.address
Tasktrackers           | 50060        | mapred.task.tracker.http.address

Daemon      | Default Port    | Configuration Parameter            | Protocol                                                           | Used for
Namenode    | 8020            | fs.default.name (1)                | IPC: ClientProtocol                                                | Filesystem metadata operations
Datanode    | 50010           | dfs.datanode.address               | Custom Hadoop Xceiver: DataNode and DFSClient                      | DFS data transfer
Datanode    | 50020           | dfs.datanode.ipc.address           | IPC: InterDatanodeProtocol, ClientDatanodeProtocol, ClientProtocol | Block metadata operations and recovery
Backupnode  | 50100           | dfs.backup.address                 | Same as namenode                                                   | HDFS metadata operations
Jobtracker  | Ill-defined (2) | mapred.job.tracker                 | IPC: JobSubmissionProtocol, InterTrackerProtocol                   | Job submission, task tracker heartbeats
Tasktracker | 127.0.0.1:0 (3) | mapred.task.tracker.report.address | IPC: TaskUmbilicalProtocol                                         | Communicating with child jobs

(1) This is the port part of hdfs://host:8020/.
(2) Default is not well-defined. Common values are 8021, 9001, or 8012. See MAPREDUCE-566.
(3) Binds to an unused local port.


The above info is taken from the Cloudera blog:
http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/

We may have to write iptables rules to allow these ports.

Ideally you will have to add these rules just above the line
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

that is

ACCEPT     all  --  hdnode1.test.com  anywhere           
ACCEPT     all  --  hdnode2.test.com  anywhere
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited 

In my case the OS is CentOS and the default INPUT chain ends with
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

Hence I added the ACCEPT rules just before it. This is OS specific; the point is simply that the namenode should accept communication on the ports defined for the Hadoop services.
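As a rough sketch on CentOS, the rules can be inserted just above the final REJECT and then saved. The rule position (5 here) and hostnames are only examples, so check your own chain first with --line-numbers:

sudo iptables -L INPUT --line-numbers              # find the position of the REJECT rule
sudo iptables -I INPUT 5 -s hdnode1.test.com -j ACCEPT
sudo iptables -I INPUT 5 -s hdnode2.test.com -j ACCEPT
sudo service iptables save                         # persist the rules across reboots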

Now, on checking the HDFS status at http://<namenode-host>:50070, it shows that
the service is available and the declared Hadoop datanodes are LIVE.
 

In some cases you will see errors like:

2013-10-01 12:43:02,084 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1574577678-***************-1377251137260 (storage id DS-1585458778-***************-50010-1379678037696) service to ***.***.***.com/***************:8020 beginning handshake with NN
2013-10-01 12:43:02,097 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1574577678-***************-1377251137260 (storage id DS-1585458778-***************-50010-1379678037696) service to ***.***.***.com/***************:8020
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException): Datanode denied communication with namenode: DatanodeRegistration(0.0.0.0, storageID=DS-1585458778-***************-50010-1379678037696, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=CID-ac14694a-bacb-4180-a747-464778d2d382;nsid=680798099;c=0)
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:648)

This happened for us when we were using DNS for hostname and IP resolution rather than adding the individual nodes to the /etc/hosts file.
We found out that on the NN we needed to add the hostname/IP details of all slave nodes to /etc/hosts.
On each slave we need to add entries only for the NN and that slave itself.
Also, manually configure a dfs.hosts include file (listing the allowed datanodes) and add its full path to hdfs-site.xml; a sketch follows below.
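A sketch of what this ends up looking like (hostnames reused from the firewall example above; the IP addresses and the dfs_hosts_allow path are made up, so substitute your own):

/etc/hosts on the NN:
192.168.1.11   hdnode1.test.com   hdnode1
192.168.1.12   hdnode2.test.com   hdnode2

/etc/hadoop/conf/dfs_hosts_allow (one datanode hostname per line):
hdnode1.test.com
hdnode2.test.com

and in hdfs-site.xml:
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs_hosts_allow</value>
</property>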

After all of the above, try restarting the cluster. It should work now.

The error basically occurs when the NN or a slave fails to resolve a hostname. For us, the names resolved fine from the terminal, but resolution failed when running HDFS.

Even though we were not able to pinpoint the exact troublemaker, the above steps rectified it.
