Saturday, August 24, 2013

Installing CDH4

Download Java from the Oracle site and install it.

It will typically install to
/usr/java/

Now use alternatives to point java to the Oracle JDK:
alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_45/bin/java 1600

Update the user's ~/.bash_profile so that JAVA_HOME points to the JDK directory (not the java binary):
export JAVA_HOME=/usr/java/jdk1.6.0_45

Do this on all the nodes where Hadoop runs.

Add the Cloudera repo
sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
 
Install Jobtracker
sudo yum install hadoop-0.20-mapreduce-jobtracker
  
This installs files into the following directories:
 
logs -  /var/log/hadoop-0.20-mapreduce/
libs - /usr/lib/hadoop/ , /usr/lib/hadoop-0.20-mapreduce/
service daemon -  /etc/rc.d/init.d/hadoop-0.20-mapreduce-jobtracker
configs - /etc/hadoop , /etc/default/hadoop

Install Namenode
sudo yum install hadoop-hdfs-namenode
 
service daemon - /etc/rc.d/init.d/hadoop-hdfs-namenode

script - /usr/lib/hadoop-hdfs/sbin/refresh-namenodes.sh

Install the Secondary Namenode
Ideally this should be on a machine different from the master (namenode).
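The install command follows the same pattern as the other daemons (package name as per the CDH4 repo):

sudo yum install hadoop-hdfs-secondarynamenode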

On the slave nodes, install:
sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode

daemon service - /etc/rc.d/init.d/hadoop-0.20-mapreduce-tasktracker
libraries - /usr/lib/hadoop-hdfs/, /usr/lib/hadoop-0.20-mapreduce/, /usr/lib/hadoop-0.20-mapreduce/contrib/

others:
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop-config.sh
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemons.sh
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop
/usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemon.sh


After this, follow the instructions at
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_4.html#../CDH4-Installation-Guide/cdh4ig_topic_11_2.html?scroll=topic_11_2_2_unique_1
step by step.
 

First, configure HDFS on the system and then configure the jobtracker/tasktracker daemons.

One point to note: the namenode and datanode directories must be owned by hdfs:hdfs, and the MapReduce local directory must be owned by mapred:hadoop.
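For example (the directory paths here are only illustrative; use your actual namenode, datanode and MapReduce local directories):

sudo chown -R hdfs:hdfs /data/1/dfs/nn /data/1/dfs/dn
sudo chown -R mapred:hadoop /data/1/mapred/local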



When configuring the directories for the namenode and datanodes there are a few points to take care of:

  • Namenode directories store the metadata and edit logs.
  • Datanode directories store the block pools.

You can also specify multiple entries for the namenode's metadata directory. The second one can ideally be a high-performance NFS mount, so that if the local disk fails the metadata still survives on the NFS drive.
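A sketch of such an hdfs-site.xml entry (the NFS mount path is an assumption):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/dfs/nn,/nfsmount/dfs/nn</value>
</property>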


You can configure multiple volumes for the datanode in hdfs-site.xml, for example:

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
</property>

So if /data/1/dfs/dn fails, it tries the next volume.
You can also control how many failed volumes the datanode tolerates by setting the dfs.datanode.failed.volumes.tolerated parameter in hdfs-site.xml. The datanode reports an error only once more than that number of volumes has failed. With the three volumes above, setting it to 2 means an error is returned only if it fails to write to all of
/data/1/dfs/dn, /data/2/dfs/dn and /data/3/dfs/dn.
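A sketch of the corresponding hdfs-site.xml entry, assuming exactly the three volumes configured above:

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>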
 
If multiple volumes are not configured, this tolerated value should ideally be 0 (the default) and need not be set.


scp the configuration to the various host nodes:
sudo scp -r /etc/hadoop/conf.test_cluster/* testuser@test.com:/etc/hadoop/conf.test_cluster/

Starting all services in Cloudera

The first time you use Hadoop you must format HDFS. The command is:

sudo -u hdfs hadoop namenode -format

It is important that the format is run as the hdfs user.


for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

Some FAQ and common issues

Whenever you get the below error while trying to start a DN on a slave machine:
>  java.io.IOException: Incompatible clusterIDs in /home/hadoop/dfs/data: namenode clusterID
= ****; datanode clusterID = ****
>
> It is because after you set up your cluster, you, for whatever reason, decided to reformat
your NN. Your DNs on slaves still bear reference to the old NN. To resolve this simply delete
and recreate data folder on that machine in local Linux FS, namely /home/hadoop/dfs/data.
Restarting that DN's daemon on that machine will recreate data/ folder's content and resolve
the problem.
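Concretely, a sketch of the fix using the CDH4 service names (the data directory path comes from the error message above; adjust it to your own dfs.datanode.data.dir and the ownership discussed earlier):

sudo service hadoop-hdfs-datanode stop
sudo rm -rf /home/hadoop/dfs/data
sudo mkdir -p /home/hadoop/dfs/data
sudo chown -R hdfs:hdfs /home/hadoop/dfs/data
sudo service hadoop-hdfs-datanode start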


In CDH4 every Hadoop daemon is configured as a service.
The HDFS daemon services start with hadoop-hdfs,
while the MapReduce daemon services start with hadoop-0.20-mapreduce.
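So, similar to the HDFS loop above, the MRv1 daemons installed earlier can be started with a loop based on the same pattern:

for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done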


Often the firewall can block access between the nodes. We may have to add rules to open the required ports to overcome this.



Daemon | Default Port | Configuration Parameter
HDFS Namenode | 50070 | dfs.http.address
Datanodes | 50075 | dfs.datanode.http.address
Secondarynamenode | 50090 | dfs.secondary.http.address
Backup/Checkpoint node | 50105 | dfs.backup.http.address
MR Jobtracker | 50030 | mapred.job.tracker.http.address
Tasktrackers | 50060 | mapred.task.tracker.http.address

Daemon | Default Port | Configuration Parameter | Protocol | Used for
Namenode | 8020 | fs.default.name (1) | IPC: ClientProtocol | Filesystem metadata operations
Datanode | 50010 | dfs.datanode.address | Custom Hadoop Xceiver: DataNode and DFSClient | DFS data transfer
Datanode | 50020 | dfs.datanode.ipc.address | IPC: InterDatanodeProtocol, ClientDatanodeProtocol, ClientProtocol | Block metadata operations and recovery
Backupnode | 50100 | dfs.backup.address | Same as namenode | HDFS metadata operations
Jobtracker | Ill-defined (2) | mapred.job.tracker | IPC: JobSubmissionProtocol, InterTrackerProtocol | Job submission, task tracker heartbeats
Tasktracker | 127.0.0.1:0 (3) | mapred.task.tracker.report.address | IPC: TaskUmbilicalProtocol | Communicating with child jobs

(1) This is the port part of hdfs://host:8020/.
(2) Default is not well-defined. Common values are 8021, 9001, or 8012. See MAPREDUCE-566.
(3) Binds to an unused local port.


The above info is taken from the Cloudera site:
http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/

We may have to write iptables rules to open these ports.

Ideally you will have to add these rules just above the line
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

that is

ACCEPT     all  --  hdnode1.test.com  anywhere           
ACCEPT     all  --  hdnode2.test.com  anywhere
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited 

In my case the OS is CentOS and the default INPUT chain ends with
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

and hence I added the ACCEPT rules just before it. This will be OS specific; the point is simply that the namenode must accept communication on the ports defined for the Hadoop services.
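As a sketch of the iptables commands (the rule position 5 and the hostnames are just examples; check iptables -L INPUT --line-numbers for the right position on your box):

sudo iptables -I INPUT 5 -s hdnode1.test.com -j ACCEPT
sudo iptables -I INPUT 5 -s hdnode2.test.com -j ACCEPT
sudo service iptables save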

Now, on checking the HDFS status page at http://<namenode-host>:50070, it shows that
the service is available and the declared Hadoop datanodes are LIVE.
 

In some cases you will see errors like:

2013-10-01 12:43:02,084 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1574577678-***************-1377251137260 (storage id DS-1585458778-***************-50010-1379678037696) service to ***.***.***.com/***************:8020 beginning handshake with NN
2013-10-01 12:43:02,097 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1574577678-***************-1377251137260 (storage id DS-1585458778-***************-50010-1379678037696) service to ***.***.***.com/***************:8020
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException): Datanode denied communication with namenode: DatanodeRegistration(0.0.0.0, storageID=DS-1585458778-***************-50010-1379678037696, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=CID-ac14694a-bacb-4180-a747-464778d2d382;nsid=680798099;c=0)
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:648)

This happened for us when we were using DNS for hostname and IP resolution rather than adding the individual nodes to the /etc/hosts file.
However, we found out that on the NN we needed to add the hostname/IP details of all slave nodes to /etc/hosts.
On a slave we need to add entries only for the NN and that slave itself.
Also manually configure the dfs.hosts file and add its full path in hdfs-site.xml.
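For example (the file location is an assumption; use whatever path suits your layout), dfs.hosts is a plain text file listing one datanode hostname per line:

hdnode1.test.com
hdnode2.test.com

and hdfs-site.xml points to it:

<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf.test_cluster/dfs.hosts</value>
</property>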

Above all, try restarting the cluster. It should work now.

The error basically occurs when the NN or a slave fails to identify or resolve the hostname. For us, name resolution worked from the terminal, but it failed when running HDFS.

Even though we were not able to pinpoint the exact troublemaker, the above steps rectified it.

Wednesday, August 21, 2013

LVM setup - Some hadoop stories

When creating our first cluster we didn't listen to the advice; we made a number of mistakes:

We assigned the Hadoop data and log directories to subdirectories within the root (/). It ran fine for some months until HDFS got full. HDFS consumed the entire space in (/), leaving the OS to die a slow death.
I had other empty partitions available which I configured as additional values for the dfs.data.dir argument.

But that couldn't save my root fs. Frequent reads and writes by HDFS into (/) fragmented the filesystem so much that a good chunk of the free space became unrecoverable. At last we decided to pull down the cluster.

Learning from the experience, we concluded that there are two problems with the conventional approach to setting up a Hadoop node:

1) We need a mechanism by which we can add extra hard disks to the nodes and grow their storage without affecting the cluster's existing filesystems. We also want to be able to reclaim the allocated space when the cluster is not heavily used. The solution was to go for LVM - Logical Volume Management.

2) Never, ever use the root partition for storing the log and data directories of Hadoop.

With this we delved into setting up LVM on each node of our cluster.

I basically followed the instructions given at
http://www.howtoforge.com/linux_lvm

Some commands to get familiar with for this exercise are:

fdisk

create and display physical volume

pvcreate
pvdisplay

create and display volume group

vgcreate
vgdisplay

create and display logical volume

lvcreate
lvdisplay

mkfs
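Putting those together, a minimal sketch (the device name /dev/sdb1, the volume group name fileserver, the logical volume name media, the mount point and the sizes are assumptions, chosen to match the examples below):

fdisk /dev/sdb                          # create a partition of type 8e (Linux LVM)
pvcreate /dev/sdb1                      # register the partition as a physical volume
vgcreate fileserver /dev/sdb1           # create a volume group on top of it
lvcreate -n media -L 50G fileserver     # carve out a 50GB logical volume
mkfs.ext3 /dev/fileserver/media         # put a filesystem on the logical volume
mkdir -p /media/data1
mount /dev/fileserver/media /media/data1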


Once these logical volumes are created we can dynamically increase and decrease their size by issuing commands like:

lvextend -L80G /dev/fileserver/media
extends the logical volume from its current size to 80GB

and resize the filesystem - 
resize2fs /dev/fileserver/media

For reducing the device, first unmount it and run a filesystem check (resize2fs refuses to shrink a filesystem without a prior check):
umount /media/data1
e2fsck -f /dev/fileserver/media

then shrink the filesystem:
resize2fs /dev/fileserver/media 70G
reduces the fs to 70GB

and finally shrink the logical volume:
lvreduce -L70G /dev/fileserver/media
reduces the size by 10GB, down to 70GB

That way, if our HDFS runs out of storage, we can add disks on the fly and keep growing the logical volumes.
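For instance, when a new disk is added later, a sketch of growing the existing setup (again, /dev/sdc1 is an assumed device name):

pvcreate /dev/sdc1                        # register the new disk partition as a physical volume
vgextend fileserver /dev/sdc1             # add it to the existing volume group
lvextend -L+100G /dev/fileserver/media    # grow the logical volume using the new space
resize2fs /dev/fileserver/media           # grow the filesystem to fill the logical volume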

Friday, August 2, 2013

Hadoop rack configurations

The above script can be used for configuring rack awareness in Hadoop.
Refer to the blog at http://huntingcheetah.wordpress.com/2013/02/21/configuring-rack-awareness-in-hadoop/

The code there needs a small tweak, which has been done as shown in the image.

The requirement was that one of the nodes had limited storage and could not hold more data. Hence we did a hack whereby we grouped the machines into racks so that data is replicated across the racks mentioned, thus reducing the chance of data landing on the node with less storage.
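As a sketch, a rack topology script of this kind could look like the following (this is not the exact script from the referenced blog; the hostnames and rack names are illustrative and must be edited for your cluster). Hadoop passes one or more hostnames/IPs as arguments and expects one rack path per argument on stdout; the script is registered by pointing the topology.script.file.name property (net.topology.script.file.name in newer Hadoop 2 releases) in core-site.xml at its full path.

#!/bin/bash
# Map each host passed by Hadoop to a rack; unknown hosts fall back to /default-rack.
while [ $# -gt 0 ]; do
  case "$1" in
    hdnode1.test.com) echo -n "/rack1 " ;;
    hdnode2.test.com) echo -n "/rack2 " ;;
    *)                echo -n "/default-rack " ;;
  esac
  shift
done
echo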

Apart from that, we also added additional directories on one of the nodes by modifying that node's hdfs-site.xml and adding them to dfs.data.dir.