Monday, April 22, 2013

Map-reduce logic

The number of map tasks is based on the number of input splits.
The default split size equals the HDFS block size, so data is partitioned at block boundaries.
A job is divided into as many map tasks as there are splits.
Hence the number of map tasks cannot be controlled directly, except by changing the number of splits.
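
The split math can be sketched roughly as one split per block. The helper below is a hypothetical illustration (not the actual FileInputFormat code); the 64 MB block size is the Hadoop 1.x default:

```java
// Hypothetical sketch of the default split math: with split size equal to
// the HDFS block size, a file yields roughly one map task per block.
public class SplitMath {
    // Number of splits (and hence map tasks) for a file of the given size,
    // computed with ceiling division.
    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        if (fileSizeBytes == 0) return 0;
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, the Hadoop 1.x default
        // A 200 MB file spans 4 blocks, so it gets 4 map tasks.
        System.out.println(numSplits(200L * 1024 * 1024, blockSize));
    }
}
```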

The maximum number of map tasks that run concurrently on a node is 2 by default.
This can be changed per node by setting the parameter mapred.tasktracker.map.tasks.maximum

Hence if you have a 4-core machine you can force Hadoop to run more than 2 map tasks on a node.
This is especially useful if you are not running any reduce tasks.
The number of reduce tasks can be set to zero by calling job.setNumReduceTasks(0);
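
A map-only job driver might look like the sketch below (using the org.apache.hadoop.mapreduce API from Hadoop 1.x; MyMapper and the input/output paths are placeholders, not from this post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "map-only example"); // Hadoop 1.x style constructor
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(MyMapper.class);  // MyMapper is a placeholder mapper class
        job.setNumReduceTasks(0);            // no reducers: map output is written straight to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With zero reduce tasks, the shuffle and sort phases are skipped entirely, so each mapper's output goes directly to the output directory.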

Also, if one node in your cluster is a VM with a single core, you can limit the max
map tasks run on it to 1 or 2 by again setting mapred.tasktracker.map.tasks.maximum
in the mapred-site.xml of that node.
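
For example, the mapred-site.xml of that single-core node could contain:

    <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>1</value>
    </property>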

The default timeout after which a task that reports no progress is killed is 600s (600000 ms). This can be changed by adding
    <property>
          <name>mapred.task.timeout</name>
          <value>3600000</value> <!--1hr -->
    </property>

to mapred-site.xml

