Monday, April 22, 2013

MapReduce logic

The number of map tasks is based on the number of input splits.
The default split size is the HDFS block size, so that data is partitioned at block boundaries.
A job is split into exactly as many map tasks as there are splits.
Hence the number of map tasks cannot be controlled directly, only by changing the number of splits, e.g. as sketched below.
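A minimal sketch of steering the split count from a job driver (the old Hadoop 1.x property name is assumed here; raising the minimum split size yields fewer, larger splits and therefore fewer map tasks):

    Configuration conf = new Configuration(); // org.apache.hadoop.conf.Configuration
    // Assumed 1.x-style property: never create a split smaller than 128 MB,
    // so a 512 MB file produces 4 map tasks instead of 8 with a 64 MB block size.
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);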

The maximum number of map tasks that run concurrently on a node is 2 by default.
This can be changed per node by setting the parameter mapred.tasktracker.map.tasks.maximum.

Hence if you have a 4-core machine you can force Hadoop to run more than 2 map tasks on a node.
This is particularly applicable if you are not running any reduce tasks.
The number of reduce tasks can be set to zero with job.setNumReduceTasks(0); a map-only driver sketch follows.
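A minimal map-only driver sketch (Hadoop 1.x-era new API; the built-in identity Mapper and args-based paths are just there to keep it self-contained):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "map-only");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(Mapper.class); // base Mapper passes records through unchanged
            job.setNumReduceTasks(0);         // map-only: mapper output is written straight to HDFS
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }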

Also, if one node in your cluster is a VM with a single core, you can set the maximum
number of map tasks run on it to 1 or 2 by again setting mapred.tasktracker.map.tasks.maximum
in the mapred-site.xml of that node.
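For example, in that node's mapred-site.xml (the value 1 here is just an illustration for a single-core VM):

    <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>1</value>
    </property>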

The default task timeout (the time after which the framework kills a task that neither reads input, writes output, nor reports progress) is 600 s. This can be changed by adding
    <property>
          <name>mapred.task.timeout</name>
          <value>3600000</value> <!--1hr -->
    </property>

to mapred-site.xml


JSON custom and conditional deserialization

Problem: need to conditionally deserialize a field as either a string or an object, depending on the incoming JSON feed:

    @JsonDeserialize(using = LocationDeserializer.class)
    public void setLocation(Location location) { ... }


And the custom deserializer looks like:

// Jackson 1.x (org.codehaus.jackson) imports assumed
import java.io.IOException;

import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.DeserializationContext;
import org.codehaus.jackson.map.JsonDeserializer;
import org.codehaus.jackson.map.ObjectMapper;

public class LocationDeserializer extends JsonDeserializer<Location> {

    @Override
    public Location deserialize(JsonParser jp, DeserializationContext ctxt)
            throws IOException, JsonProcessingException {
        // Fresh mapper that reuses the surrounding deserialization config
        ObjectMapper mapper = new ObjectMapper();
        mapper.setDeserializationConfig(ctxt.getConfig());
        jp.setCodec(mapper);
        Location location;
        if (jp.getCurrentToken() == JsonToken.VALUE_STRING) {
            // The feed sent a plain string: treat it as the city name
            String city = jp.readValueAs(String.class);
            location = new Location();
            location.setCity(city);
        } else {
            // The feed sent a full object: bind it to Location directly
            location = jp.readValueAs(Location.class);
        }
        return location;
    }
}
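To illustrate, both input shapes now bind through the same setter (the Person bean owning setLocation and the sample city are assumptions for this sketch):

    ObjectMapper mapper = new ObjectMapper();
    // string form - the deserializer wraps the value into Location.city
    Person a = mapper.readValue("{\"location\":\"Bangalore\"}", Person.class);
    // object form - the value is bound to Location field by field
    Person b = mapper.readValue("{\"location\":{\"city\":\"Bangalore\"}}", Person.class);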




Monday, April 8, 2013

HBase

start HBase using - start-hbase.sh
get the HBase shell - bin/hbase shell
hbase(main):001:0> status
1 servers, 0 dead, 5.0000 average load

hbase(main):002:0> create 'socialdata', 'connection','feeds','personal'
0 row(s) in 0.2930 seconds
Creates a table named socialdata with the column families connection, feeds, and personal.
hbase(main):004:0> list
TABLE
socialdata
test
testtable
3 row(s) in 0.0380 seconds
hbase(main):009:0> scan 'socialdata'
displays the contents of the table
disable 'socialdata' followed by drop 'socialdata' - removes the table (a table must be disabled before it can be dropped)
exit - exit the HBase shell

Deleting from HBase:
- delete a single column (cell) of a row - delete 'table','rowkey','columnfamily:qualifier'
- delete all contents of a row - deleteall 'socialdata','facebook:123123123'
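For example, to remove one cell of the row above (personal:city is an assumed example column from the socialdata table created earlier):

deleteall first removes the whole row; a single cell goes with - delete 'socialdata','facebook:123123123','personal:city'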
In order to configure a MapReduce job against a remote HBase cluster, the configuration looks like:

conf = HBaseConfiguration.create(conf);
conf.set(HConstants.ZOOKEEPER_QUORUM, "athena");
conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2222");

Server configuration

hbase-site.xml

    <property>
          <name>hbase.rootdir</name>
          <value>hdfs://******:54310/hbase/sprout</value>
    </property>
    <property>
          <name>hbase.zookeeper.property.clientPort</name>
          <value>2222</value>
    </property>
    <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
    </property>
    <property>
          <name>hbase.zookeeper.quorum</name>
          <value>athena</value>
    </property>
    <property>
          <name>hbase.zookeeper.property.dataDir</name>
          <value>/home/*****/Deploy/zookeeper</value>
    </property>
HBase needs ZooKeeper running to manage the cluster of masters and slaves.
Also sync the system time across the machines - the time difference between nodes shouldn't be more than half a minute.
On Ubuntu this can be done with ntpdate.
Inside hbase-env.sh set - export HBASE_MANAGES_ZK=true
to tell HBase to manage its own ZooKeeper ensemble, and specify the ZooKeeper properties in hbase-site.xml.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.

When connecting from an HBase client, this ZooKeeper quorum is used; in our case the quorum name is athena.


Connecting to HBase programmatically is easier. You need to add the respective hbase-site.xml to your classpath and then call the following statement:
Configuration conf = HBaseConfiguration.create();
That creates the configuration for you; a fuller sketch follows.
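A minimal end-to-end read sketch (HBase 0.9x-era client API assumed; the personal:city column is an example placeholder, the table and row key come from the shell session above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
            HTable table = new HTable(conf, "socialdata");
            try {
                Get get = new Get(Bytes.toBytes("facebook:123123123"));
                Result result = table.get(get);
                // personal:city is an assumed example column of the socialdata table
                byte[] city = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("city"));
                System.out.println(city == null ? "no value" : Bytes.toString(city));
            } finally {
                table.close();
            }
        }
    }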
Piping output from the HBase shell to a text file:
echo "get '*****data','facebook:123123123'" | hbase shell > test