Monday, April 22, 2013

MapReduce logic

The number of map tasks is based on the number of input splits.
The default split size is the HDFS block size, so that data is partitioned at block boundaries.
A job is split into exactly as many map tasks as there are splits.
Hence the number of map tasks cannot be controlled directly, only by changing the number of splits, e.g. as sketched below.
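A minimal sketch of steering the split count from a job driver (the old Hadoop 1.x property name is assumed here; raising the minimum split size yields fewer, larger splits and therefore fewer map tasks):

    Configuration conf = new Configuration(); // org.apache.hadoop.conf.Configuration
    // Assumed 1.x-style property: never create a split smaller than 128 MB,
    // so a 512 MB file produces 4 map tasks instead of 8 with a 64 MB block size.
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);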

The maximum number of map tasks that run concurrently on a node is 2 by default.
This can be changed per node by setting the parameter mapred.tasktracker.map.tasks.maximum.

Hence if you have a 4-core machine you can force Hadoop to run more than 2 map tasks on a node.
This is particularly applicable if you are not running any reduce tasks.
The number of reduce tasks can be set to zero with job.setNumReduceTasks(0); a map-only driver sketch follows.
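A minimal map-only driver sketch (Hadoop 1.x-era new API; the built-in identity Mapper and args-based paths are just there to keep it self-contained):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "map-only");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(Mapper.class); // base Mapper passes records through unchanged
            job.setNumReduceTasks(0);         // map-only: mapper output is written straight to HDFS
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }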

Also, if one node in your cluster is a VM with a single core, you can set the maximum
number of map tasks run on it to 1 or 2 by again setting mapred.tasktracker.map.tasks.maximum
in the mapred-site.xml of that node.
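For example, in that node's mapred-site.xml (the value 1 here is just an illustration for a single-core VM):

    <property>
          <name>mapred.tasktracker.map.tasks.maximum</name>
          <value>1</value>
    </property>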

The default task timeout (the time after which the framework kills a task that neither reads input, writes output, nor reports progress) is 600 s. This can be changed by adding
    <property>
          <name>mapred.task.timeout</name>
          <value>3600000</value> <!--1hr -->
    </property>

to mapred-site.xml


JSON custom and conditional deserialization

Problem: need to conditionally deserialize a field as either a string or an object, depending on the incoming JSON feed:

    @JsonDeserialize(using = LocationDeserializer.class)
    public void setLocation(Location location) { ... }


And the custom deserializer looks like:

// Jackson 1.x (org.codehaus.jackson) imports assumed
import java.io.IOException;

import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.DeserializationContext;
import org.codehaus.jackson.map.JsonDeserializer;
import org.codehaus.jackson.map.ObjectMapper;

public class LocationDeserializer extends JsonDeserializer<Location> {

    @Override
    public Location deserialize(JsonParser jp, DeserializationContext ctxt)
            throws IOException, JsonProcessingException {
        // Fresh mapper that reuses the surrounding deserialization config
        ObjectMapper mapper = new ObjectMapper();
        mapper.setDeserializationConfig(ctxt.getConfig());
        jp.setCodec(mapper);
        Location location;
        if (jp.getCurrentToken() == JsonToken.VALUE_STRING) {
            // The feed sent a plain string: treat it as the city name
            String city = jp.readValueAs(String.class);
            location = new Location();
            location.setCity(city);
        } else {
            // The feed sent a full object: bind it to Location directly
            location = jp.readValueAs(Location.class);
        }
        return location;
    }
}
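To illustrate, both input shapes now bind through the same setter (the Person bean owning setLocation and the sample city are assumptions for this sketch):

    ObjectMapper mapper = new ObjectMapper();
    // string form - the deserializer wraps the value into Location.city
    Person a = mapper.readValue("{\"location\":\"Bangalore\"}", Person.class);
    // object form - the value is bound to Location field by field
    Person b = mapper.readValue("{\"location\":{\"city\":\"Bangalore\"}}", Person.class);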




Monday, April 8, 2013

HBase

start HBase using - start-hbase.sh
get the HBase shell - bin/hbase shell
hbase(main):001:0> status
1 servers, 0 dead, 5.0000 average load

hbase(main):002:0> create 'socialdata', 'connection','feeds','personal'
0 row(s) in 0.2930 seconds
Creates a table named socialdata with the column families connection, feeds, and personal.
hbase(main):004:0> list
TABLE
socialdata
test
testtable
3 row(s) in 0.0380 seconds
hbase(main):009:0> scan 'socialdata'
displays the contents of the table
disable 'socialdata' followed by drop 'socialdata' - removes the table (a table must be disabled before it can be dropped)
exit - exit the HBase shell

Deleting from HBase:
- delete a single column (cell) of a row - delete 'table','rowkey','columnfamily:qualifier'
- delete all contents of a row - deleteall 'socialdata','facebook:123123123'
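For example, to remove one cell of the row above (personal:city is an assumed example column from the socialdata table created earlier):

deleteall first removes the whole row; a single cell goes with - delete 'socialdata','facebook:123123123','personal:city'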
In order to configure a MapReduce job against a remote HBase cluster, the configuration looks like:

conf = HBaseConfiguration.create(conf);
conf.set(HConstants.ZOOKEEPER_QUORUM, "athena");
conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2222");

Server configuration

hbase-site.xml

    <property>
          <name>hbase.rootdir</name>
          <value>hdfs://******:54310/hbase/sprout</value>
    </property>
    <property>
          <name>hbase.zookeeper.property.clientPort</name>
          <value>2222</value>
    </property>
    <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
    </property>
    <property>
          <name>hbase.zookeeper.quorum</name>
          <value>athena</value>
    </property>
    <property>
          <name>hbase.zookeeper.property.dataDir</name>
          <value>/home/*****/Deploy/zookeeper</value>
    </property>
HBase needs ZooKeeper running to manage the cluster of masters and slaves.
Also sync the system time across the machines - the time difference between nodes shouldn't be more than half a minute.
On Ubuntu this can be done with ntpdate.
Inside hbase-env.sh set - export HBASE_MANAGES_ZK=true
to tell HBase to manage its own ZooKeeper ensemble, and specify the ZooKeeper properties in hbase-site.xml.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.

When connecting from an HBase client, this ZooKeeper quorum is used; in our case the quorum name is athena.


Connecting to HBase programmatically is easier. You need to add the respective hbase-site.xml to your classpath and then call the following statement:
Configuration conf = HBaseConfiguration.create();
That creates the configuration for you; a fuller sketch follows.
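A minimal end-to-end read sketch (HBase 0.9x-era client API assumed; the personal:city column is an example placeholder, the table and row key come from the shell session above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
            HTable table = new HTable(conf, "socialdata");
            try {
                Get get = new Get(Bytes.toBytes("facebook:123123123"));
                Result result = table.get(get);
                // personal:city is an assumed example column of the socialdata table
                byte[] city = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("city"));
                System.out.println(city == null ? "no value" : Bytes.toString(city));
            } finally {
                table.close();
            }
        }
    }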
Piping output from the HBase shell to a text file:
echo "get '*****data','facebook:123123123'" | hbase shell > test