Monday, December 3, 2012

Hadoop with Spring Data and Spring Batch

This is a beta post :)

Following these blogs - http://www.petrikainulainen.net/programming/apache-hadoop/creating-hadoop-mapreduce-job-with-spring-data-apache-hadoop/
http://bighadoop.wordpress.com/2012/03/25/spring-data-apache-hadoop/
http://static.springsource.org/spring-hadoop/docs/current/reference/html/index.html
My use case:
I needed to run a Hadoop job triggered from my application. For this I could have called the Hadoop job directly from my application. However, my framework relies heavily on Spring, and keeping to Spring best practices I have integrated everything via Spring.
Check this for an overview of the features provided by the Spring-Hadoop integration - http://www.springsource.org/spring-data/hadoop
So the BigData integration is also done via spring-data-hadoop.
Add the following to your POM:
<dependencies>
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop</artifactId>
        <version>1.0.0.RC1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.0.0</version>
    </dependency>
</dependencies>

<repositories>
    <repository>
        <id>repository.springsource.milestone</id>
        <name>SpringSource Milestone Repository</name>
        <url>http://repo.springsource.org/milestone</url>
    </repository>
</repositories>

As mentioned in the spring-data-hadoop documentation, this is built for Hadoop versions above 0.20. So if you have been writing your map-reduce programs by implementing the Mapper interface from the mapred package, they may not work with spring-data-hadoop. My sample map-reduce program was written that way, so I had to change the implementation to extend the Mapper class from the mapreduce package.
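For reference, here is a minimal sketch of what a mapper written against the newer mapreduce API looks like. It is a standalone word-count mapper kept only for illustration; the class name is indicative and not my actual code.

// Extend org.apache.hadoop.mapreduce.Mapper (new API) instead of
// implementing org.apache.hadoop.mapred.Mapper (old API).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper2 extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1); the reducer sums the counts
        }
    }
}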

In my case my spring-data application was running under one account and my Hadoop installation was running under a different account. This was done to isolate the complexities of installing and hosting Hadoop. But when I ran my Hadoop job from spring-data in my application I got an exception like:
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied:  access=write
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:199)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:180)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:128)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5212)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5
For info on how to tackle this refer to: http://hadoop.apache.org/docs/r0.20.2/hdfs_permissions_guide.html
When the map-reduce job runs it tries to access files in HDFS. For every access it checks the permissions of the files in HDFS against the account from which you are running the job. So to enable a new user to access HDFS you just need to add the user to the supergroup. For this, create a group (in my case 'hadoop') and add all the users who want to access HDFS to this group:
sudo groupadd hadoop

sudo usermod -a -G hadoop hadoopadmin
sudo usermod -a -G hadoop jobadmin

Here the Hadoop administrator is hadoopadmin
and the user running the job from the spring-data-hadoop application is jobadmin.

Then set the hadoop group as the supergroup by adding the following to hdfs-site.xml and restarting Hadoop/HDFS:

<property>
<name>dfs.permissions.supergroup</name>
<value>hadoop</value>
</property>

The next thing to sort out was mapred.job.tracker.

By setting the tracker, you delegate the mapper/reducer code to run on
that tracker (if you're using the single-node cluster). Most likely on
that tracker you don't have hadoop-examples available in the classpath
and thus the Wordcount$TokenizerMapper is not found. Make sure these
classes are on the classpath or use the jar/jar-by-class attributes and
specify a jar enclosing the class (such as hadoop-examples.jar).

Thus in my case the hdp:configuration looks like:

<hdp:configuration>
fs.default.name=hdfs://hosta:54410
mapred.job.tracker=hosta:54411
</hdp:configuration>

This ensures that the job I run in Spring Hadoop is delegated to the map/reduce runtime at hosta:54411, which is a multi-node Hadoop cluster.

Finally, to run my job on the remote multi-node cluster, my job description looks something like:

<hdp:job id="wordcount-job"
input-path="/sample/test-data.txt" output-path="/sample/out2"
mapper="xxx.social.mine.mapr.WordCount2.TokenizerMapper2"
reducer="xxx.social.mine.mapr.WordCount2.IntSumReducer2"
input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
jar-by-class="xxxx.social.mine.mapr.WordCount2"
jar="classpath:xxx-job.jar"
key="org.apache.hadoop.io.Text"
value="org.apache.hadoop.io.IntWritable"
/>

And I have added the job as a jar 'xxx-job.jar' on my classpath. This is what jar="classpath:xxx-job.jar" refers to :)
and that should solve your problems....

But if we need to write relevant map-reduce code, then a simple HDFS file read and word count is not enough. We need much more, like the DBInputFormat that is used for reading from a DB from within Hadoop code. When we need to integrate such code using DBInputFormat, the TextInputFormat, which is the default and only support provided by Spring Data, becomes a pain point. However, thanks to Spring's flexibility, we have alternative options:

Create a Hadoop Job in code and add it to the JobRunner. But in that case issues like getting the jar onto the classpath need to be handled by us (a rough sketch of this approach follows).
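This is only a hypothetical sketch of that first option, building the Job with the plain Hadoop API instead of <hdp:job/>. The host names, JDBC URL, table, columns and the SocialRecord/DBMapper classes are placeholders, not my actual application code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DbJobByCode {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://hosta:54410");   // NameNode of the remote cluster
        conf.set("mapred.job.tracker", "hosta:54411");       // JobTracker of the remote cluster
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://dbhost:5432/testdb", "dbuser", "dbpass");

        Job job = new Job(conf, "db-wordcount");
        // We are responsible for shipping the jar that encloses our classes.
        job.setJarByClass(DbJobByCode.class);
        job.setInputFormatClass(DBInputFormat.class);
        // SocialRecord and DBMapper stand in for the application's own classes.
        DBInputFormat.setInput(job, SocialRecord.class, "social_table",
                null, "id", new String[] {"col1", "col2"});
        job.setMapperClass(DBMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/sample/out2"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}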

A second and more efficient approach is to let Spring build the Job object for you, but then modify the Job object to add support for DBInputFormat. In this case I came across some issues: the spring-hadoop.xsd insists that you set input-path and output-path, which are mandatory for file input. However, for DB reading these can be ignored, so I had to change the XSD to remove the required attribute for them.

After a little research I found that Spring Data was primarily built for later versions of Hadoop. Hadoop 1.0.0 lacks a DBInputFormat in the mapreduce package, which is the approach going forward since the DBInputFormat within the mapred package is deprecated. Moreover, spring-data-hadoop has been built for later versions, those using the mapreduce package. So I will need to migrate to Hadoop 1.1.0.
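For context, DBInputFormat reads each row into a class that implements both Writable and DBWritable. The following is only an assumed sketch of what such a record class (something like the SocialRecord used further down) might look like; the field and column positions are made up.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class SocialRecord implements Writable, DBWritable {

    private long id;
    private String content;

    // Called by DBInputFormat for every row of the query result.
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong(1);
        content = rs.getString(2);
    }

    // Used when a record is written back to a DB (DBOutputFormat).
    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, content);
    }

    // Plain Writable serialization so the record can move between tasks.
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        content = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(content);
    }

    public String getContent() {
        return content;
    }
}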

However, when my map-reduce job started reading from PostgreSQL, I got this error:

java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.postgresql.Driver
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:164)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:723)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.postgresql.Driver
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.getConnection(DBInputFormat.java:190)
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:158)
... 9 more
Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:186)
at org.apache.hadoop.mapreduce.lib.db.DBConfiguration.getConnection(DBConfiguration.java:148)
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.getConnection(DBInputFormat.java:184)
... 10 more

I guessed this was because the node running the job doesn't have postgresql.jar. I may have to put it into the distributed cache so that every node will have the jar.

I solved it by adding the libs attribute:

<hdp:job id="wordcountJob"
output-path="/sample/out2"
jar-by-class="xxx.mine.mapr.WordCount2"
jar="classpath:dbjob-job.jar"
key="org.apache.hadoop.io.LongWritable"
value="xxx.mine.mapr.SocialRecord"
libs="file://repository/postgresql/postgresql/9.0-801.jdbc4/postgresql-9.0-801.jdbc4.jar"
/>

And the test that configures and runs the job looks something like:

// Known imports; SocialRecord and DBMapR are the application's own classes.
import javax.inject.Inject;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.data.hadoop.mapreduce.JobRunner;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = {"/WEB-INF/spring/ApplicationContext.xml"})
public class UpdateJobServiceImplTest {

    @Inject
    JobRunner jobRunner;

    @Inject
    Job wordcountJob;

    @Test
    public void testUpdateMine() {
        System.out.println("Hello!!");

        // Take the Job that Spring built from <hdp:job/> and customise it
        // for DBInputFormat before handing it to the JobRunner.
        Configuration conf = wordcountJob.getConfiguration();

        String[] fields = {"xxxxx", "xxxxxx"};
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://xx.xx.xx.xx:5432/testdb/postgres", "xxxxx", "xxxxx");
        DBInputFormat.setInput(wordcountJob, SocialRecord.class, "xxx", null, "xxxx", fields);
        wordcountJob.setInputFormatClass(DBInputFormat.class);
        wordcountJob.setMapperClass(DBMapR.DBMapper.class);

        try {
            jobRunner.call();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Now it ran successfully!!!!


NB: if the jobtracker is local then the number of map tasks is set to one by default (mapred.map.tasks).
If you accidentally start a Hadoop job, the only way to stop it is to kill the job:
hadoop job -kill job_201212061636_0013

Friday, October 12, 2012

CORS and PSY Gangnam on a Saturday morning...

I need to run a Python script which will serve images. This image endpoint will also accept POSTs which upload images to it. But the challenge is that the POST can come from a different domain.

So my Python script should accept a cross-domain POST as well as GETs for images.

Consider an example with two domains:

  • http://localhost:80/
  • http://localhost:8081/

You have a web page hosted at http://localhost:8081/. From this page you would like to submit data to, or get data from, a server running at http://localhost:80/.

@ http://localhost:8081/

Here you use a normal HTML page and a jQuery script to do the POST.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Insert title here</title>
<script type="text/javascript" src="js/jquery-1.7.2.js"></script>
</head>
<body>
<script type="text/javascript">
$().ready(function(){
   
    console.log("Helooworld");
    $.ajax({
          url: "http://localhost:80/cgi-bin/upload.cgi"
        }).done(function(result) {
          console.log("result = "+result);
        });
    $("[name='submit']").live('click',function(event){
        console.log('clicked');
        event.preventDefault();
        var formdata = new FormData($("form").get()[0]);
        $.ajax({
            url: "http://localhost:80/cgi-bin/upload.cgi",
            type:'POST'    ,
            data:formdata,
            cache: false,
            contentType: false,
            processData: false
        }).done(function(data){
            console.log("success!!");
        });
    });


});
</script>
<form enctype="multipart/form-data" method="post" name="fileinfo">
File name: <input name="file_1" type="file"><br>
File name: <input name="file_2" type="file"><br>
File name: <input name="file_3" type="file"><br>
<input name="submit" type="submit">
</form>
</body>
</html> 

 

Note the FormData object: it is part of the XHR2 spec of HTML5, which allows cross-domain file upload. Just like CORS, this is only supported in the latest browsers.

I have used jQuery for the AJAX submit. Versions of jQuery after 1.5.1 support CORS, so the hassles of CORS are to a large extent handled by jQuery, making the life of the developer much easier. Also, to prevent jQuery from processing the FormData object I have set processData: false, and the content type is also set to false. These are important!

Now this code should be able to submit the files to a server located in another domain. Let's look at the second server now.

@ http://localhost:80/

At my second server I have used a Python CGI script for handling the file upload. Here is a look at the script:

print "Access-Control-Allow-Origin: http://localhost:8080"
print "Content-Type: text/html\n"
print HTML_TEMPLATE % {'SCRIPT_NAME':os.environ['SCRIPT_NAME']}

.

.

.

form = cgi.FieldStorage()
    if not form.has_key(form_field): return
    fileitem = form[form_field]
    if not fileitem.file: return
    fout = file (os.path.join(upload_dir, fileitem.filename), 'wb')

 

 

Here note that I have set the header Access-Control-Allow-Origin: http://localhost:8080 on the response, which allows this CGI to accept as well as serve content for the domain http://localhost:8080. This is specifically for Python; for other technologies, check this on how to set the response header. Thus I was able to remotely upload files to a different domain, as well as get data from there, without the need for complex AJAX file-upload plugins or JSONP. This is one of the signs of the rising power of the client tier.

Now it's Sunday morning, 10.19. The delay, I would say, was due to the /tmp problem. I uploaded my files to /tmp in Fedora and couldn't find them there. Then I came across this blog which said that in Fedora /tmp is mounted with tmpfs.

Anyway, happy coding

Thursday, October 4, 2012

Backbone as simple as Angular

Often people dismiss JavaScript as inferior, and they fear that the more code we write in JS, the bigger the maintenance nightmare will be. So my pursuit is to show how readable the code provided by the new generation of JavaScript frameworks is. I shall mainly be dealing with Backbone, which I came across when searching for a stable and well-proven framework.

 

define([
  'underscore',
  'backbone'
], function(_, Backbone) {
    var Profile = Backbone.Model.extend({
        idAttribute: "accountid",
        urlRoot: '/trip-2/tpp/account/',
        initialize: function(model, options) {
            this.bind("change", this.displaylog);
        },
        parse: function(response) {
            return response.account;
        },
        displaylog: function(model) {
            console.log(model.toJSON());
        }
    });
    return Profile;
});

The above code represents a typical Model object in Backbone. The backbone of Backbone is three standard objects defined by the framework: the Model, the View and the Router.

The View and the Model correspond to the same concepts as in the MVC architecture. The Model is the link between the server-side service and the JavaScript. Somebody asked me whether it is synonymous with the business model on the server side. If you are doing critical business functionality on the client side then of course you can encapsulate it in the Model. But I think the majority of JavaScript apps deal with displaying data and doing small actions like checking, updating and creating, plus view-related actions like drag and drop. Of these, the major things affecting our data are update, save, delete, etc., and it is these update/delete/create operations that are organised into the Model.

The Profile model here is linked to the service '/trip-2/tpp/account/'. During a fetch a GET is fired at this URL, when data is updated a PUT is fired, and when a new profile is added a POST is fired.

This completes our backbone Model.

Now one of the major advantages of using Backbone is its flexibility. It is optional whether we use a Router or a View. These are useful if we want to bring separation of concerns into the JavaScript layer: a separate JS file dealing with DOM manipulation, namely the View, and another JS file handling the model and server communication, namely the Model. You may ask why bother with separation of concerns, since it only seems to increase the number of files involved.

  • It improves code maintainability, which is an important factor when you are writing a complex client-side application.
  • It allows mixing and matching of various technologies.

Thus this particular model object gets its data by calling the service at this link via a GET request.

And when saving the data of this model it calls the same URL, but with a PUT.

And when creating a new model it updates the server side by sending a POST to the same URL. This is what is known as the REST-based convention, where this object corresponds to a resource on the server side and all operations are accessed via a RESTful URL with the actions GET, PUT, POST, DELETE, etc. Backbone expects your application to be RESTful by default, but this is not mandatory: one can change the URLs/format/operations of save, update, delete, etc. by overriding the save, fetch and destroy methods in the model.

Sunday, September 30, 2012

Debugging javascript - some musings

One can use Eclipse Helios/Indigo, which comes with JavaScript debugging support, JSDT. JSDT ships with a built-in JavaScript engine known as Rhino.

Rhino can be used for running and debugging scripts which are not associated with the browser DOM. Thus standalone scripts can be written, tested and debugged using JSDT and the built-in Rhino script engine.

Attaching other script engines is not something I know how to do yet, and I am on the prowl for it.

But the majority of my work depends on browser-side scripting, like developing web applications.

Currently this is handled mostly using:

  • Firebug
  • Mozilla firefox developer tools
  • Chrome developer tools

One disadvantage of such an approach is that these tools are not integrated into my IDE, which is Eclipse. Like any Java-influenced developer, I would expect a full-fledged IDE with source editing and debugging capability.

So what are the new rumblings in this direction? I came by this slide which gives an overview of improvements in this area. Having used remote debugging for cloud-based applications, as in GAE, I know how much it can ease my job. A little reading brought me to the conclusion that in the near future almost all browsers and mobile browsers will be equipped with remote debugging, where an Eclipse IDE could be used for debugging browser-DOM-based scripts, changing the code live and making JavaScript development much easier.

Chrome seems to be progressing the most here with their chromedevtools. One can get a plugin for this from the Eclipse update site. On how to debug a browser remotely from Eclipse using chromedevtools, refer to http://chromedevtools.googlecode.com/svn/wiki/DebuggerTutorial.wiki. Chrome also has a new experimental version of this which allows remote debugging of WebKit-based engines; this is surely going to benefit a lot of developers who build for mobile and tablets.

Apart from Chrome, Firebug, the major cross-browser developer tool, provides an experimental version of remote debugging under the name Crossfire. Crossfire again is an Eclipse plugin, integrated with JSDT, which allows you to put breakpoints in the JavaScript source of an Eclipse web project while Firebug handles the breakpoints correspondingly.

And the last one in this race is the Mozilla remote debugging protocol. As this was still at the specification stage at the time of writing, no implementations were to be found. This is understandable, since Mozilla was the last to enter the browser remote-debugging race, after Chrome and Firebug.


Sunday, July 8, 2012

Mounting NTFS in Fedora 17

In Fedora 17 the mount point for external devices has been changed to /run/media/{username}/{mountpoint}.

Since all runtime data in Fedora 17 is stored in /run, the mount points have changed accordingly.

 

By default, when you mount an NTFS filesystem it is mounted noexec, i.e. you cannot execute anything from the filesystem.

So you should ideally mount it with the exec option. For this, add the following to your fstab; the next time you mount the filesystem, the scripts as well as the binaries on it will be executable.

 

Also don't forget to create the mountpoint folder inside /run/media/{username}.

sudo vi /etc/fstab

/dev/sda7  /run/media/{username}/{Mountpoint}    ntfs    defaults,user,exec  0 0
/dev/sda5  /run/media/{username}/{Mountpoint}    ntfs    defaults,user,exec  0 0

Tuesday, July 3, 2012

Setting up Fedora 17 and an Nvidia Quadro 600

Install Fedora from the CD. By default the Nouveau-based graphics driver will be installed.

lspci |grep -i VGA
and check that your graphics card is listed.

To install the driver you first need to shut down the X server.
For this you can boot to runlevel 3 by typing
init 3
from the terminal.
Once in runlevel 3 your X server will stop. Now change to the superuser:
su
Now you need to install the required drivers from the repo. I have used the RPM Fusion repo:

rpm -Uvh http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-stable.noarch.rpm
rpm -Uvh http://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-stable.noarch.rpm


yum install kmod-nvidia xorg-x11-drv-nvidia-libs

Once installed, the initramfs used at boot should be regenerated:

## Backup old initramfs nouveau image ##
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
 
## Create new initramfs image ##
dracut /boot/initramfs-$(uname -r).img $(uname -r)

Now on restart you should boot with the Nvidia driver..... enjoy!!