Monday, November 25, 2013

Debugging and Testing MR Code Within the IDE

One of the nightmares associated with writing MapReduce (MR) code is the difficulty of debugging and tracing the program. Since it runs as a job on a cluster, many newcomers find it very frustrating. A solution is to write the jobs and run them in local standalone mode, so that one can debug and test them like normal code from within the IDE, and then deploy them to the cluster for real runs. And all of this needs to happen from within the same environment.

We were able to do this using Spring Hadoop and the Eclipse IDE. In short, I develop the jobs within Eclipse, debug and test them against a standalone job tracker running inside the IDE, and then finally deploy them to the real cluster.
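For reference, here is a minimal sketch of forcing a job into local standalone mode before it is submitted. These are the Hadoop 1.x property names (mapred.job.tracker, fs.default.name); whether you need to set them explicitly depends on what your default configuration already contains.

import org.apache.hadoop.conf.Configuration;

public final class LocalModeConfigurer {

    private LocalModeConfigurer() {}

    // Sketch: switch a job configuration to local, in-process execution.
    public static void forceLocalMode(Configuration conf) {
        conf.set("mapred.job.tracker", "local");  // run map/reduce tasks in a single local JVM
        conf.set("fs.default.name", "file:///");  // read input and write output on the local file system
    }
}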

Here is the Spring Hadoop configuration and the test Java class:


<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:batch="http://www.springframework.org/schema/batch"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:util="http://www.springframework.org/schema/util"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
    http://www.springframework.org/schema/batch http://www.springframework.org/schema/batch/spring-batch-2.1.xsd
    http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd
    http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
    http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util-3.0.xsd">

    <hdp:configuration>
        # the two hbase.zookeeper properties are only required if you are connecting to HBase
        hbase.zookeeper.quorum=xxxxx01
        hbase.zookeeper.property.clientPort=2181
        hbase.mapred.outputtable=xxxxxx
    </hdp:configuration>

    <hdp:job id="xxxxJob"
        output-path="xxxxx"
        jar-by-class="com.xxx.xxx.xxx.xxx.xxxxx"
        jar="classpath:xxxxx-0.0.1-job.jar"
        />

    <hdp:job-runner id="jobRunner" job-ref="xxxxJob" run-at-startup="false"/>

</beans>
The sample code for testing this job is:

import javax.inject.Inject;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.log4j.Logger;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.data.hadoop.mapreduce.JobRunner;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations={"/ApplicationContext.xml"})
public class XXXXTest {

    @Inject
    JobRunner jobRunner;

    @Inject
    Job xxxxJob;

    @Test
    public void test() {

        Logger log = Logger.getLogger(XXXXTest.class);
        log.info("Started the test!!");

        Configuration conf = xxxxJob.getConfiguration();

        // Any configuration that you need to perform upon the job should be done here!!

        try {
            jobRunner.call();  // runs the job and blocks until it completes
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


Common pitfalls!!

Caused by: java.lang.RuntimeException: hbase-default.xml file seems to be for and old version of HBase (0.94.6-cdh4.4.0), this version is Unknown
    at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:68)
    at org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(HBaseConfiguration.java:100)
    at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:111)
    at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:120)
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.setConf(TableOutputFormat.java:181)

This happens due to a class loading problem with HBase. Ideally HBase should load hbase-default.xml from the bundled job jar, but sometimes this weird thing happens because the file gets loaded from somewhere else, possibly from the running project's own HBase jars. It can be avoided by adding an hbase-default.xml that tells HBase not to check for version mismatches.

Before making this change, also make sure there are no duplicate HBase jars on the classpath causing the problem.
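As an alternative sketch (assuming you already keep an hbase-site.xml on the test classpath), the same skip flag can go there instead, since HBaseConfiguration loads hbase-site.xml before it runs the version check:

<!-- hbase-site.xml (sketch): disable the hbase-default.xml version check -->
<configuration>
  <property>
    <name>hbase.defaults.for.version.skip</name>
    <value>true</value>
  </property>
</configuration>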

The contents of hbase-default.xml are:



<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.defaults.for.version.skip</name>
    <value>true</value>
    <description>Set to true to skip the 'hbase.defaults.for.version' check.
      Setting this to true can be useful in contexts other than the other side
      of a maven generation; i.e. running in an ide. You'll want to set this
      boolean to true to avoid seeing the RuntimException complaint:
      "hbase-default.xml file seems to be for and old version of HBase (0.92.1),
      this version is X.X.X-SNAPSHOT"</description>
  </property>
</configuration>

Saturday, November 23, 2013

DataDrivenDBInputFormat and DBInputFormat

Our requirement was to read data from a database and insert the records into HBase as a backup flow. For PostgreSQL this was straightforward:

        String databaseDriver = "org.postgresql.Driver";
        String databaseURL = "jdbc:postgresql://xxxx:5432/xxxxdb";
        String databaseUsername = "xxxxxx";
        String databasePassword = "xxxxxxxx";

        job.setInputFormatClass(DBInputFormat.class);
        String[] fields = {"xwxwxwxw","xwxwxwxw","xwxwxwxw","xwxwxwxw","xwxwxwwxwxw","xwxwxwxw","xwxwxwxw"};
        DBConfiguration.configureDB(conf, databaseDriver, databaseURL, databaseUsername, databasePassword);
        // setInput(job, inputClass, tableName, conditions, orderBy, fieldNames)
        DBInputFormat.setInput(job, DBRecord.class, "xwxwxwxw", null, "xw", fields);
        job.setOutputFormatClass(TableOutputFormat.class);
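For completeness, here is a minimal sketch of the mapper side of this flow, turning each DBRecord into an HBase Put for the TableOutputFormat. The class name, column family, qualifier, and the getId()/getPayload() getters are hypothetical, and the Put API shown is the HBase 0.94-era one.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: DBInputFormat hands us (row id, DBRecord) pairs; emit one Put per record.
public class DBToHBaseMapper
        extends Mapper<LongWritable, DBRecord, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, DBRecord record, Context context)
            throws IOException, InterruptedException {
        Put put = new Put(Bytes.toBytes(record.getId()));   // row key from the record id
        put.add(Bytes.toBytes("cf"),                        // hypothetical column family
                Bytes.toBytes("payload"),                   // hypothetical qualifier
                Bytes.toBytes(record.getPayload()));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}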

But then we had another dump which was in SQL Server, and there we got an error saying:

java.io.IOException: Incorrect syntax near 'LIMIT'. at org.apache.hadoop.mapreduce.lib.db.DBRecordReader.nextKeyValue(DBRecordReader.java:235)

This is because the default DBInputFormat pages through the table with LIMIT and OFFSET clauses when creating the splits, and SQL Server and Oracle do not support that syntax.

There are some DB-specific record readers for such databases, but in these cases we can also use the DataDrivenDBInputFormat. Here we are required to give two queries: one that retrieves the data for a split (it must contain the placeholder $CONDITIONS, which the framework replaces with a range predicate on the split column, e.g. id >= 2000 AND id < 4000), and a bounding query that retrieves the minimum and maximum values of the split column, from which the split ranges are calculated. The new configuration looked like this:

        String databaseDriver = "com.microsoft.sqlserver.jdbc.SQLServerDriver";
        String databaseURL = "jdbc:sqlserver://xxxx:1433;databaseName=xxxxdb";
        String databaseUsername = "xxxxxx";
        String databasePassword = "xxxxxxxx";
        DBConfiguration.configureDB(conf, databaseDriver, databaseURL, databaseUsername, databasePassword);

        // $CONDITIONS is replaced by the framework with the range predicate for each split
        String inputQuery = "SELECT * FROM xxxxxxxx WHERE $CONDITIONS";
        String boundQuery = "SELECT MIN(id), MAX(id) FROM xxxxxxx";
        DataDrivenDBInputFormat.setInput(job, DBRecord.class, inputQuery, boundQuery);

Now that ran fine, except that I ran into one more issue:

Exception in thread "main" java.io.IOException: The index 2 is out of range.
    at org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:193)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063)
This was because the DBRecord class that maps the table to a value object was stale: it didn't include the id column, since it had originally been written for the DBInputFormat. With the DataDrivenDBInputFormat I also had to incorporate the id, as that column is used for calculating the bounds and the splits. Adding the id to DBRecord.java solved the problem!!
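For reference, here is a minimal sketch of what the updated DBRecord looks like. The payload column is hypothetical; the important part is that the id column referenced by the bounding query is mapped in readFields(ResultSet).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Sketch: value object mapping one table row; must include the split column (id).
public class DBRecord implements Writable, DBWritable {

    private long id;         // the split column used by the bounding query
    private String payload;  // hypothetical data column

    public long getId() { return id; }
    public String getPayload() { return payload; }

    @Override  // called by the record reader for each row of the input query
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");
        payload = rs.getString("payload");
    }

    @Override  // only used when writing back to a database; unused in this flow
    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, payload);
    }

    @Override  // Hadoop serialization of the value between job stages
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        payload = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(payload);
    }
}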

Happy coding