Friday, March 8, 2013

Big Data - compressed/zip/gz reading.



My problem was that I had a Freebase dump of 8 GB, in gz format.

But when exploded it becomes close to 60 GB - too large to be read into HDFS sequentially.



Ideally I needed Hadoop to read it in compressed format itself. A little Google search took me here -

http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/



I found Kevin Weil's hadoop-lzo project - https://github.com/kevinweil/hadoop-lzo - very useful, and I followed his instructions completely.

I got the code from GitHub and ran

ant compile-java

ant compile-native



And then I copied the native files from - hadoop-lzo-0.4.15/lib/native/Linux-amd64-64

to the Hadoop native lib folder - hadoop-1.1.0/lib/native/Linux-amd64-64

I also copied the hadoop-lzo-0.4.15.jar to - hadoop-lzo-0.4.15/lib (* I doubt whether this is needed; I did it anyway)



Now you need to add the codecs to the Hadoop configuration. For this, add the following to core-site.xml:


<!-- added for lzo decompress support. need to move to spring-flow -->



<property>

    <name>io.compression.codecs</name>

    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>

</property>

<property>

    <name>io.compression.codec.lzo.class</name>

    <value>com.hadoop.compression.lzo.LzoCodec</value>

</property>



Now I decompressed my gz and compressed it back to LZO format.



gzip -d file.gz

If your file is corrupt you can do it like:

      gunzip < file.gz > file.txt
Now I installed LZO by

      sudo apt-get install liblzo2-dev

(you may also need the lzop package for the lzop command below) and then I compressed it into LZO format

     lzop file.txt

Now I copied the file.lzo into HDFS using:
hadoop dfs -put file.lzo /lzofiles

Now I ran the command
./hadoop jar hadoop-lzo-0.4.15.jar  com.hadoop.compression.lzo.LzoIndexer /lzofiles

It gave output like



13/03/07 09:19:04 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library

13/03/07 09:19:04 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]

13/03/07 09:19:04 INFO lzo.LzoIndexer: LZO Indexing directory /lzofiles...

Initially I got an error like




13/03/07 09:11:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library

java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path

    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)

    at java.lang.Runtime.loadLibrary0(Runtime.java:845)

To fix this, I ensured that JAVA_LIBRARY_PATH is set correctly to the Hadoop native lib folder.

I echoed the variable from within the hadoop command file and made sure it was set. You can set it by putting

export JAVA_LIBRARY_PATH=<path to the Hadoop native lib folder> in hadoop-env.sh
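For example, the line in hadoop-env.sh would look like this - the path below is only an example; use wherever your libgplcompression.so files actually live:

```shell
# hadoop-env.sh - point the JVM at the native libraries.
# The path is an example for my hadoop-1.1.0 layout; adjust to yours.
export JAVA_LIBRARY_PATH=/path/to/hadoop-1.1.0/lib/native/Linux-amd64-64
```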

Also ensure that you build those libgplcompression.so files on your own machine, because these are native libraries and depend on your machine's architecture and OS.

That's it.

Now this is messy; I need to fine-tune it to suit the Spring Batch way of integration. I shall do that if time permits. Till then,

Happy coding.



Installing on Cloudera is very simple:
install the relevant repo and then install the libraries



cd /etc/yum.repos.d/ && wget http://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo
yum install hadoop-lzo-cdh4 hadoop-lzo-cdh4-mr1





