My problem was that I had a Freebase dump of 8 GB, in gz format. When exploded it comes to close to 60 GB, which was too large to be read into HDFS sequentially. Ideally I needed Hadoop to read it in compressed form itself. A little Google search took me here -
http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
I found the hadoop-lzo project - https://github.com/kevinweil/hadoop-lzo - and its author's blog post very useful, and I followed it completely.
I got the code from GitHub and ran:
ant compile-java
ant compile-native
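For reference, the whole fetch-and-build step looks roughly like this (a minimal sketch; it assumes git, ant and a JDK are available, that the lzo development headers - the liblzo2-dev package mentioned further down - are already installed, and that your hadoop-lzo version has the same ant targets):

git clone https://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant compile-java compile-native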
Then I copied the native files from my hadoop-lzo-0.4.15/lib/native/Linux-amd64-64
to the Hadoop native lib folder - hadoop-1.1.0/lib/native/Linux-amd64-64.
I also copied hadoop-lzo-0.4.15.jar to the Hadoop lib folder - hadoop-1.1.0/lib (I doubt whether this is strictly needed, but I did it anyway; it puts the codec classes on Hadoop's classpath).
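In command form, it was roughly this (a sketch; adjust the paths to wherever you unpacked hadoop-lzo and Hadoop, and to wherever your ant build dropped the jar):

cp hadoop-lzo-0.4.15/lib/native/Linux-amd64-64/* hadoop-1.1.0/lib/native/Linux-amd64-64/
cp hadoop-lzo-0.4.15/build/hadoop-lzo-0.4.15.jar hadoop-1.1.0/lib/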
Now you need to add the codec to the Hadoop configuration. For this, add the following to core-site.xml:
<!-- added for lzo decompress support; need to move to spring-flow -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
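The running Hadoop daemons only pick up the new codec configuration after a restart. For a single-node setup managed with the stock scripts, that is simply (adjust if you start your daemons differently):

cd hadoop-1.1.0
bin/stop-all.sh
bin/start-all.sh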
Now I decompressed my gz and compressed it back into lzo format:

gzip -d file.gz

If your file is corrupt, you can do it like this instead:

gunzip < file.gz > file.txt
Now I installed LZO with:

sudo apt-get install liblzo2-dev

(the lzop command-line tool itself comes from the separate lzop package, if it isn't already on your machine)
and then compressed the file into lzo format:

lzop file.txt
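Since the uncompressed dump is around 60 GB, you can also avoid the intermediate text file entirely by streaming straight from gzip into lzop. A sketch, assuming the dump is called file.gz (-c tells lzop to write the compressed stream to stdout):

gzip -dc file.gz | lzop -c > file.lzo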
Now I copied the .lzo file into HDFS using:

hadoop fs -mkdir /lzofiles
hadoop fs -put file.lzo /lzofiles
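If the codec configuration and the native libraries are wired up correctly, the client can already decompress the file transparently. A quick sanity check (a sketch; hadoop fs -text picks the codec from the .lzo extension):

hadoop fs -ls /lzofiles
hadoop fs -text /lzofiles/file.lzo | head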
Now I ran the command:
./hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /lzofiles
It gave output like:

13/03/07 09:19:04 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/03/07 09:19:04 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/03/07 09:19:04 INFO lzo.LzoIndexer: LZO Indexing directory /lzofiles...
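As a side note, hadoop-lzo also ships a MapReduce-based indexer, com.hadoop.compression.lzo.DistributedLzoIndexer, which builds the index files as a job instead of on the local machine - handy for large directories. The invocation has the same shape (a sketch, assuming your hadoop-lzo version includes that class):

./hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /lzofiles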
Before this worked, I initially got an error like:
13/03/07 09:11:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:845)
To fix this, I ensured that JAVA_LIBRARY_PATH is set correctly to the Hadoop native lib folder. I echoed the variable from within the hadoop command file and confirmed it was set. You can set it with something like:

export JAVA_LIBRARY_PATH=/path/to/hadoop-1.1.0/lib/native/Linux-amd64-64
Also make sure you build those libgplcompression.so files on your own machine, because these are native libraries and depend on your machine's architecture and OS.
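A quick way to confirm the native pieces ended up where Hadoop expects them (paths as in the setup above):

ls hadoop-1.1.0/lib/native/Linux-amd64-64/libgplcompression*
echo $JAVA_LIBRARY_PATH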
That's it.
This is still messy; I need to fine-tune it to suit the Spring Batch way of integration. I shall do that if time permits. Till then,
Happy coding.
Update: if you are on CDH4, you can skip the manual build and install hadoop-lzo from Cloudera's GPL Extras repo instead:

cd /etc/yum.repos.d/ && wget http://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo
yum install hadoop-lzo-cdh4 hadoop-lzo-cdh4-mr1