Wednesday, April 16, 2014

Wikipedia, DBpedia and Extracting Information

In the Semantic Web, a major source of information is Wikipedia. It stands as the single largest source of semantic information. The other competitor is Freebase, most of whose data is retrieved from Wikipedia. Wikipedia has a structured counterpart known as DBpedia, which lets you download the dataset in triple format. This acts as a starting point for building your knowledge graphs. But often the most difficult part is retrieving the data that is relevant to you.
For instance, say you want to retrieve the tourist destinations in India.
The long, default way to do this is to get all the resource instances from DBpedia, check whether each instance is of a type such as hotel or museum, and retrieve those. Then look up the latitude and longitude, if given, and keep only the instances located in India. This tedious process needs MapReduce programming and a lot of iterations to finally retrieve the data.
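To make the long way concrete, here is a minimal single-machine sketch of the filtering logic in Python. Everything in it is an assumption made for illustration: the Instance record, the type labels, the sample rows and the rough bounding box (which also covers parts of neighbouring countries). On the real dump the same filter would have to run over every instance, which is why it ends up as a MapReduce job.

```python
from dataclasses import dataclass
from typing import Optional

# Rough bounding box around India. This is an assumption for illustration;
# it also covers parts of neighbouring countries, so it over-selects.
LAT_MIN, LAT_MAX = 6.5, 35.7
LON_MIN, LON_MAX = 68.0, 97.5

# Hypothetical type labels treated as tourist-related.
TOURIST_TYPES = {"Hotel", "Museum"}


@dataclass
class Instance:
    """A simplified, hypothetical view of one resource instance."""
    name: str
    rdf_type: str
    lat: Optional[float] = None
    lon: Optional[float] = None


def in_india(inst: Instance) -> bool:
    """True only when coordinates are given and fall inside the rough box."""
    if inst.lat is None or inst.lon is None:
        return False
    return LAT_MIN <= inst.lat <= LAT_MAX and LON_MIN <= inst.lon <= LON_MAX


def tourist_places_in_india(instances):
    """Keep instances of a tourist-related type that are located in the box."""
    return [i.name for i in instances
            if i.rdf_type in TOURIST_TYPES and in_india(i)]


# Made-up sample rows, not real DBpedia data.
sample = [
    Instance("Place A", "Museum", 27.17, 78.04),  # right type, inside the box
    Instance("Place B", "Museum", 48.86, 2.34),   # right type, outside the box
    Instance("Place C", "Hotel"),                 # right type, no coordinates
    Instance("Place D", "River", 25.30, 83.00),   # inside the box, wrong type
]

print(tourist_places_in_india(sample))
```

Note that an instance without coordinates is silently dropped, which is one more reason this approach loses data.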
There is a shorter, more efficient way to do this: in Wikipedia every piece of information is categorised, thanks to the active content editors.

In short, if you can get the contents of the category Tourism in India, then you are actually building the knowledge graph about tourism in India. And this can be done!
You need to download the categories data from Wikipedia and refine it. Basically, a category description in Wikipedia is available in triples format, as seen in DBpedia.
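As a rough illustration of what such triples look like and how to read them, here is a small Python sketch. It assumes the category data is in N-Triples form, with skos:broader linking a subcategory to its parent category and dcterms:subject linking a topic to its category. The two sample lines are made up for illustration (Example_subcategory and Example_place are placeholders, not real dump entries), and the regular expression only handles lines where all three terms are IRIs.

```python
import re

# Matches simple N-Triples lines whose subject, predicate and object are IRIs.
# Lines with literals or blank nodes are not handled by this sketch.
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')

# Predicates assumed to carry the category structure.
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Illustrative lines only; the subcategory and place names are placeholders.
SAMPLE = """\
<http://dbpedia.org/resource/Category:Example_subcategory> <http://www.w3.org/2004/02/skos/core#broader> <http://dbpedia.org/resource/Category:Tourism_in_India> .
<http://dbpedia.org/resource/Example_place> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Example_subcategory> .
"""


def parse_triples(text):
    """Return a list of (subject, predicate, object) tuples from N-Triples text."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match:
            triples.append(match.groups())
    return triples


for subject, predicate, obj in parse_triples(SAMPLE):
    print(subject, predicate, obj)
```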

Step 1) That means this can be further refined to get the subcategories as well as the subjects (topics) contained within the category Tourism in India.
Step 2) Once we have the subjects, we check whether each one is a category. If it is a category, its URL will be of the form resource/Category: followed by the category name. We then repeat step 1 for it.
Step 3) If it is not a category, we take the subject and its information from the instances dump of Wikipedia.
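The three steps above can be sketched as a breadth-first walk over the category triples. This is only a sketch under assumptions: the triples are already parsed into (subject, predicate, object) tuples, skos:broader and dcterms:subject are the predicates in use, and the sample data is made up. The seen set is my own addition, because category links can loop back on themselves and the walk would otherwise never finish.

```python
from collections import defaultdict, deque

CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"


def build_index(triples):
    """Map each category to everything filed directly under it."""
    members = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in (SKOS_BROADER, DCT_SUBJECT):
            members[obj].add(subject)
    return members


def collect_subjects(root, members):
    """Walk down from root and return every non-category subject found."""
    seen = {root}            # categories already queued, guards against loops
    queue = deque([root])
    subjects = set()
    while queue:
        category = queue.popleft()
        for member in members.get(category, ()):   # step 1: list the contents
            if member.startswith(CATEGORY_PREFIX):  # step 2: is it a category?
                if member not in seen:
                    seen.add(member)
                    queue.append(member)            # repeat step 1 for it
            else:
                subjects.add(member)                # step 3: keep the subject
    return subjects


# Made-up sample data; the names are placeholders, not real dump entries.
ROOT = CATEGORY_PREFIX + "Tourism_in_India"
SUB = CATEGORY_PREFIX + "Example_subcategory"
PLACE_1 = "http://dbpedia.org/resource/Example_place_1"
PLACE_2 = "http://dbpedia.org/resource/Example_place_2"
sample_triples = [
    (SUB, SKOS_BROADER, ROOT),
    (ROOT, SKOS_BROADER, SUB),   # deliberate loop to exercise the seen set
    (PLACE_1, DCT_SUBJECT, ROOT),
    (PLACE_2, DCT_SUBJECT, SUB),
    ("http://dbpedia.org/resource/Unrelated", DCT_SUBJECT,
     CATEGORY_PREFIX + "Other"),
]

found = collect_subjects(ROOT, build_index(sample_triples))
print(sorted(found))
```

The returned subjects are the keys you would then look up in the instances dump to pull out their details.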

This way we can extract the tourist destinations as well as the places of interest in India!