Tanvir Ahmad, R. Ahmad, Sarah Masud, Farheen Nilofer
Title: Framework to extract context vectors from unstructured data using big data analytics
Published in: 2016 Ninth International Conference on Contemporary Computing (IC3), August 2016
DOI: 10.1109/IC3.2016.7880229 (https://doi.org/10.1109/IC3.2016.7880229)
Citations: 8
Abstract
When multiple terms in a query point to a single concept, the mapping is straightforward. But when many morphologically similar terms refer to separate concepts (exhibiting fuzzy behavior), arriving at a solution becomes difficult. Before applying any knowledge generation or representation techniques to such polysemic words, word sense disambiguation becomes imperative. Unfortunately, with an exponential increase in data, the process of information extraction becomes harder. For text data, this information is represented in the form of context vectors. However, the generation of context vectors is limited by the heap memory and RAM of traditional single-machine systems. The aim of this study is to examine and propose a framework for computing high-dimensional context vectors over Big Data, overcoming the bottleneck of traditional systems. The proposed framework is based on a set of mappers and reducers implemented on Apache Hadoop. As the size of the input dataset grows, the dimensions of the related concepts (in the form of the resultant matrix) increase beyond the capacity of a single system. This bottleneck of handling large dimensions is resolved by clustering. As observed in the study, the transition from a single system to a distributed system ensures that the process of information extraction runs smoothly, even as the data grows.
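The abstract describes computing context vectors with mappers and reducers on Hadoop. As a rough illustration of that map/reduce pattern (not the paper's actual implementation), the sketch below builds sparse co-occurrence context vectors in plain Python: a mapper emits (term, neighbor) pairs within a fixed window, and a reducer aggregates the counts per term. The corpus, window size, and function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy corpus; the paper's actual dataset is not specified here.
DOCS = [
    "the bank approved the loan",
    "the river bank was muddy",
    "the loan from the bank",
]
WINDOW = 2  # co-occurrence window size (an illustrative choice)

def mapper(doc):
    """Map step: emit (term, neighbor) pairs within a fixed window."""
    tokens = doc.split()
    for i, term in enumerate(tokens):
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                yield term, tokens[j]

def reducer(pairs):
    """Reduce step: aggregate neighbor counts into sparse context vectors."""
    vectors = defaultdict(lambda: defaultdict(int))
    for term, neighbor in pairs:
        vectors[term][neighbor] += 1
    return {t: dict(v) for t, v in vectors.items()}

context_vectors = reducer(chain.from_iterable(mapper(d) for d in DOCS))
```

On Hadoop the same mapper/reducer pair would run distributed over input splits, which is what lets the resultant matrix grow beyond a single machine's memory; the polysemy of "bank" in this toy corpus is exactly the kind of ambiguity the context vectors are meant to capture.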