{"title":"Query Classification using LDA Topic Model and Sparse Representation Based Classifier","authors":"Indrani Bhattacharya, J. Sil","doi":"10.1145/2888451.2888474","DOIUrl":"https://doi.org/10.1145/2888451.2888474","url":null,"abstract":"Users often seek information by submitting a query consisting of keywords that may belong to multiple topics, representing overlapping concepts. The objective of this work is to classify the query into a topic class label by considering the query keywords distributed over various topics. The approach effectively reduces the search space in order to retrieve information in a computationally efficient way. First, we apply Latent Dirichlet Allocation (LDA) on the entire corpus to group the documents into topics consisting of unique words. As a next step, a term vocabulary (TRV) is built from the unique words present in the topics. We develop a Topic-Vocabulary Matrix (TVM) by encoding the TRV with respect to each topic. The TVM expresses the word distribution among the topics and is presented as the training data set, which is sparse. The query is encoded in the same way and submitted as test data. We apply a sparse representation based classifier (SRC) to classify the query as a topic. The proposed approach shows satisfactory performance, with 93% accuracy in classifying queries.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123584802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Local and Global Context In PPI networks For Efficient Protein Function Prediction","authors":"D. S. Kumar, Siddharth Goyal, V. Reddy, Ramesh Loganathan","doi":"10.1145/2888451.2888461","DOIUrl":"https://doi.org/10.1145/2888451.2888461","url":null,"abstract":"Protein-protein interaction (PPI) networks are a valuable biological data source containing rich information useful for protein function prediction. The PPI network data obtained from high-throughput experiments is known to be noisy and incomplete. In the literature, common neighbor, clustering, and classification-based approaches have been proposed to improve the performance of protein function prediction by modeling PPI data as a graph. These approaches exploit the fact that a protein shares function with other proteins directly interacting with it. In this paper we experiment with an alternative approach that exploits the notion that two proteins share a function if they have a well defined group of directly or indirectly connected common neighbors. The experiments conducted on a variety of PPI network datasets show that the proposed approach improves protein function prediction accuracy over existing approaches.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126256251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Spatio-temporal Change Pattern using Mathematical Morphology","authors":"Monidipa Das, S. Ghosh","doi":"10.1145/2888451.2888458","DOIUrl":"https://doi.org/10.1145/2888451.2888458","url":null,"abstract":"Detection and assessment of spatio-temporal change pattern is a challenging task, and may provide insights into various spatio-temporal changes, like urban sprawl monitoring, surveillance of epidemics due to infectious diseases etc. The existing spatio-temporal pattern mining techniques mostly deal with the assessment of thematic change patterns. However, analyzing the spatio-temporal pattern of geometric changes is also important for analyzing such kinds of spatial changes on a temporal scale. This paper presents a novel framework for modeling such spatio-temporal change in geometry with the help of mathematical morphology and directional granulometric analysis. Morphological operators have been used to detect the various spatio-temporal change patterns in geometry, like spatial growth (due to Expansion and Merge), spatial shrinkage (due to Contraction and Split) etc. Further, the temporal changes in the orientations of these patterns have been modeled by performing granulometric analyses on them. The proposed framework for spatio-temporal change pattern modeling has been validated considering four cases of spatio-temporal change, namely (i) spatial expansion, (ii) spatial contraction, (iii) spatial merge, and (iv) spatial split in regional distribution of climate zones in Australia.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134252030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning transition models of biological regulatory and signaling networks from noisy data","authors":"Deepika Vatsa, Sumeet Agarwal, A. Srinivasan","doi":"10.1145/2888451.2888469","DOIUrl":"https://doi.org/10.1145/2888451.2888469","url":null,"abstract":"In this paper, we present an extended 2-step probabilistic LGTS (PLGTS) transition system which aims to identify the network structure and stochastic nature of biological processes using time series data. This work is a step towards system identification in a noisy environment using transition systems. Here, the noise implies noise in transitions between states in the observed data. Interestingly, noise in the data helps in assisting system identification. Experimental results on synthetic data show that noise actually helps in understanding the system dynamics as well as constraining the solution space; thus helping to identify the most probable network structure for a given data set.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134341339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Quick Reduct Algorithm: Iterative MapReduce Approach","authors":"P. Singh, P. Prasad","doi":"10.1145/2888451.2888476","DOIUrl":"https://doi.org/10.1145/2888451.2888476","url":null,"abstract":"Feature selection by reduct computation is the key technique for knowledge acquisition using rough set theory. Existing MapReduce based reduct algorithms use the Hadoop MapReduce framework, which is not suitable for iterative algorithms. This paper aims to design and implement an Iterative MapReduce based Quick Reduct algorithm using the Twister framework. The proposed In_MRQRA algorithm has partial granular level computations at mappers and granular computations at the reducer. Experimental analysis on the KDD-Cup99 dataset empirically established the relevance of the proposed approach.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128549862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weighted Linear Loss Twin Support Vector Clustering","authors":"Reshma Khemchandani, Aman Pal","doi":"10.1145/2888451.2888467","DOIUrl":"https://doi.org/10.1145/2888451.2888467","url":null,"abstract":"Traditional point based clustering methods such as k-means [1], k-median [2], etc. work by partitioning the data into clusters based on the cluster prototype points. These methods perform poorly when the data is not distributed around several cluster points. In contrast to these, plane based clustering methods such as k-plane clustering [3], local k-proximal plane clustering [4], etc. have been proposed in the literature. These methods calculate k cluster center planes and partition the data into k clusters according to the proximity of the datapoints to these k planes. Working on the lines of [5], in this paper, we present a Weighted Linear Loss Twin Support Vector Clustering, termed WLL-TWSVC, for clustering problems. Introducing the weighted linear loss in the formulation of TWSVC leads to solving a system of linear equations with lower computational cost, as opposed to solving a series of quadratic programming problems along with a system of linear equations as in TWSVC. We have also introduced a regularization term in the objective function which takes care of the structural risk component along with the empirical risk.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129951344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Potential of Aggregated Tweets as Surrogate Data for Forecasting Civil Protests","authors":"Swati Agarwal, A. Sureka","doi":"10.1145/2888451.2888466","DOIUrl":"https://doi.org/10.1145/2888451.2888466","url":null,"abstract":"Online Micro-blogging Social Media websites like Twitter are being used as a real-time platform for information sharing and communication during planning and mobilization of civil unrest events. We conduct a study of more than 1.5 million English Tweets spanning 5 months on the topic of Immigration and found evidence of Twitter being used as a platform for planning and mobilization of protests and civil disobedience related demonstrations. We believe that Twitter data can be used as a surrogate and open-source precursor for forecasting civil unrest and investigate Machine Learning based techniques for building a prediction model. We present our solution approach consisting of various components such as named entity recognition (temporal, spatial location, people expressions extraction), semantic enrichment of events related tweets (crowd-buzz & commentary and mobilization & planning) and a location-time-topic correlation miner. We conduct a series of experiments on a real-world and large dataset and investigate the application of trend analysis. We conduct two case studies on civil unrest related events and demonstrate the effectiveness of our approach.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129978919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Multi-source Data to Study Workplace Activity Patterns","authors":"Sachin Patel, Ravi Mahamuni, Meghendra Singh, David Clarance, Mayuri Duggirala, Shivani Sharma, Vinay Katiyar, Gauri Deshpande, Amruta Deshmukh, Vaibhav, Vivek Balaraman","doi":"10.1145/2888451.2888470","DOIUrl":"https://doi.org/10.1145/2888451.2888470","url":null,"abstract":"Examining work activity patterns is a problem of enduring research in organizations. The fortuitous availability of a whole new set of data collection mechanisms such as mobiles, activity loggers, GPS based location detectors, provide us new ways of studying workplace behaviour. We present a data collection framework that helps in collection, anonymization, fusion, processing and mining of behavioural data. We use the framework to study the activities in a research and development team with an aim to find the relationship between behavioural traits, states, and activity patterns. We find partial support for the claim that behavioral states and activity patterns are associated.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125107479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trustworthiness of t-Distributed Stochastic Neighbour Embedding","authors":"Shishir Pandey, R. Vaze","doi":"10.1145/2888451.2888465","DOIUrl":"https://doi.org/10.1145/2888451.2888465","url":null,"abstract":"A well known technique for embedding high dimensional objects in two or three dimensional space is the t-distributed stochastic neighbour embedding (t-SNE). The t-SNE minimizes the Kullback-Leibler (KL) divergence between two probability distributions, one induced on points in the high dimensional space and the other induced on points in the low dimensional embedding space. In this work, we consider a more general framework of using Rényi divergence, which is parametrized by the order α; the KL-divergence is a special case when α → 1. We study how various Rényi divergences perform when compared to the KL-divergence. We show that in terms of the metrics of trustworthiness and neighbourhood preservation, the embedding becomes better as Rényi divergence approaches the KL-divergence.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"65 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134092287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SocialStories: Segmenting Stories within Trending Twitter Topics","authors":"Kokil Jaidka, Kaushik Ramachandran, Prakhar Gupta, Sajal Rustagi","doi":"10.1145/2888451.2888453","DOIUrl":"https://doi.org/10.1145/2888451.2888453","url":null,"abstract":"This study presents SocialStories - a system based on incremental clustering for streaming tweets, for identifying fine-grained stories within a broader trending topic on Twitter. The contributions include a novel tf-metric, called the inverse cluster frequency, and a decay weighting for entities. We present our experiments on 0.19 million tweets posted in June 2014, revolving around the mentions of a software brand before, during and after a marketing conference and a software release. The novelty of our work is the text-based similarity calculation metrics, including a new similarity metric, called the inverse cluster frequency, and time-specific metrics that allow for the decay of old entities with the passage of time and preserve the homogeneity and the freshness of stories. We report improved performance and higher recall of 80%, against the gold standard (posthoc journalistic reports), as compared to LDA-, and Wavelet-based systems. Our algorithm is able to cluster 80% of all tweets into story-based clusters, which are 86% pure. It also enables earlier detection of trending stories than manual reports, and is far more accurate in identifying fine-grained stories within sub-topics as compared to baseline systems.","PeriodicalId":136431,"journal":{"name":"Proceedings of the 3rd IKDD Conference on Data Science, 2016","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125379682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}