{"title":"Scalability of Correlation Clustering Through Constraint Reduction","authors":"Mamata Samal, V. Saradhi, Sukumar Nandi","doi":"10.1145/2567688.2567695","DOIUrl":"https://doi.org/10.1145/2567688.2567695","url":null,"abstract":"Correlation clustering (CC) is a graph-based clustering method. Edges of the graph are labeled either positive or negative depending on the similarity/dissimilarity between the pair of vertices. The objective of CC is to group vertices of the induced complete graph so as to maximize the positively labeled edges that lie within a group and to maximize the negatively labeled edges that lie across groups. This objective function is formulated as a semidefinite programming (SDP) problem, which is well studied theoretically and produces encouraging approximation values. In this work we propose a scalable solution for the SDP formulation of correlation clustering (SDP-CC) by reducing the number of constraints. The proposed formulation is solved efficiently using the SDP-NAL tool. The proposed scalable formulation is compared with another scalable variant, namely variable-reduction-based CC. Both scalable formulations are tested on synthetic and real-world data sets whose graph sizes range from 100 to 13000 vertices. Large-scale benchmark graph data sets, whose sizes range from 2395 to 13992 vertices, are also tested. The proposed formulation is shown to have an edge over the original SDP-CC formulation, the variable reduction variant of SDP-CC, and a constrained clustering method, namely constrained spectral clustering.","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129921755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing Utilitarian Aggregation of Open Knowledge","authors":"S. Srinivasa, Sweety Agrawal, Chinmay Jog, Jayati Deshmukh","doi":"10.1145/2567688.2567689","DOIUrl":"https://doi.org/10.1145/2567688.2567689","url":null,"abstract":"Recent initiatives in \"open data\" have resulted in good-quality, freely available, tabular datasets on the web. However, such datasets are fragmented and arbitrarily structured and are not of much utility in isolation. To address this, there are several \"open knowledge\" initiatives that aim to stitch together open data elements into semantically meaningful structures. But such efforts are met with unique challenges. We argue in this paper that knowledge aggregation can be of two kinds -- encyclopedic aggregation, which aims to elucidate, and utilitarian aggregation, which aims to create actionable knowledge elements. We also argue that utilitarian aggregation is a characteristically different problem from that of conventional efforts like Freebase or DBpedia that address encyclopedic aggregation. In addition, when it comes to utilitarian knowledge, we observe that openness is not a binary condition and instead there is a need to distinguish between knowledge that is \"open-ended\" and knowledge that is \"open.\" We formalize the notion of open knowledge based on how much knowledge about, or control over, its consumers the creator of a knowledge element has. Based on these arguments, we propose an underlying knowledge representation framework for encoding open utilitarian knowledge.","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121828339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pinned it! A Large Scale Study of the Pinterest Network","authors":"Sudip Mittal, Neha Gupta, Prateek Dewan, P. Kumaraguru","doi":"10.1145/2567688.2567692","DOIUrl":"https://doi.org/10.1145/2567688.2567692","url":null,"abstract":"Pinterest is an image-based online social network, which was launched in the year 2010 and has gained a lot of traction ever since. Within 3 years, Pinterest has attained 48.7 million unique users. This stupendous growth makes it interesting to study Pinterest, and gives rise to multiple questions about its users and content. We characterized Pinterest on the basis of large-scale crawls of 3.3 million user profiles and 58.8 million pins. In particular, we explored various attributes of users, pins, boards, pin sources, and user locations in detail, and performed topical analysis of user-generated textual content. The characterization revealed the most prominent topics among users and pins, top image sources, and geographical distribution of users on Pinterest. We then tried to predict the gender of American users based on a set of profile, network, and content features, and achieved an accuracy of 73.17% with a J48 Decision Tree classifier. We then exploited the users' names by comparing them to a corpus of top male and female names in the U.S.A., and achieved an accuracy of 86.18%. To the best of our knowledge, this is the first attempt to predict gender on Pinterest.","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132786691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotion Recognition from Audio and Visual Data using F-score based Fusion","authors":"Abhishek Gera, Arnab Bhattacharya","doi":"10.1145/2567688.2567690","DOIUrl":"https://doi.org/10.1145/2567688.2567690","url":null,"abstract":"Emotion recognition has been one of the cornerstones of human-computer interaction. Although decades of work have attacked the problem of automatic emotion recognition from either audio or video signals, the fusion of the two modalities is more recent. In this paper, we aim to tackle the problem when both audio and video data are available in a synchronized manner. We address the six basic human emotions, namely, anger, disgust, fear, happiness, sadness, and surprise. We employ an automatic face tracker to extract the different facial points of interest from a video. We then compute feature vectors for each video frame using distances and angles between the tracked points. For audio data, we use the pitch, energy and MFCC to derive feature vectors for each window as well as the entire audio signal. We use two standard techniques, GMM-based HMM and SVM, as the base classifiers. We then design a novel fusion method using the F-score of the base classifiers. We first demonstrate that our fusion approach can increase the accuracy of the base classifiers by as much as 5%. Finally, we show that our fusion-based bi-modal emotion recognition method achieves an overall accuracy of 54% on a publicly available database, which is an improvement upon the current state-of-the-art by 9%.","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"167 2 Suppl 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125983091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theme Based Clustering of Tweets","authors":"R. M. Tripathy, S. Sharma, Sachindra Joshi, S. Mehta, A. Bagchi","doi":"10.1145/2567688.2567694","DOIUrl":"https://doi.org/10.1145/2567688.2567694","url":null,"abstract":"In this paper, we present an overview of our approach for clustering tweets. Because tweets are short, traditional text clustering mechanisms alone may not produce optimal results. We believe that there is an underlying theme/topic present in the majority of tweets, which is evident in the growing usage of the hashtag feature in the Twitter network. Clustering tweets based on these themes seems a more natural way of grouping them. We propose to use the Wikipedia topic taxonomy to discover the themes from the tweets and to use the themes along with a traditional word-based similarity metric for clustering. We show some of our initial results to demonstrate the effectiveness of our approach.","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"379 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115972131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One, Two, Hash! Counting Hash Tables for Flash Devices","authors":"Tyler Clemons, S. M. Faisal, S. Tatikonda, C. Aggarwal, S. Parthasarathy","doi":"10.1145/2567688.2567693","DOIUrl":"https://doi.org/10.1145/2567688.2567693","url":null,"abstract":"In recent years, advances in hardware technology have led to the increasingly widespread use of flash storage devices. Such devices have clear benefits over traditional hard drives in terms of latency of access, bandwidth, and random access capabilities, particularly when reading data. However, there are some interesting tradeoffs. On a relative scale, writing to such devices can be expensive. This is because typical flash devices (NAND technology) are updated in blocks. A minor update to a given block requires the entire block to be erased, also referred to as cleaned, followed by a re-writing of the block. On the other hand, sequential writes can be two orders of magnitude faster than random writes. In addition, random writes shorten the life of the flash drive because each block can support only a limited number of cleaning operations. Hash tables are a particularly challenging case for the flash drive because this data structure is inherently dependent upon the randomness of the hash function, as opposed to the spatial locality of the data. Thus it is difficult to avoid random writes. In this paper, we will study the design landscape for the development of a hash table for flash storage devices. We demonstrate design tradeoffs with the design of a hash table by using two related hash functions, one of which exhibits a data placement property with respect to the other. Specifically, we focus on three designs based on this general philosophy and evaluate the tradeoffs among them along the axes of query performance, insert and update times, and I/O time.","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122800955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and Effective Route Planning in Road Networks with Probabilistic Data using Skyline Paths","authors":"Arzoo Katiyar, Arnab Bhattacharya, Shubhadip Mitra","doi":"10.1145/2567688.2567691","DOIUrl":"https://doi.org/10.1145/2567688.2567691","url":null,"abstract":"In this paper, we study the problem of effective route search in road networks. Given a pair of source and destination locations, the aim is to find a path from the source to the destination that visits k different types of sites in a particular order as prescribed by the user. The route planning problem has two objectives to optimize: minimize the total path length and maximize the probability of getting served from the k sites. Since the problem has a multi-objective nature, we utilize the skyline setting and retrieve all skyline paths according to the two aggregated attributes. The naïve way of determining the path lengths can involve a large number of shortest path computations. Although the shortest paths between the sites can be pre-computed, the shortest paths from the source to the first type of site and those from the last type of site to the destination cannot be computed in an offline manner, as the source and destination are arbitrary points that are available only at runtime. Similarly, the choice and order of the k different types of sites are also specified only at runtime. Since it is prohibitive to compute many shortest paths in a large road network, we employ a heuristic to approximately solve the problem. The shortest path computation from the source to a site (and similarly, from a site to the destination) is approximated by introducing reference points. The reference points are chosen by employing a grid-based partitioning method on the space underlying the road network. We show that the above heuristic introduces only an additive error to the distance but not to the probability of service while reducing the running times by up to orders of magnitude.","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131969055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 1st IKDD Conference on Data Sciences","authors":"","doi":"10.1145/2567688","DOIUrl":"https://doi.org/10.1145/2567688","url":null,"abstract":"","PeriodicalId":253386,"journal":{"name":"Proceedings of the 1st IKDD Conference on Data Sciences","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131221914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}