Proceedings of the 1st IKDD Conference on Data Sciences: Latest Articles

Scalability of Correlation Clustering Through Constraint Reduction

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 2014-03-21 DOI: 10.1145/2567688.2567695

Mamata Samal, V. Saradhi, Sukumar Nandi

Abstract: Correlation clustering (CC) is a graph-based clustering method. Edges of the graph are labeled either positive or negative depending on the similarity or dissimilarity between the pair of vertices. The objective of CC is to group the vertices of the induced complete graph so as to maximize the number of positively labeled edges that lie within a group and the number of negatively labeled edges that lie across groups. This objective function is formulated as a semidefinite programming (SDP) problem, which is well studied theoretically and produces encouraging approximation values. In this work we propose a scalable solution for the SDP formulation of correlation clustering (SDP-CC) by reducing the number of constraints. The proposed formulation is solved efficiently using the SDP-NAL tool. It is compared with another scalable variant, namely variable-reduction-based CC. Both scalable formulations are tested on synthetic and real-world data sets whose graph sizes range from 100 to 13000 vertices, and on large-scale benchmark graph data sets whose sizes range from 2395 to 13992 vertices. The proposed formulation is shown to have an edge over the original SDP-CC formulation, the variable reduction variant of SDP-CC, and a constrained clustering method, namely constrained spectral clustering.
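The correlation clustering objective described in the abstract can be illustrated with a small sketch. It only evaluates the objective for a given clustering (counting positive edges inside groups plus negative edges across groups); it does not implement the SDP formulation or the constraint reduction proposed in the paper. The function name `agreements` and the edge-sign dictionary layout are assumptions made for illustration.

```python
from itertools import combinations


def agreements(labels, sign):
    """Count the edges that agree with a clustering.

    labels: list where labels[v] is the cluster id of vertex v.
    sign:   dict mapping each pair (i, j) with i < j to +1 or -1
            (hypothetical layout; the graph is assumed complete).
    """
    total = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = labels[i] == labels[j]
        s = sign[(i, j)]
        # A positive edge agrees when both ends share a cluster;
        # a negative edge agrees when its ends are separated.
        if (s > 0 and same_cluster) or (s < 0 and not same_cluster):
            total += 1
    return total


# Toy graph on 4 vertices: {0, 1} and {2, 3} are similar pairs,
# every other pair is dissimilar.
sign = {(0, 1): +1, (2, 3): +1,
        (0, 2): -1, (0, 3): -1, (1, 2): -1, (1, 3): -1}

best = agreements([0, 0, 1, 1], sign)      # two groups matching the signs
one_group = agreements([0, 0, 0, 0], sign) # everything in a single group
```

On this toy graph the clustering that matches the signs agrees with all six edges, while putting every vertex in one group agrees only with the two positive edges.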

Citations: 0

Characterizing Utilitarian Aggregation of Open Knowledge

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 2014-03-21 DOI: 10.1145/2567688.2567689

S. Srinivasa, Sweety Agrawal, Chinmay Jog, Jayati Deshmukh

Abstract: Recent initiatives in "open data" have resulted in good-quality, freely available, tabular datasets on the web. However, such datasets are fragmented and arbitrarily structured, and are of little utility in isolation. To address this, several "open knowledge" initiatives aim to stitch together open data elements into semantically meaningful structures, but such efforts face unique challenges. We argue in this paper that knowledge aggregation can be of two kinds: encyclopedic aggregation, which aims to elucidate, and utilitarian aggregation, which aims to create actionable knowledge elements. We also argue that utilitarian aggregation is a characteristically different problem from the one addressed by conventional efforts like Freebase or DBpedia, which perform encyclopedic aggregation. In addition, for utilitarian knowledge, we observe that openness is not a binary condition; there is a need to distinguish between knowledge that is "open-ended" and knowledge that is "open." We formalize the notion of open knowledge based on how much knowledge of, or control over, its consumers the creator of a knowledge element has. Based on these arguments, we propose an underlying knowledge representation framework for encoding open utilitarian knowledge.

Citations: 8

Pinned it! A Large Scale Study of the Pinterest Network

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 2014-03-21 DOI: 10.1145/2567688.2567692

Sudip Mittal, Neha Gupta, Prateek Dewan, P. Kumaraguru

Abstract: Pinterest is an image-based online social network that was launched in 2010 and has gained a lot of traction ever since. Within 3 years, Pinterest attained 48.7 million unique users. This stupendous growth makes Pinterest interesting to study and raises multiple questions about its users and content. We characterized Pinterest on the basis of large-scale crawls of 3.3 million user profiles and 58.8 million pins. In particular, we explored various attributes of users, pins, boards, pin sources, and user locations in detail, and performed topical analysis of user-generated textual content. The characterization revealed the most prominent topics among users and pins, the top image sources, and the geographical distribution of users on Pinterest. We then tried to predict the gender of American users based on a set of profile, network, and content features, and achieved an accuracy of 73.17% with a J48 Decision Tree classifier. We then exploited the users' names by comparing them to a corpus of top male and female names in the U.S.A., and achieved an accuracy of 86.18%. To the best of our knowledge, this is the first attempt to predict gender on Pinterest.

Citations: 10

Emotion Recognition from Audio and Visual Data using F-score based Fusion

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 2014-03-21 DOI: 10.1145/2567688.2567690

Abhishek Gera, Arnab Bhattacharya

Abstract: Emotion recognition has been one of the cornerstones of human-computer interaction. Although decades of work have addressed the problem of automatic emotion recognition from either audio or video signals, the fusion of the two modalities is more recent. In this paper, we aim to tackle the problem when both audio and video data are available in a synchronized manner. We address the six basic human emotions, namely, anger, disgust, fear, happiness, sadness, and surprise. We employ an automatic face tracker to extract the different facial points of interest from a video. We then compute feature vectors for each video frame using distances and angles between the tracked points. For audio data, we use the pitch, energy, and MFCC to derive feature vectors for each window as well as for the entire audio signal. We use two standard techniques, GMM-based HMM and SVM, as the base classifiers. We then design a novel fusion method using the F-score of the base classifiers. We first demonstrate that our fusion approach can increase the accuracy of the base classifiers by as much as 5%. Finally, we show that our fusion-based bi-modal emotion recognition method achieves an overall accuracy of 54% on a publicly available database, which is a 9% improvement over the current state of the art.
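The abstract does not spell out how the F-scores enter the fusion, so the following is only one plausible reading, offered as a sketch: each base classifier's per-emotion score is weighted by that classifier's per-emotion F-score (assumed to be measured on held-out data), the weighted scores are summed, and the emotion with the largest total wins. The function `fuse`, the classifier names, and all numbers are hypothetical and are not taken from the paper.

```python
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]


def fuse(scores_by_clf, fscore_by_clf):
    """Late fusion by F-score-weighted voting (an assumed scheme).

    scores_by_clf: {classifier: {emotion: score}} for one test sample.
    fscore_by_clf: {classifier: {emotion: F-score}} from validation data.
    Missing entries are treated as 0.
    """
    fused = {e: 0.0 for e in EMOTIONS}
    for clf, scores in scores_by_clf.items():
        weights = fscore_by_clf.get(clf, {})
        for e in EMOTIONS:
            # A classifier that is reliable for emotion e (high F-score)
            # contributes more to the fused score of e.
            fused[e] += weights.get(e, 0.0) * scores.get(e, 0.0)
    return max(fused, key=fused.get)


# Hypothetical sample: the audio classifier leans toward anger, the video
# classifier toward fear, and the video classifier has the higher F-scores.
scores = {
    "audio": {"anger": 0.6, "fear": 0.4},
    "video": {"anger": 0.45, "fear": 0.55},
}
fscores = {
    "audio": {"anger": 0.3, "fear": 0.3},
    "video": {"anger": 0.8, "fear": 0.8},
}
decision = fuse(scores, fscores)
```

In this made-up sample the fused decision follows the video classifier, because its higher F-scores outweigh the audio classifier's stronger preference for anger.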

Citations: 10

Theme Based Clustering of Tweets

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 2014-03-21 DOI: 10.1145/2567688.2567694

R. M. Tripathy, S. Sharma, Sachindra Joshi, S. Mehta, A. Bagchi

Citations: 8

One, Two, Hash! Counting Hash Tables for Flash Devices

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 2014-03-21 DOI: 10.1145/2567688.2567693

Tyler Clemons, S. M. Faisal, S. Tatikonda, C. Aggarwal, S. Parthasarathy

Abstract: In recent years, advances in hardware technology have led to the increasingly widespread use of flash storage devices. Such devices have clear benefits over traditional hard drives in terms of access latency, bandwidth, and random access capabilities, particularly when reading data. However, there are some interesting tradeoffs. On a relative scale, writing to such devices can be expensive. This is because typical flash devices (NAND technology) are updated in blocks. A minor update to a given block requires the entire block to be erased (also referred to as cleaned) and then rewritten. On the other hand, sequential writes can be two orders of magnitude faster than random writes. In addition, random writes shorten the life of the flash drive because each block can support only a limited number of cleaning operations. Hash tables are a particularly challenging case for the flash drive because this data structure inherently depends on the randomness of the hash function, as opposed to the spatial locality of the data. Thus it is difficult to avoid random writes. In this paper, we study the design landscape for the development of a hash table for flash storage devices. We demonstrate design tradeoffs with a hash table that uses two related hash functions, one of which exhibits a data placement property with respect to the other. Specifically, we focus on three designs based on this general philosophy and evaluate the trade-offs among them along the axes of query performance, insert and update times, and I/O time.

Citations: 0

Efficient and Effective Route Planning in Road Networks with Probabilistic Data using Skyline Paths

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 2014-03-21 DOI: 10.1145/2567688.2567691

Arzoo Katiyar, Arnab Bhattacharya, Shubhadip Mitra

Abstract: In this paper, we study the problem of effective route search in road networks. Given a pair of source and destination locations, the aim is to find a path from the source to the destination that visits k different types of sites in a particular order prescribed by the user. The route planning problem has two objectives to optimize: minimize the total path length and maximize the probability of getting served at the k sites. Since the problem is multi-objective in nature, we utilize the skyline setting and retrieve all skyline paths according to the two aggregated attributes. The naïve way of determining the path lengths can involve a large number of shortest path computations. Although the shortest paths between the sites can be pre-computed, the shortest paths from the source to the first type of site and from the last type of site to the destination cannot be computed offline, as the source and destination are arbitrary points that are available only at runtime. Similarly, the choice and order of the k different types of sites are also specified only at runtime. Since it is prohibitive to compute many shortest paths in a large road network, we employ a heuristic to solve the problem approximately. The shortest path computation from the source to a site (and similarly, from a site to the destination) is approximated by introducing reference points. The reference points are chosen by applying a grid-based partitioning method to the space underlying the road network. We show that this heuristic introduces only an additive error to the distance, and none to the probability of service, while reducing the running times by orders of magnitude.
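The skyline setting mentioned in the abstract can be sketched as a Pareto filter over candidate paths. The sketch assumes each candidate has already been reduced to a pair (total length, probability of service); how those aggregates are computed, and the reference-point heuristic, are outside its scope. The names `dominates` and `skyline` and the sample values are illustrative only.

```python
def dominates(a, b):
    """True if path a is at least as good as b on both attributes and
    strictly better on one. Paths are (length, probability) pairs:
    shorter length is better, higher probability is better."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def skyline(paths):
    """Return the paths that no other path dominates (quadratic scan)."""
    return [p for p in paths if not any(dominates(q, p) for q in paths)]


# Hypothetical candidates: (total length in km, probability of service).
candidates = [(10, 0.5), (12, 0.9), (11, 0.4), (15, 0.9)]
result = skyline(candidates)
```

Here (11, 0.4) is dominated by (10, 0.5), and (15, 0.9) is dominated by (12, 0.9), so only the remaining two paths form the skyline. The quadratic scan is chosen for clarity; it is not meant to reflect the efficiency techniques of the paper.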

Citations: 4

Proceedings of the 1st IKDD Conference on Data Sciences

Proceedings of the 1st IKDD Conference on Data Sciences Pub Date : 1900-01-01 DOI: 10.1145/2567688

Citations: 0