{"title":"Redistricting Using Heuristic-Based Polygonal Clustering","authors":"Deepti Joshi, Leen-Kiat Soh, A. Samal","doi":"10.1109/ICDM.2009.126","DOIUrl":"https://doi.org/10.1109/ICDM.2009.126","url":null,"abstract":"Redistricting is the process of dividing a geographic area into districts or zones. This process has been considered in the past as a problem that is computationally too complex for an automated system to be developed that can produce unbiased plans. In this paper we present a novel method for redistricting a geographic area using a heuristic-based approach for polygonal spatial clustering. While clustering geospatial polygons several complex issues need to be addressed – such as: removing order dependency, clustering all polygons assuming no outliers, and strategically utilizing domain knowledge to guide the clustering process. In order to address these special needs, we have developed the Constrained Polygonal Spatial Clustering (CPSC) algorithm that holistically integrates do-main knowledge in the form of cluster-level and instance-level constraints and uses heuristic functions to grow clusters. In order to illustrate the usefulness of our algorithm we have applied it to the problem of formation of unbiased congressional districts. Furthermore, we compare and contrast our algorithm with two other approaches proposed in the literature for redistricting, namely – graph partitioning and simulated annealing.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125646646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations","authors":"U. Kang, Charalampos E. Tsourakakis, C. Faloutsos","doi":"10.1109/ICDM.2009.14","DOIUrl":"https://doi.org/10.1109/ICDM.2009.14","url":null,"abstract":"In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on the top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with 6,7 billion edges.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114347627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TrBagg: A Simple Transfer Learning Method and its Application to Personalization in Collaborative Tagging","authors":"Toshihiro Kamishima, Masahiro Hamasaki, S. Akaho","doi":"10.1109/ICDM.2009.9","DOIUrl":"https://doi.org/10.1109/ICDM.2009.9","url":null,"abstract":"The aim of transfer learning is to improve prediction accuracy on a target task by exploiting the training examples for tasks that are related to the target one. Transfer learning has received more attention in recent years, because this technique is considered to be helpful in reducing the cost of labeling. In this paper, we propose a very simple approach to transfer learning: TrBagg, which is the extension of bagging. TrBagg is composed of two stages: Many weak classifiers are first generated as in standard bagging, and these classifiers are then filtered based on their usefulness for the target task. This simplicity makes it easy to work reasonably well without severe tuning of learning parameters. Further, our algorithm equips an algorithmic scheme to avoid negative transfer. We applied TrBagg to personalized tag prediction tasks for social bookmarks Our approach has several convenient characteristics for this task such as adaptation to multiple tasks with low computational cost.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114573427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncoverning Groups via Heterogeneous Interaction Analysis","authors":"Lei Tang, Xufei Wang, Huan Liu","doi":"10.1109/ICDM.2009.20","DOIUrl":"https://doi.org/10.1109/ICDM.2009.20","url":null,"abstract":"With the pervasive availability of Web 2.0 and social networking sites, people can interact with each other easily through various social media. For instance, popular sites like Del.icio.us, Flickr, and YouTube allow users to comment shared content (bookmark, photos, videos), and users can tag their own favorite content. Users can also connect to each other, and subscribe to or become a fan or a follower of others. These diverse individual activities result in a multi-dimensional network among actors, forming cross-dimension group structures with group members sharing certain similarities. It is challenging to effectively integrate the network information of multiple dimensions in order to discover cross-dimension group structures. In this work, we propose a two-phase strategy to identify the hidden structures shared across dimensions in multi-dimensional networks. We extract structural features from each dimension of the network via modularity analysis, and then integrate them all to find out a robust community structure among actors. Experiments on synthetic and real-world data validate the superiority of our strategy, enabling the analysis of collective behavior underneath diverse individual activities in a large scale.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115042838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martin Renqiang Min, D. A. Stanley, Zineng Yuan, A. Bonner, Zhaolei Zhang
{"title":"A Deep Non-linear Feature Mapping for Large-Margin kNN Classification","authors":"Martin Renqiang Min, D. A. Stanley, Zineng Yuan, A. Bonner, Zhaolei Zhang","doi":"10.1109/ICDM.2009.27","DOIUrl":"https://doi.org/10.1109/ICDM.2009.27","url":null,"abstract":"KNN is one of the most popular data mining methods for classification, but it often fails to work well with inappropriate choice of distance metric or due to the presence of numerous class-irrelevant features. Linear feature transformation methods have been widely applied to extract class-relevant information to improve kNN classification, which is very limited in many applications. Kernels have also been used to learn powerful non-linear feature transformations, but these methods fail to scale to large datasets. In this paper, we present a scalable non-linear feature mapping method based on a deep neural network pretrained with Restricted Boltzmann Machines for improving kNN classification in a large-margin framework, which we call DNet-kNN. DNet-kNN can be used for both classification and for supervised dimensionality reduction. The experimental results on two benchmark handwritten digit datasets and one newsgroup text dataset show that DNet-kNN has much better performance than large-margin kNN using a linear mapping and kNN based on a deep autoencoder pretrained with Restricted Boltzmann Machines.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131075857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Effective Approach to Inverse Frequent Set Mining","authors":"A. Guzzo, D. Saccá, Edoardo Serra","doi":"10.1109/ICDM.2009.123","DOIUrl":"https://doi.org/10.1109/ICDM.2009.123","url":null,"abstract":"The inverse frequent set mining problem is the problem of computing a database on which a given collection of itemsets must emerge to be frequent. Earlier studies focused on investigating computational and approximability properties of this problem. In this paper, we face it under the pragmatic perspective of defining heuristic solution approaches that are effective and scalable in real scenarios. In particular, a general formulation of the problem is considered where minimum and maximum support constraints can be defined on each itemset, and where no bound is given beforehand on the size of the resulting output database. Within this setting, an algorithm is proposed that always satisfies the maximum support constraints, but which treats minimum support constraints as soft ones that are enforced as long as possible. A thorough experimentation evidences that minimum support constraints are hardly violated in practice, and that such negligible degradation in accuracy (which is unavoidable due to the theoretical intractability of the problem) is compensated by very good scaling performances.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130568855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New MCA-Based Divisive Hierarchical Algorithm for Clustering Categorical Data","authors":"Tengke Xiong, Shengrui Wang, A. Mayers, E. Monga","doi":"10.1109/ICDM.2009.118","DOIUrl":"https://doi.org/10.1109/ICDM.2009.118","url":null,"abstract":"Clustering categorical data faces two challenges, one is lacking of inherent similarity measure, and the other is that the clusters are prone to being embedded in different subspace. In this paper, we propose the first divisive hierarchical clustering algorithm for categorical data. The algorithm, which is based on Multiple Correspondence Analysis (MCA), is systematic, efficient and effective. In our algorithm, MCA plays an important role in analyzing the data globally. The proposed algorithm has five merits. First, our algorithm yields a dendrogram representing nested groupings of patterns and similarity levels at different granularities. Second, it is parameter-free, fully automatic and, most importantly, requires no assumption regarding the number of clusters. Third, it is independent of the order in which the data are processed. Forth, it is scalable to large data sets; and finally, using the novel data representation and Chi-square distance measures makes our algorithm capable of seamlessly discovering the clusters embedded in the subspaces. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130779360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Viet-An Nguyen, Ee-Peng Lim, Jing Jiang, Aixin Sun
{"title":"To Trust or Not to Trust? Predicting Online Trusts Using Trust Antecedent Framework","authors":"Viet-An Nguyen, Ee-Peng Lim, Jing Jiang, Aixin Sun","doi":"10.1109/ICDM.2009.115","DOIUrl":"https://doi.org/10.1109/ICDM.2009.115","url":null,"abstract":"This paper analyzes the trustor and trustee factors that lead to inter-personal trust using a well studied Trust Antecedent framework in management science cite{mayer}. To apply these factors to trust ranking problem in online rating systems, we derive features that correspond to each factor and develop different trust ranking models. The advantage of this approach is that features relevant to trust can be systematically derived so as to achieve good prediction accuracy. Through a series of experiments on real data from Epinions, we show that even a simple model using the derived features yields good accuracy and outperforms MoleTrust, a trust propagation based model. SVM classifiers using these features also show improvements.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132445115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peter Boyen, F. Neven, D. V. Dyck, A. V. Dijk, R. V. Ham
{"title":"SLIDER: Mining Correlated Motifs in Protein-Protein Interaction Networks","authors":"Peter Boyen, F. Neven, D. V. Dyck, A. V. Dijk, R. V. Ham","doi":"10.1109/ICDM.2009.92","DOIUrl":"https://doi.org/10.1109/ICDM.2009.92","url":null,"abstract":"Correlated motif mining (CMM) is the problem to find overrepresented pairs of patterns, called motif pairs, in interacting protein sequences. Algorithmic solutions for CMM thereby provide a computational method for predicting binding sites for protein interaction. In this paper, we adopt a motif-driven approach where the support of candidate motif pairs is evaluated in the network. We experimentally establish the superiority of the Chi-square-based support measure over other support measures. Furthermore, we obtain that CMM is an NP-hard problem for a large class of support measures (including Chi-square) and reformulate the search for correlated motifs as a combinatorial optimization problem. We then present the method SLIDER which uses local search with a neighborhood function based on sliding motifs and employs the Chi-square-based support measure. We show that SLIDER outperforms existing motif-driven CMM methods and scales to large protein-protein interaction networks.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125523949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Z. Zhu, Weizhu Chen, Chenguang Zhu, Gang Wang, Haixun Wang, Zheng Chen
{"title":"Inverse Time Dependency in Convex Regularized Learning","authors":"Z. Zhu, Weizhu Chen, Chenguang Zhu, Gang Wang, Haixun Wang, Zheng Chen","doi":"10.1109/ICDM.2009.28","DOIUrl":"https://doi.org/10.1109/ICDM.2009.28","url":null,"abstract":"In the conventional regularized learning, training time increases as the training set expands. Recent work on L2 linear SVM challenges this common sense by proposing the inverse time dependency on the training set size. In this paper, we first put forward a Primal Gradient Solver (PGS) to effectively solve the convex regularized learning problem. This solver is based on the stochastic gradient descent method and the Fenchel conjugate adjustment, employing the well-known online strongly convex optimization algorithm with logarithmic regret. We then theoretically prove the inverse dependency property of our PGS, embracing the previous work of the L2 linear SVM as a special case and enable the l_p-norm optimization to run within a bounded sphere, which qualifies more convex loss functions in PGS. We further illustrate this solver in three examples: SVM, logistic regression and regularized least square. Experimental results substantiate the property of the inverse dependency on training data size.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126445681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}