{"title":"Multi-sphere Support Vector Data Description for Outliers Detection on Multi-distribution Data","authors":"Yanshan Xiao, Bo Liu, Longbing Cao, Xindong Wu, Chengqi Zhang, Z. Hao, Fengzhao Yang, Jie Cao","doi":"10.1109/ICDMW.2009.87","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.87","url":null,"abstract":"SVDD has proven to be a powerful tool for outlier detection. However, when detecting outliers in multi-distribution data, that is, data containing several distinct distributions, it is very challenging for SVDD to generate a single hyper-sphere that distinguishes outliers from normal data. Even if such a hyper-sphere can be identified, its performance is usually not good enough. This paper proposes a multi-sphere SVDD approach, named MS-SVDD, for outlier detection on multi-distribution data. First, an adaptive sphere detection method is proposed to detect the data distributions in the dataset. The data is then partitioned according to the identified distributions, and a corresponding SVDD classifier is constructed separately for each partition. Substantial experiments on both artificial and real-world datasets demonstrate that the proposed approach outperforms the original SVDD.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121834786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GLSVM: Integrating Structured Feature Selection and Large Margin Classification","authors":"Hongliang Fei, Brian Quanz, Jun Huan","doi":"10.1109/ICDMW.2009.39","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.39","url":null,"abstract":"High dimensional data challenges current feature selection methods. For many real-world problems we often have prior knowledge about the relationships among features. For example, in microarray data analysis, genes from the same biological pathway are expected to have a similar relationship to the outcome that we aim to predict. Recent regularization methods for the Support Vector Machine (SVM) have achieved great success in performing feature selection and model selection simultaneously for high dimensional data, but they neglect such relationships among features. To build interpretable SVM models, the structure information of features should be incorporated. In this paper, we propose an algorithm, GLSVM, that automatically performs model selection and feature selection in SVMs. To incorporate the prior knowledge of feature relationships, we extend the standard 2-norm SVM and use a penalty function that combines an L2-norm regularization term including the normalized Laplacian of the graph with an L1 penalty. We demonstrate the effectiveness of our methods and compare them with the state of the art on two real-world benchmarks.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116901376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Practical Differentially Private Random Decision Tree Classifier","authors":"G. Jagannathan, Krishnan Pillaipakkamnatt, R. Wright","doi":"10.1109/ICDMW.2009.93","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.93","url":null,"abstract":"In this paper, we study the problem of constructing private classifiers using decision trees, within the framework of differential privacy. We first construct privacy-preserving ID3 decision trees using differentially private sum queries. Our experiments show that for many data sets a reasonable privacy guarantee can only be obtained via this method at a steep cost of accuracy in predictions. We then present a differentially private decision tree ensemble algorithm using the random decision tree approach. We demonstrate experimentally that our approach yields good prediction accuracy even when the size of the datasets is small. We also present a differentially private algorithm for the situation in which new data is periodically appended to an existing database. Our experiments show that our differentially private random decision tree classifier handles data updates in a way that maintains the same level of privacy guarantee.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128400663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncertainty Quantification in the Presence of Limited Climate Model Data with Discontinuities","authors":"K. Sargsyan, C. Safta, B. Debusschere, H. Najm","doi":"10.1109/ICDMW.2009.111","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.111","url":null,"abstract":"Uncertainty quantification in climate models is challenged by the sparsity of the available climate data due to the high computational cost of the model runs. Another feature that prevents classical uncertainty analyses from being easily applicable is the bifurcative behavior in the climate data with respect to certain parameters. A typical example is the Meridional Overturning Circulation in the Atlantic Ocean. The maximum overturning stream function exhibits discontinuity across a curve in the space of two uncertain parameters, namely climate sensitivity and CO2 forcing. We develop a methodology that performs uncertainty quantification in this context in the presence of limited data.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121949707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining Personal Image Collection for Social Group Suggestion","authors":"Jie Yu, Xin Jin, Jiawei Han, Jiebo Luo","doi":"10.1109/ICDMW.2009.77","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.77","url":null,"abstract":"Popular photo-sharing sites have attracted millions of people and helped construct massive social networks in cyberspace. Unlike in traditional social relationships, users actively interact within groups where common interests are shared around certain types of events or topics captured by photos and videos. Contributing images to a group can greatly promote interactions between users and expand their social networks. In this work, we aim to produce accurate predictions of suitable photo-sharing groups from a user's images by mining images both on the Web and in the user's personal collection. To this end, we designed a new approach that clusters popular groups into categories by analyzing the similarity of groups via SimRank. Both visual content and its annotations are integrated to understand the events or topics depicted in the images. Experiments on real user images demonstrate the feasibility of the proposed approach.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"357 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115939530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pattern Mining over Star Schemas in the Onto4AR Framework","authors":"C. Antunes","doi":"10.1109/ICDMW.2009.68","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.68","url":null,"abstract":"Storing data according to the multidimensional model, in particular following star schemas, has proven to be one of the most suitable ways to ease the exploration of data. However, this exploration has been limited to query-based access, relegating the discovery of hidden information to a secondary role. The main reason for this is the inability of traditional mining techniques to deal with several data tables at the same time. In this paper, we propose a new approach to mine patterns among data stored as a star schema, based on a domain-driven framework in which available knowledge is represented in a domain ontology. Pattern mining is performed by an Apriori-based algorithm, D2Apriori, but more efficient algorithms are being implemented and tested in order to solve performance issues related to the large amount of data stored in data warehouses.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124340303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Least-Squares Methods in the Parallel Machine Learning (PML) Framework","authors":"R. Natarajan, Vikas Sindhwani, S. Tatikonda","doi":"10.1109/ICDMW.2009.106","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.106","url":null,"abstract":"We describe parallel methods for solving large-scale, high-dimensional, sparse least-squares problems that arise in machine learning applications such as document classification. The basic idea is to solve a two-class response problem using a fast regression technique based on minimizing a loss function that consists of an empirical squared-error term and one or more regularization terms. We consider the use of Lanczos-based methods for solving these regularized least-squares problems, and present the parallel implementation in the Parallel Machine Learning (PML) framework together with performance results on the IBM Blue Gene/P parallel computer.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116163163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Greedy Optimization for Contiguity-Constrained Hierarchical Clustering","authors":"Diansheng Guo","doi":"10.1109/ICDMW.2009.75","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.75","url":null,"abstract":"The discovery and construction of inherent regions in large spatial datasets is an important task for many research domains such as climate zoning, eco-region analysis, public health mapping, and political redistricting. From the perspective of cluster analysis, it requires that each cluster be geographically contiguous. This paper presents a contiguity-constrained hierarchical clustering and optimization method that can partition a set of spatial objects into a hierarchy of contiguous regions while optimizing an objective function. The method consists of two steps: contiguity-constrained hierarchical clustering and two-way fine-tuning. These two steps are repeated to create a hierarchy of regions. Evaluations and comparisons show that the proposed method consistently outperforms existing methods by a large margin in terms of optimizing the objective function. Moreover, the method is flexible enough to accommodate different objective functions and additional constraints (such as the minimum size of each region), which are useful for various application domains.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125750911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge Transfer among Heterogeneous Information Networks","authors":"E. Xiang, N. Liu, Sinno Jialin Pan, Qiang Yang","doi":"10.1109/ICDMW.2009.100","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.100","url":null,"abstract":"Online recommendation systems are becoming more and more popular with the development of the web. However, a critical problem of such systems is that new users and items are continually added over time. How to overcome the data sparseness for such newly arriving entities becomes an important issue. In this paper, we try to reduce the data sparseness in the link prediction problem by using heterogeneous information networks as auxiliary information sources. We developed two models based on the Collective Matrix Factorization (CMF) framework. We also provide a detailed empirical study of how effectively different information networks can help with two real-world link prediction tasks. We report some preliminary results of our current work and also point out several potential research issues.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129503059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Climate Change Patterns with Multivariate Geovisualization","authors":"Hai Jin, Diansheng Guo","doi":"10.1109/ICDMW.2009.91","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.91","url":null,"abstract":"Climate change has been a challenging and urgent research problem for many related research fields. Climate change trends and patterns are complex, which may involve many factors and vary across space and time. However, most existing visualization and mapping approaches for climate data analysis are limited to one variable or one perspective at a time. For example, it is common to map the surface temperature anomaly at different locations or plot trends of time series. Although such approaches are useful in presenting information and knowledge, they have limited capability to support discovery and understanding of unknown complex patterns from data that span across multiple dimensions. This paper introduces the application of a multivariate geovisualization approach to explore and understand complex climate change patterns across multiple perspectives, including the geographic space, time, and multiple variables.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129136412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}