{"title":"Compressed Nonnegative Sparse Coding","authors":"Fei Wang, Ping Li","doi":"10.1109/ICDM.2010.162","DOIUrl":"https://doi.org/10.1109/ICDM.2010.162","url":null,"abstract":"Sparse Coding (SC), which models the data vectors as sparse linear combinations over basis vectors, has been widely applied in machine learning, signal processing and neuroscience. In this paper, we propose a dual random projection method to provide an efficient solution to Nonnegative Sparse Coding (NSC) using small memory. Experiments on real world data demonstrate the effectiveness of the proposed method.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130137579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Contrast Co-learning Framework for Generating High Quality Training Data","authors":"Zeyu Zheng, Jun Yan, Shuicheng Yan, Ning Liu, Zheng Chen, Ming Zhang","doi":"10.1109/ICDM.2010.23","DOIUrl":"https://doi.org/10.1109/ICDM.2010.23","url":null,"abstract":"The good performance of most classical learning algorithms generally depends on high quality training data that are clean and unbiased. Such data are becoming much harder to obtain in many real world problems, due to the difficulty of collecting large scale unbiased data and labeling them precisely for training. In this paper, we propose a general Contrast Co-learning (CCL) framework to refine biased and noisy training data when an unbiased yet unlabeled data pool is available. CCL starts with multiple sets of possibly biased and noisy training data and trains a set of classifiers individually. Then, under the assumption that confidently classified samples have higher probabilities of being correctly classified, CCL iteratively and automatically filters out likely data noise and adds confidently classified samples from the unlabeled data pool to correct the bias. Through this process, we can generate a cleaner and unbiased training dataset with theoretical guarantees. Extensive experiments on two public text datasets clearly show that CCL consistently improves classification performance on biased and noisy training data compared with several state-of-the-art classical algorithms.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121320572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One-Class Matrix Completion with Low-Density Factorizations","authors":"Vikas Sindhwani, S. Bucak, Jianying Hu, A. Mojsilovic","doi":"10.1109/ICDM.2010.164","DOIUrl":"https://doi.org/10.1109/ICDM.2010.164","url":null,"abstract":"Consider a typical recommendation problem. A company has historical records of products sold to a large customer base. These records may be compactly represented as a sparse customer-times-product \"who-bought-what\" binary matrix. Given this matrix, the goal is to build a model that recommends which products should be sold next to the existing customer base. Such problems may naturally be formulated as collaborative filtering tasks. However, this is a one-class setting, that is, the only known entries in the matrix are one-valued. If a customer has not bought a product yet, it does not imply that the customer has a low propensity to potentially be interested in that product. In the absence of entries explicitly labeled as negative examples, one may resort to treating unobserved customer-product pairs as either missing data or as surrogate negative instances. In this paper, we propose an approach that explicitly deals with this ambiguity by instead treating the unobserved entries as optimization variables. These variables are optimized in conjunction with learning a weighted, low-rank non-negative matrix factorization (NMF) of the customer-product matrix, similar to how Transductive SVMs implement the low-density separation principle for semi-supervised learning. Experimental results show that our approach gives significantly better recommendations than various competing alternatives on one-class collaborative filtering tasks.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122488644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interval-valued Matrix Factorization with Applications","authors":"Zhiyong Shen, Liang Du, Xukun Shen, Yi-Dong Shen","doi":"10.1109/ICDM.2010.115","DOIUrl":"https://doi.org/10.1109/ICDM.2010.115","url":null,"abstract":"In this paper, we propose the Interval-valued Matrix Factorization (IMF) framework. Matrix Factorization (MF) is a fundamental building block of data mining, and MF techniques such as Nonnegative Matrix Factorization (NMF) and Probabilistic Matrix Factorization (PMF) are widely used in data mining applications. For example, NMF has shown its advantage in Face Analysis (FA), while PMF has been successfully applied to Collaborative Filtering (CF). In this paper, we analyze the data approximation in both FA and CF applications and construct interval-valued matrices to capture these approximation phenomena. We adapt basic NMF and PMF models to interval-valued matrices and propose Interval-valued NMF (I-NMF) as well as Interval-valued PMF (I-PMF). Extensive experiments show that the proposed I-NMF and I-PMF significantly outperform their single-valued counterparts in FA and CF applications.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126441380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Correlated Subspace Clusters in 3D Continuous-Valued Data","authors":"Kelvin Sim, Z. Aung, V. Gopalkrishnan","doi":"10.1109/ICDM.2010.19","DOIUrl":"https://doi.org/10.1109/ICDM.2010.19","url":null,"abstract":"Subspace clusters represent useful information in high-dimensional data. However, mining significant subspace clusters in continuous-valued 3D data, such as stock-financial ratio-year data or gene-sample-time data, is difficult. First, typical metrics either find subspaces with very few objects, or find too many insignificant subspaces, i.e., those which exist by chance. Moreover, typical 3D subspace clustering approaches abound with parameters, which are usually set under biased assumptions, making the mining process a 'guessing game'. We address these concerns by proposing an information theoretic measure that allows us to identify 3D subspace clusters that stand out from the data. We also develop a highly effective, efficient and parameter-robust algorithm, a hybrid of information theoretic and statistical techniques, to mine these clusters. Extensive experiments show that our approach can discover significant 3D subspace clusters embedded in 110 synthetic datasets of varying conditions. We also perform a case study on real-world stock datasets, which shows that our clusters can generate higher profits than those mined by other approaches.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125790937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Experts and Novices in Citizen Science Data for Species Distribution Modeling","authors":"Jun Yu, Weng-Keen Wong, R. Hutchinson","doi":"10.1109/ICDM.2010.103","DOIUrl":"https://doi.org/10.1109/ICDM.2010.103","url":null,"abstract":"Citizen scientists, who are volunteers from the community who participate as field assistants in scientific studies [3], enable research to be performed at much larger spatial and temporal scales than trained scientists can cover. Species distribution modeling [6], which involves understanding species-habitat relationships, is a research area that can benefit greatly from citizen science. The eBird project [16] is one of the largest citizen science programs in existence. By allowing birders to upload observations of bird species to an online database, eBird can provide useful data for species distribution modeling. However, since birders vary in their levels of expertise, the quality of data submitted to eBird is often questioned. In this paper, we develop a probabilistic model called the Occupancy-Detection-Expertise (ODE) model that incorporates the expertise of birders submitting data to eBird. We show that modeling birder expertise can improve the accuracy of predicting observations of a bird species at a site. In addition, the ODE model supports two other tasks: predicting a birder's expertise given their history of eBird checklists, and identifying bird species that are difficult for novices to detect.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134027538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Testing the Significance of Patterns in Data with Cluster Structure","authors":"Niko Vuokko, P. Kaski","doi":"10.1109/ICDM.2010.61","DOIUrl":"https://doi.org/10.1109/ICDM.2010.61","url":null,"abstract":"Clustering is one of the basic operations in data analysis, and the cluster structure of a dataset often has a marked effect on observed patterns in data. Testing whether a data mining result is implied by the cluster structure can give substantial information on the formation of the dataset. We propose a new method for empirically testing the statistical significance of patterns in real-valued data in relation to the cluster structure. The method relies on principal component analysis and is based on the general idea of decomposing the data for the purpose of isolating the null model. We evaluate the performance of the method and the information it provides on various real datasets. Our results show that the proposed method is robust and provides nontrivial information about the origin of patterns in data, such as the source of classification accuracy and the observed correlations between attributes.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132389932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Generalized Linear Threshold Model for Multiple Cascades","authors":"Nishith Pathak, A. Banerjee, J. Srivastava","doi":"10.1109/ICDM.2010.153","DOIUrl":"https://doi.org/10.1109/ICDM.2010.153","url":null,"abstract":"This paper presents a generalized version of the linear threshold model for simulating multiple cascades on a network while allowing nodes to switch between them. The proposed model is shown to be a rapidly mixing Markov chain and the corresponding steady state distribution is used to estimate highly likely states of the cascades' spread in the network. Results on a variety of real world networks demonstrate the high quality of the estimated solution.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131062020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Binary Decision Diagram-Based One-Class Classifier","authors":"Takuro Kutsuna","doi":"10.1109/ICDM.2010.84","DOIUrl":"https://doi.org/10.1109/ICDM.2010.84","url":null,"abstract":"We propose a novel approach to one-class classification problems in which a logical formula is used to estimate the region that covers all examples. A formula is viewed as a model that represents a region and is approximated with respect to its hierarchical local densities. The approximation is performed quite efficiently via direct manipulation of a binary decision diagram, a compressed representation of a Boolean formula. The proposed method has only one parameter to be tuned, and this parameter can be selected properly with the help of the minimum description length principle, which requires no labeled training data. In other words, a one-class classifier is generated from unlabeled training data fully automatically. Experimental results show that the proposed method works quite well on both synthetic data and several realistic datasets.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"212 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116013428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding of Internal Clustering Validation Measures","authors":"Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu","doi":"10.1109/ICDM.2010.35","DOIUrl":"https://doi.org/10.1109/ICDM.2010.35","url":null,"abstract":"Clustering validation has long been recognized as vital to the success of clustering applications. In general, clustering validation can be categorized into two classes: external clustering validation and internal clustering validation. In this paper, we focus on internal clustering validation and present a detailed study of 11 widely used internal clustering validation measures for crisp clustering. We investigate their validation properties from five conventional aspects of clustering. Experimental results show that S_Dbw is the only internal validation measure that performs well in all five aspects, while the other measures have certain limitations in different application scenarios.","PeriodicalId":294061,"journal":{"name":"2010 IEEE International Conference on Data Mining","volume":"237 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116030481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}