{"title":"Non-negative Matrix Factorization on Manifold","authors":"Deng Cai, Xiaofei He, Xiaoyun Wu, Jiawei Han","doi":"10.1109/ICDM.2008.57","DOIUrl":"https://doi.org/10.1109/ICDM.2008.57","url":null,"abstract":"Recently non-negative matrix factorization (NMF) has received a lot of attentions in information retrieval, computer vision and pattern recognition. NMF aims to find two non-negative matrices whose product can well approximate the original matrix. The sizes of these two matrices are usually smaller than the original matrix. This results in a compressed version of the original data matrix. The solution of NMF yields a natural parts-based representation for the data. When NMF is applied for data representation, a major disadvantage is that it fails to consider the geometric structure in the data. In this paper, we develop a graph based approach for parts-based data representation in order to overcome this limitation. We construct an affinity graph to encode the geometrical information and seek a matrix factorization which respects the graph structure. We demonstrate the success of this novel algorithm by applying it on real world problems.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124931016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Spectral Clustering for Segmenting Spatio-temporal Observations of Multi-agent Systems","authors":"B. Takács, Y. Demiris","doi":"10.1109/ICDM.2008.88","DOIUrl":"https://doi.org/10.1109/ICDM.2008.88","url":null,"abstract":"We examine the application of spectral clustering for breaking up the behavior of a multi-agent system in space and time into smaller, independent elements. We cluster observations of individual entities in order to identify significant changes in the parameter space (like spatial position) and detect temporal alterations of behavior within the same framework. Data is also influenced by knowledge about important events. Clusters are pre-processed at each step of the iterative subdivision to make the algorithm invariant against spatial scaling, rotation, replay speed and varying sampling frequency. A method is presented to balance spatial and temporal segmentation based on the expected group size. We demonstrate our results by analyzing the outcomes of a computer game.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"198 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124373888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering Uncertain Data Using Voronoi Diagrams","authors":"Ben Kao, Sau-dan. Lee, David Wai-Lok Cheung, Wai-Shing Ho, K. F. Chan","doi":"10.1109/ICDM.2008.31","DOIUrl":"https://doi.org/10.1109/ICDM.2008.31","url":null,"abstract":"We study the problem of clustering uncertain objects whose locations are described by probability density functions (pdf). We show that the UK-means algorithm, which generalises the k-means algorithm to handle uncertain objects, is very inefficient. The inefficiency comes from the fact that UK-means computes expected distances (ED) between objects and cluster representatives. For arbitrary pdf's, expected distances are computed by numerical integrations, which are costly operations. We propose pruning techniques that are based on Voronoi diagrams to reduce the number of expected distance calculation. These techniques are analytically proven to be more effective than the basic bounding-box-based technique previous known in the literature. We conduct experiments to evaluate the effectiveness of our pruning techniques and to show that our techniques significantly outperform previous methods.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast and Memory Efficient Mining of High Utility Itemsets in Data Streams","authors":"Hua-Fu Li, Hsin-Yun Huang, Yi-Cheng Chen, Yu-Jiun Liu, Suh-Yin Lee","doi":"10.1109/ICDM.2008.107","DOIUrl":"https://doi.org/10.1109/ICDM.2008.107","url":null,"abstract":"Efficient mining of high utility itemsets has become one of the most interesting data mining tasks with broad applications. In this paper, we proposed two efficient one-pass algorithms, MHUI-BIT and MHUI-TID, for mining high utility itemsets from data streams within a transaction-sensitive sliding window. Two effective representations of item information and an extended lexicographical tree-based summary data structure are developed to improve the efficiency of mining high utility itemsets. Experimental results show that the proposed algorithms outperform than the existing algorithms for mining high utility itemsets from data streams.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130166285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Support Vector Regression for Censored Data (SVRc): A Novel Tool for Survival Analysis","authors":"F. Khan, V. Zubek","doi":"10.1109/ICDM.2008.50","DOIUrl":"https://doi.org/10.1109/ICDM.2008.50","url":null,"abstract":"A crucial challenge in predictive modeling for survival analysis is managing censored observations in the data. The Cox proportional hazards model is the standard tool for the analysis of continuous censored survival data. We propose a novel machine learning algorithm, support vector regression for censored data (SVRc) for improved analysis of medical survival data. SVRc leverages the high-dimensional capabilities of traditional SVR while adapting it for use with censored data through a modified asymmetric loss/penalty function which allows censored (left and right censored) data to be processed. We applied the new algorithm to predict the recurrence and disease progression of prostate cancer, breast cancer and lung cancer. Compared with the traditional Cox model, SVRc achieves significant improvement in overall accuracy as well as in the ability to identify high-risk and low-risk patient populations.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"25 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114102364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hierarchical Algorithm for Clustering Uncertain Data via an Information-Theoretic Approach","authors":"Francesco Gullo, Giovanni Ponti, Andrea Tagarelli, S. Greco","doi":"10.1109/ICDM.2008.115","DOIUrl":"https://doi.org/10.1109/ICDM.2008.115","url":null,"abstract":"In recent years there has been a growing interest in clustering uncertain data. In contrast to traditional, \"sharp\" data representation models, uncertain data objects can be represented in terms of an uncertainty region over which a probability density function (pdf) is defined. In this context, the focus has been mainly on partitional and density-based approaches, whereas hierarchical clustering schemes have drawn less attention. We propose a centroid-linkage-based agglomerative hierarchical algorithm for clustering uncertain objects, named U-AHC. The cluster merging criterion is based on an information-theoretic measure to compute the distance between cluster prototypes. These prototypes are represented as mixture densities that summarize the pdfs of all the uncertain objects in the clusters. Experiments have shown that our method outperforms state-of-the-art clustering algorithms from an accuracy viewpoint while achieving reasonably good efficiency.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114293614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Isolation Forest","authors":"Fei Tony Liu, K. Ting, Zhi-Hua Zhou","doi":"10.1109/ICDM.2008.17","DOIUrl":"https://doi.org/10.1109/ICDM.2008.17","url":null,"abstract":"Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiles normal points. To our best knowledge, the concept of isolation has not been explored in current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest performs favourably to ORCA, a near-linear time complexity distance-based method, LOF and random forests in terms of AUC and processing time, and especially in large data sets. iForest also works well in high dimensional problems which have a large number of irrelevant attributes, and in situations where training set does not contain any anomalies.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129345191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Stacked Models Perform Effective Collective Classification","authors":"A. Fast, David D. Jensen","doi":"10.1109/ICDM.2008.126","DOIUrl":"https://doi.org/10.1109/ICDM.2008.126","url":null,"abstract":"Collective classification techniques jointly infer all class labels of a relational data set, using the inferences about one class label to influence inferences about related class labels. Kou and Cohen recently introduced an efficient relational model based on stacking that, despite its simplicity, has equivalent accuracy to more sophisticated joint inference approaches. Using experiments on both real and synthetic data, we show that the primary cause for the performance of the stacked model is the reduction in bias from learning the stacked model on inferred labels rather than true labels. The reduction in variance due to conditional inference also contributes to the effect but it is not as strong. In addition, we show that the performance of the joint inference and stacked learners can be attributed to an implicit weighting of local and relational features at learning time.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126961298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking","authors":"Loulwah AlSumait, Daniel Barbará, C. Domeniconi","doi":"10.1109/ICDM.2008.140","DOIUrl":"https://doi.org/10.1109/ICDM.2008.140","url":null,"abstract":"This paper presents online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data with no need to access previous data. The dynamics of the proposed approach also provide an efficient mean to track the topics over time and detect the emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, the OLDA has discovered interesting patterns by just analyzing a fraction of data at a time. Our tests also prove the ability of OLDA to align the topics across the epochs with which the evolution of the topics over time is captured. The OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121860844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Method of Combined Feature Extraction for Recognition","authors":"Tingkai Sun, Songcan Chen, Jing-yu Yang, P. Shi","doi":"10.1109/ICDM.2008.28","DOIUrl":"https://doi.org/10.1109/ICDM.2008.28","url":null,"abstract":"Multimodal recognition is an emerging technique to overcome the non-robustness of the unimodal recognition in real applications. Canonical correlation analysis (CCA) has been employed as a powerful tool for feature fusion in the realization of such multimodal system. However, CCA is the unsupervised feature extraction and it does not utilize the class information of the samples, resulting in the constraint of the recognition performance. In this paper, the class information is incorporated into the framework of CCA for combined feature extraction, and a novel method of combined feature extraction for multimodal recognition, called discriminative canonical correlation analysis (DCCA), is proposed. The experiments show that DCCA outperforms some related methods of both unimodal recognition and multimodal recognition.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122020728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}