{"title":"An Information Theoretic Approach to Detection of Minority Subsets in Database","authors":"S. Ando, Einoshin Suzuki","doi":"10.1109/ICDM.2006.19","DOIUrl":"https://doi.org/10.1109/ICDM.2006.19","url":null,"abstract":"Detection of rare and exceptional occurrences in large- scale databases have become an important practice in the field of knowledge discovery and information retrieval. Many databases include large amount of noise or irrelevant data, whose distribution often overlaps with the subsets of exceptional data containing useful knowledge. This paper addresses the problem of finding a small subset of \"minority\" data whose distribution overlaps with, but are exceptional to or inconsistent with that of the majority of the database. In such a case, conventional distance-based or density-based approaches in Outlier Detection are ineffective due to their dependence on the structure of the majority or the prerequisite of critical parameters. We formalize the task as an estimation of a model of the minority subset which provides a simple description of the subset and yet maintains divergence from that of the majority. This estimation is formalized as a minimization problem using an information theoretic framework of Rate Distortion theory. We further introduce conditions of the majority to derive an objective function which factorizes the property of the minority and dependence to the structure of the majority. The proposed method shows improvements from conventional approaches in artificial data and a promising result in document retrieval problem.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121808601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Simple Yet Effective Data Clustering Algorithm","authors":"S. Vadapalli, Satyanarayana R. Valluri, K. Karlapalem","doi":"10.1109/ICDM.2006.9","DOIUrl":"https://doi.org/10.1109/ICDM.2006.9","url":null,"abstract":"In this paper, we use a simple concept based on k-reverse nearest neighbor digraphs, to develop a framework RECORD for clustering and outlier detection. We developed three algorithms - (i) RECORD algorithm (requires one parameter), (ii) Agglomerative RECORD algorithm (no parameters required) and (iii) Stability-based RECORD algorithm (no parameters required). Our experimental results with published datasets, synthetic and real-life datasets show that RECORD not only handles noisy data, but also identifies the relevant clusters. Our results are as good as (if not better than) the results got from other algorithms.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116928930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COSMIC: Conceptually Specified Multi-Instance Clusters","authors":"H. Kriegel, A. Pryakhin, Matthias Schubert, A. Zimek","doi":"10.1109/ICDM.2006.46","DOIUrl":"https://doi.org/10.1109/ICDM.2006.46","url":null,"abstract":"Recently, more and more applications represent data objects as sets of feature vectors or multi-instance objects. In this paper, we propose COSMIC, a method for deriving concept lattices from multi-instance data based on hierarchical density-based clustering. The found concepts correspond to groups or clusters of multi-instance objects having similar instances in common. We demonstrate that COSMIC outperforms compared methods with respect to efficiency and cluster quality and is capable to extract interesting patterns in multi-instance data sets.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127978791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bregman Bubble Clustering: A Robust, Scalable Framework for Locating Multiple, Dense Regions in Data","authors":"Gunjan Gupta, Joydeep Ghosh","doi":"10.1109/ICDM.2006.32","DOIUrl":"https://doi.org/10.1109/ICDM.2006.32","url":null,"abstract":"In traditional clustering, every data point is assigned to at least one cluster. On the other extreme, one class clustering algorithms proposed recently identify a single dense cluster and consider the rest of the data as irrelevant. However, in many problems, the relevant data forms multiple natural clusters. In this paper, we introduce the notion of Bregman bubbles and propose Bregman bubble clustering (BBC) that seeks k dense Bregman bubbles in the data. We also present a corresponding generative model, soft BBC, and show several connections with Bregman clustering, and with a one class clustering algorithm. Empirical results on various datasets show the effectiveness of our method.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132814870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Blocking: Learning to Scale Up Record Linkage","authors":"M. Bilenko, B. Kamath, R. Mooney","doi":"10.1109/ICDM.2006.13","DOIUrl":"https://doi.org/10.1109/ICDM.2006.13","url":null,"abstract":"Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index- based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133049729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recommendation on Item Graphs","authors":"Fei Wang, Shengchao Ma, Liuzhong Yang, Ta-Hsin Li","doi":"10.1109/ICDM.2006.133","DOIUrl":"https://doi.org/10.1109/ICDM.2006.133","url":null,"abstract":"A novel scheme for item-based recommendation is proposed in this paper. In our framework, the items are described by an undirected weighted graph Q = (V,epsiv). V is the node set which is identical to the item set, and epsiv is the edge set. Associate with each edge eij isin epsiv is a weight omegaij ges 0, which represents similarity between items i and j. Without the loss of generality, we assume that any user's ratings to the items should be sufficiently smooth with respect to the intrinsic structure of the items, i.e., a user should give similar ratings to similar items. A simple algorithm is presented to achieve such a smooth solution. Encouraging experimental results are provided to show the effectiveness of our method.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133388438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Framework for Regional Association Rule Mining in Spatial Datasets","authors":"W. Ding, C. Eick, Jing Wang, Xiaojing Yuan","doi":"10.1109/ICDM.2006.5","DOIUrl":"https://doi.org/10.1109/ICDM.2006.5","url":null,"abstract":"The immense explosion of geographically referenced data calls for efficient discovery of spatial knowledge. One of the special challenges for spatial data mining is that information is usually not uniformly distributed in spatial datasets. Consequently, the discovery of regional knowledge is of fundamental importance for spatial data mining. This paper centers on discovering regional association rules in spatial datasets. In particular, we introduce a novel framework to mine regional association rules relying on a given class structure. A reward-based regional discovery methodology is introduced, and a divisive, grid-based supervised clustering algorithm is presented that identifies interesting subregions in spatial datasets. Then, an integrated approach is discussed to systematically mine regional rules. The proposed framework is evaluated in a real-world case study that identifies spatial risk patterns from arsenic in the Texas water supply.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132103891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Reference-Based Approach to Outlier Detection in Large Datasets","authors":"Yaling Pei, Osmar R Zaiane, Yong Gao","doi":"10.1109/ICDM.2006.17","DOIUrl":"https://doi.org/10.1109/ICDM.2006.17","url":null,"abstract":"A bottleneck to detecting distance and density based outliers is that a nearest-neighbor search is required for each of the data points, resulting in a quadratic number of pairwise distance evaluations. In this paper, we propose a new method that uses the relative degree of density with respect to a fixed set of reference points to approximate the degree of density defined in terms of nearest neighbors of a data point. The running time of our algorithm based on this approximation is 0(Rn log n) where n is the size of dataset and R is the number of reference points. Candidate outliers are ranked based on the outlier score assigned to each data point. Theoretical analysis and empirical studies show that our method is effective, efficient, and highly scalable to very large datasets.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132587818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Method for Detecting Outlying Subspaces in High-dimensional Databases Using Genetic Algorithm","authors":"Ji Zhang, Q. Gao, Hai H. Wang","doi":"10.1109/ICDM.2006.6","DOIUrl":"https://doi.org/10.1109/ICDM.2006.6","url":null,"abstract":"Detecting outlying subspaces is a relatively new research problem in outlier-ness analysis for high-dimensional data. An outlying subspace for a given data point p is the sub- space in which p is an outlier. Outlying subspace detection can facilitate a better characterization process for the detected outliers. It can also enable outlier mining for high- dimensional data to be performed more accurately and efficiently. In this paper, we proposed a new method using genetic algorithm paradigm for searching outlying subspaces efficiently. We developed a technique for efficiently computing the lower and upper bounds of the distance between a given point and its kth nearest neighbor in each possible subspace. These bounds are used to speed up the fitness evaluation of the designed genetic algorithm for outlying subspace detection. We also proposed a random sampling technique to further reduce the computation of the genetic algorithm. The optimal number of sampling data is specified to ensure the accuracy of the result. We show that the proposed method is efficient and effective in handling outlying subspace detection problem by a set of experiments conducted on both synthetic and real-life datasets.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126797445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dimension Reduction for Supervised Ordering","authors":"Toshihiro Kamishima, S. Akaho","doi":"10.1109/ICDM.2006.53","DOIUrl":"https://doi.org/10.1109/ICDM.2006.53","url":null,"abstract":"Ordered lists of objects are widely used as representational forms. Such ordered objects include Web search results and best-seller lists. Techniques for processing such ordinal data are being developed, particularly methods for a supervised ordering task: i.e., learning functions used to sort objects from sample orders. In this article, we propose two dimension reduction methods specifically designed to improve prediction performance in a supervised ordering task.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115963231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}