{"title":"Probabilistic Enhanced Mapping with the Generative Tabular Model","authors":"R. Priam, M. Nadif","doi":"10.1109/ICDM.2006.128","DOIUrl":"https://doi.org/10.1109/ICDM.2006.128","url":null,"abstract":"Visualization of the massive datasets needs new methods which are able to quickly and easily reveal their contents. The projection of the data cloud is an interesting paradigm in spite of its difficulty to be explored when data plots are too numerous. So we study a new way to show a bidimensional projection from a multidimensional data cloud: our generative model constructs a tabular view of the projected cloud. We are able to show the high densities areas by their non equidistributed discretization. This approach is an alternative to the self-organizing map when a projection does already exist. The resulting pixel views of a dataset are illustrated by projecting a data sample of real images: it becomes possible to observe how are laid out the class labels or the frequencies of a group of modalities without being lost because of a zoom enlarging change for instance. The conclusion gives perspectives to this original promising point of view to get a readable projection for a statistical data analysis of large data samples.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125242000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Nearest Neighbor Classifier Using Tabu Search and Ensemble Distance Metrics","authors":"M. Tahir, Jim E. Smith","doi":"10.1109/ICDM.2006.86","DOIUrl":"https://doi.org/10.1109/ICDM.2006.86","url":null,"abstract":"The nearest-neighbor (NN) classifier has long been used in pattern recognition, exploratory data analysis, and data mining problems. A vital consideration in obtaining good results with this technique is the choice of distance function, and correspondingly which features to consider when computing distances between samples. In this paper, a new ensemble technique is proposed to improve the performance of NN classifier. The proposed approach combines multiple NN classifiers, where each classifier uses a different distance function and potentially a different set of features (feature vector). These feature vectors are determined for each distance metric using Simple Voting Scheme incorporated in Tabu Search (TS). The proposed ensemble classifier with different distance metrics and different feature vectors (TS-DF/NN) is evaluated using various benchmark data sets from UCI Machine Learning Repository. Results have indicated a significant increase in the performance when compared with various well-known classifiers. Furthermore, the proposed ensemble method is also compared with ensemble classifier using different distance metrics but with same feature vector (with or without Feature Selection (FS)).","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"27 18","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114017535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Mining of Frequent Query Patterns from XML Queries for Caching","authors":"Guoliang Li, Jianhua Feng, Jianyong Wang, Yong Zhang, Lizhu Zhou","doi":"10.1109/ICDM.2006.88","DOIUrl":"https://doi.org/10.1109/ICDM.2006.88","url":null,"abstract":"Existing studies for mining frequent XML query patterns mainly introduce a straightforward candidate generate-and-test strategy and compute frequencies of candidate query patterns from scratch periodically by checking the entire transaction database, which consists of XML query patterns transformed from user queries. However, it is nontrivial to maintain such discovered frequent patterns in real XML databases because there may incur frequent updates that may not only invalidate some existing frequent query patterns but also generate some new frequent ones. Accordingly, existing proposals are inefficient for the evolution of the transaction database. To address these problems, this paper presents an efficient algorithm IPS-FXQPMiner for mining frequent XML query patterns without candidate maintenance and costly tree-containment checking. We transform XML queries into sequences through a one- to-one mapping and then mine the frequent sequences to generate frequent XML query patterns. More importantly, based on IPS-FXQPMiner, an efficient incremental algorithm, Incre-FXQPMiner is proposed to incrementally mine frequent XML query patterns, which can minimize the I/O and computation requirements for handling incremental updates. Our experimental study on various real-life datasets demonstrates the efficiency and scalability of our algorithms over previous known alternatives.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128139391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity","authors":"Eric Bae, J. Bailey","doi":"10.1109/ICDM.2006.37","DOIUrl":"https://doi.org/10.1109/ICDM.2006.37","url":null,"abstract":"Cluster analysis has long been a fundamental task in data mining and machine learning. However, traditional clustering methods concentrate on producing a single solution, even though multiple alternative clusterings may exist. It is thus difficult for the user to validate whether the given solution is in fact appropriate, particularly for large and complex datasets. In this paper we explore the critical requirements for systematically finding a new clustering, given that an already known clustering is available and we also propose a novel algorithm, COALA, to discover this new clustering. Our approach is driven by two important factors; dissimilarity and quality. These are especially important for finding a new clustering which is highly informative about the underlying structure of data, but is at the same time distinctively different from the provided clustering. We undertake an experimental analysis and show that our method is able to outperform existing techniques, for both synthetic and real datasets.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133301425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying Data Mining to Pseudo-Relevance Feedback for High Performance Text Retrieval","authors":"Xiangji Huang, Y. Huang, M. Wen, Aijun An, Y. Liu, Josiah Poon","doi":"10.1109/ICDM.2006.22","DOIUrl":"https://doi.org/10.1109/ICDM.2006.22","url":null,"abstract":"In this paper, we investigate the use of data mining, in particular the text classification and co-training techniques, to identify more relevant passages based on a small set of labeled passages obtained from the blind feedback of a retrieval system. The data mining results are used to expand query terms and to re-estimate some of the parameters used in a probabilistic weighting function. We evaluate the data mining based feedback method on the TREC HARD data set. The results show that data mining can be successfully applied to improve the text retrieval performance. We report our experimental findings in detail.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133963687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Solution Path for Semi-Supervised Classification with Manifold Regularization","authors":"G. Wang, Tao Chen, D. Yeung, F. Lochovsky","doi":"10.1109/ICDM.2006.150","DOIUrl":"https://doi.org/10.1109/ICDM.2006.150","url":null,"abstract":"With very low extra computational cost, the entire solution path can be computed for various learning algorithms like support vector classification (SVC) and support vector regression (SVR). In this paper, we extend this promising approach to semi-supervised learning algorithms. In particular, we consider finding the solution path for the Laplacian support vector machine (LapSVM) which is a semi-supervised classification model based on manifold regularization. One advantage of the this algorithm is that the coefficient path is piecewise linear with respect to the regularization parameter, hence its computational complexity is quadratic in the number of labeled examples.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134645601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"bitSPADE: A Lattice-based Sequential Pattern Mining Algorithm Using Bitmap Representation","authors":"S. Aseervatham, A. Osmani, E. Viennet","doi":"10.1109/ICDM.2006.28","DOIUrl":"https://doi.org/10.1109/ICDM.2006.28","url":null,"abstract":"Sequential pattern mining allows to discover temporal relationship between items within a database. The patterns can then be used to generate association rules. When the databases are very large, the execution speed and the memory usage of the mining algorithm become critical parameters. Previous research has focused on either one of the two parameters. In this paper, we present bitSPADE, a novel algorithm that combines the best features of SPAM, one of the fastest algorithm, and SPADE, one of the most memory efficient algorithm. Moreover, we introduce a new pruning strategy that enables bitSPADE to reach high performances. Experimental evaluations showed that bitSPADE ensures an efficient tradeoff between speed and memory usage by outperforming SPADE by both speed and memory usage factors more than 3.4 and SPAM by a memory consumption factor up to more than an order of magnitude.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121847359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Object Identification with Constraints","authors":"Steffen Rendle, L. Schmidt-Thieme","doi":"10.1109/ICDM.2006.117","DOIUrl":"https://doi.org/10.1109/ICDM.2006.117","url":null,"abstract":"Object identification aims at identifying different representations of the same object based on noisy attributes such as descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for solving this task, almost all of them based on similarity functions of a pair of objects. Although today the similarity functions are learned from a set of labeled training data, the structural information given by the labeled data is not used. By formulating a generic model for object identification we show how almost any proposed identification model can easily be extended for satisfying structural constraints. Therefore we propose a model that uses structural information given as pairwise constraints to guide collective decisions about object identification in addition to a learned similarity measure. We show with empirical experiments on public and on real-life data that combining both structural information and attribute-based similarity enormously increases the overall performance for object identification tasks.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129233534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LOCI: Load Shedding through Class-Preserving Data Acquisition","authors":"Peng Wang, Haixun Wang, Wei Wang, Baile Shi, Philip S. Yu","doi":"10.1109/ICDM.2006.100","DOIUrl":"https://doi.org/10.1109/ICDM.2006.100","url":null,"abstract":"An avalanche of data available in the stream form is overstretching our data analyzing ability. In this paper, we propose a novel load shedding method that enables fast and accurate stream data classification. We transform input data so that its class information concentrates on a few features, and we introduce a progressive classifier that makes prediction with partial input. We take advantage of stream data's temporal locality -for example, readings from a temperature sensor usually do not change dramatically over a short period of time -for load shedding. We first show that temporal locality of the original data is preserved by our transform, then we utilize positive and negative knowledge about the data (which is of much smaller size than the data itself) for classification. We employ both analytical and empirical analysis to demonstrate the advantage of our approach.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128656062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Getting the Most Out of Ensemble Selection","authors":"R. Caruana, Art Munson, Alexandru Niculescu-Mizil","doi":"10.1109/ICDM.2006.76","DOIUrl":"https://doi.org/10.1109/ICDM.2006.76","url":null,"abstract":"We investigate four previously unexplored aspects of ensemble selection, a procedure for building ensembles of classifiers. First we test whether adjusting model predictions to put them on a canonical scale makes the ensembles more effective. Second, we explore the performance of ensemble selection when different amounts of data are available for ensemble hillclimbing. Third, we quantify the benefit of ensemble selection's ability to optimize to arbitrary metrics. Fourth, we study the performance impact of pruning the number of models available for ensemble selection. Based on our results we present improved ensemble selection methods that double the benefit of the original method.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122377498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}