{"title":"Improving medical/biological data classification performance by wavelet preprocessing","authors":"Qi Li, Tao Li, Shenghuo Zhu, C. Kambhamettu","doi":"10.1109/ICDM.2002.1184022","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184022","url":null,"abstract":"Many real-world datasets contain noise which could degrade the performance of learning algorithms. Motivated by the success of wavelet denoising techniques on image data, we explore a general solution to alleviate the effect of noisy data by wavelet preprocessing for medical/biological data classification. Our experiments are divided into two categories: one is of different classification algorithms on a specific database, and the other is of a specific classification algorithm (decision tree) on different databases. The experimental results show that the wavelet denoising of noisy data is able to improve the accuracies of those classification methods, if the localities of the attributes are strong enough.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"28 1-2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120923507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparison study on algorithms for incremental update of frequent sequences","authors":"Minghua Zhang, B. Kao, Chi Lap Yip","doi":"10.1109/ICDM.2002.1184001","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184001","url":null,"abstract":"The problem of mining frequent sequences is to extract frequently occurring subsequences in a sequence database. Algorithms on this mining problem include GSP, MFS, and SPADE. The problem of incremental update of frequent sequences is to keep track of the set of frequent sequences as the underlying database changes. Previous studies have extended the traditional algorithms to efficiently solve the update problem. These incremental algorithms include ISM, GSP+ and MFS+. Each incremental algorithm has its own characteristics and they have been studied and evaluated separately under different scenarios. This paper presents a comprehensive study on the relative performance of the incremental algorithms as well as their non-incremental counterparts. Our goal is to provide guidelines on the choice of an algorithm for solving the incremental update problem given the various characteristics of a sequence database.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123008621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Solving the fragmentation problem of decision trees by discovering boundary emerging patterns","authors":"Jinyan Li, L. Wong","doi":"10.1109/ICDM.2002.1184021","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184021","url":null,"abstract":"The single coverage constraint discourages a decision tree from containing many significant rules. The loss of significant rules leads to a loss in accuracy. On the other hand, the fragmentation problem causes a decision tree to contain too many minor rules. The presence of minor rules decreases the accuracy. We propose to use emerging patterns to solve these problems. In our approach, many globally significant rules can be discovered. Extensive experimental results on gene expression datasets show that our approach is more accurate than single C4.5 trees, and is also better than bagged or boosted C4.5 trees.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124682146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixtures of ARMA models for model-based time series clustering","authors":"Yimin Xiong, D. Yeung","doi":"10.1109/ICDM.2002.1184037","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184037","url":null,"abstract":"Clustering problems are central to many knowledge discovery and data mining tasks. However, most existing clustering methods can only work with fixed-dimensional representations of data patterns. In this paper we study the clustering of data patterns that are represented as sequences or time series possibly of different lengths. We propose a model-based approach to this problem using mixtures of autoregressive moving average (ARMA) models. We derive an expectation-maximization (EM) algorithm for learning the mixing coefficients as well as the parameters of component models. Experiments were conducted on simulated and real datasets. Results show that our method compares favorably with another method recently proposed by others for similar time series clustering problems.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116326642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation of a least fixpoint operator for fast mining of relational databases","authors":"H. Jamil","doi":"10.1109/ICDM.2002.1184016","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184016","url":null,"abstract":"Recent research has focused on computing large item sets for association rule mining using SQL3 least fixpoint computation, and by exploiting the monotonic nature of the SQL3 aggregate functions such as sum and create view recursive constructs. Such approaches allow us to view mining as an ad hoc querying exercise and treat the efficiency issue as an optimization problem. We present a recursive implementation of a recently proposed least fixpoint operator for computing large item sets from object-relational databases. We present experimental evidence to show that our implementation compares well with several well-regarded and contemporary algorithms for large item set generation.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133937625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting information extraction knowledge for unseen Web sites","authors":"Tak-Lam Wong, Wai Lam","doi":"10.1109/ICDM.2002.1183995","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183995","url":null,"abstract":"We propose a wrapper adaptation framework which aims at adapting a learned wrapper to an unseen Web site. It significantly reduces human effort in constructing wrappers. Our framework makes use of extraction rules previously discovered from a particular site to seek potential training example candidates for an unseen site. Rule generalization and text categorization are employed for finding suitable example candidates. Another feature of our approach is that it makes use of the previously discovered lexicon to classify good training examples automatically for the new site. We conducted extensive experiments to evaluate the quality of the extraction performance and the adaptability of our approach.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132787976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new implementation technique for fast spectral based document retrieval systems","authors":"L. Park, M. Palaniswami, K. Ramamohanarao","doi":"10.1109/ICDM.2002.1183922","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183922","url":null,"abstract":"The traditional methods of spectral text retrieval (FDS, CDS) create an index of spatial data and convert the data to its spectral form at query time. We present a new method of implementing and querying an index containing spectral data which will conserve the high precision performance of the spectral methods, reduce the time needed to resolve the query, and maintain an acceptable size for the index. This is done by taking advantage of the properties of the discrete cosine transform and by applying ideas from vector space document ranking methods.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133205085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On computing condensed frequent pattern bases","authors":"J. Pei, Guozhu Dong, Wei Zou, Jiawei Han","doi":"10.1109/ICDM.2002.1183928","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183928","url":null,"abstract":"Frequent pattern mining has been studied extensively. However, the effectiveness and efficiency of this mining is often limited, since the number of frequent patterns generated is often too large. In many applications it is sufficient to generate and examine only frequent patterns with support frequency in close-enough approximation instead of in full precision. Such a compact but close-enough frequent pattern base is called a condensed frequent pattern-base. In this paper we propose and examine several alternatives for the design, representation, and implementation of such condensed frequent pattern-bases. A few algorithms for computing such pattern-bases are proposed. Their effectiveness at pattern compression and their efficient computation methods are investigated. A systematic performance study is conducted on different kinds of databases, which demonstrates the effectiveness and efficiency of our approach at handling frequent pattern mining in large databases.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115701046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neighborgram clustering. Interactive exploration of cluster neighborhoods","authors":"M. Berthold, Bernd Wiswedel, D. E. Patterson","doi":"10.1109/ICDM.2002.1184004","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184004","url":null,"abstract":"We describe an interactive way to generate a set of clusters for a given data set. The clustering is done by constructing local histograms, which can then be used to visualize, select, and fine-tune potential cluster candidates. The accompanying algorithm can also generate clusters automatically, allowing for an automatic or semi-automatic clustering process where the user only occasionally interacts with the algorithm. We illustrate the ability to automatically identify and visualize clusters using NCI's AIDS Antiviral Screen data set.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124135264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On active learning for data acquisition","authors":"Zhiqiang Zheng, B. Padmanabhan","doi":"10.1109/ICDM.2002.1184002","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184002","url":null,"abstract":"Many applications are characterized by having naturally incomplete data on customers - where data on only some fixed set of local variables is gathered. However, having a more complete picture can help build better models. The naive solution to this problem - acquiring complete data for all customers - is often impractical due to the costs of doing so. A possible alternative is to acquire complete data for \"some\" customers and to use this to improve the models built. The data acquisition problem is determining how many, and which, customers to acquire additional data from. In this paper we suggest using active learning based approaches for the data acquisition problem. In particular, we present initial methods for data acquisition and evaluate these methods experimentally on web usage data and UCI datasets. Results show that the methods perform well and indicate that active learning based methods for data acquisition can be a promising area for data mining research.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125904472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}