{"title":"Evaluating the utility of statistical phrases and latent semantic indexing for text classification","authors":"H. Wu, D. Gunopulos","doi":"10.1109/ICDM.2002.1184036","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184036","url":null,"abstract":"The term-based vector space model is a prominent technique for retrieving textual information. In this paper we examine the usefulness of phrases as terms in vector-based document classification. We focus on statistical techniques to extract both adjacent and window phrases from documents. We discover that the positive effect of adding phrase terms is very limited, if we have already achieved good performance using single-word terms, even when SVD/LSI is used as the dimensionality reduction method.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133307167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature selection algorithms: a survey and experimental evaluation","authors":"L. Molina, L. B. Muñoz, À. Nebot","doi":"10.1109/ICDM.2002.1183917","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183917","url":null,"abstract":"In view of the substantial number of existing feature selection algorithms, the need arises to count on criteria that enables to adequately decide which algorithm to use in certain situations. This work assesses the performance of several fundamental algorithms found in the literature in a controlled scenario. A scoring measure ranks the algorithms by taking into account the amount of relevance, irrelevance and redundance on sample data sets. This measure computes the degree of matching between the output given by the algorithm and the known optimal solution. Sample size effects are also studied.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116481699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empirical comparison of various reinforcement learning strategies for sequential targeted marketing","authors":"N. Abe, E. Pednault, Haixun Wang, B. Zadrozny, W. Fan, C. Apté","doi":"10.1109/ICDM.2002.1183879","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183879","url":null,"abstract":"We empirically evaluate the performance of various reinforcement learning methods in applications to sequential targeted marketing. In particular we propose and evaluate a progression of reinforcement learning methods, ranging from the \"direct\" or \"batch\" methods to \"indirect\" or \"simulation based\" methods, and those that we call \"semidirect\" methods that fall between them. We conduct a number of controlled experiments to evaluate the performance of these competing methods. Our results indicate that while the indirect methods can perform better in a situation in which nearly perfect modeling is possible, under the more realistic situations in which the system's modeling parameters have restricted attention, the indirect methods' performance tend to degrade. We also show that semi-direct methods are effective in reducing the amount of computation necessary to attain a given level of performance, and often result in more profitable policies.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123518242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining generalized association rules using pruning techniques","authors":"Yin-Fu Huang, Chieh-Ming Wu","doi":"10.1109/ICDM.2002.1183907","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183907","url":null,"abstract":"The goal of the paper is to mine generalized association rules using pruning techniques. Given a large transaction database and a hierarchical taxonomy tree of the items, we try to find the association rules between the items at different levels in the taxonomy tree under the assumption that original frequent itemsets and association rules have already been generated beforehand In the proposed algorithm GMAR, we use join methods and pruning techniques to generate new generalized association rules. Through several comprehensive experiments, we find that the GMAR algorithm is much better than BASIC and Cumulate algorithms.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124963027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining associated implication networks: computational intermarket analysis","authors":"P. W. Tse, Jiming Liu","doi":"10.1109/ICDM.2002.1184030","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184030","url":null,"abstract":"Current attempts to analyze international financial markets include the use of financial technical analysis and data mining techniques. In this paper, we propose a new approach that incorporates implication networks and association rules to form an associated network structure. The proposed approach explicitly addresses the issue of local vs. global influences between financial markets.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125973153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SLPMiner: an algorithm for finding frequent sequential patterns using length-decreasing support constraint","authors":"Masakazu Seno, G. Karypis","doi":"10.1109/ICDM.2002.1183937","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183937","url":null,"abstract":"Over the years, a variety of algorithms for finding frequent sequential patterns in very large sequential databases have been developed. The key feature in most of these algorithms is that they use a constant support constraint to control the inherently exponential complexity of the problem. In general, patterns that contain only a few items will tend to be interesting if they have good support, whereas long patterns can still be interesting even if their support is relatively small. Ideally, we need an algorithm that finds all the frequent patterns whose support decreases as a function of their length. In this paper we present an algorithm called SLPMiner that finds all sequential patterns that satisfy a length-decreasing support constraint. Our experimental evaluation shows that SLPMiner achieves up to two orders of magnitude of speedup by effectively exploiting the length-decreasing support constraint, and that its runtime increases gradually as the average length of the sequences (and the discovered frequent patterns) increases.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124628758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards automatic generation of query taxonomy: a hierarchical query clustering approach","authors":"Shui-Lung Chuang, Lee-Feng Chien","doi":"10.1109/ICDM.2002.1183888","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183888","url":null,"abstract":"Most previous work on automatic query clustering generated a flat, un-nested partition of query terms. In this work, we discuss the organization of query terms into a hierarchical structure and construct a query taxonomy in an automatic way. The proposed approach is designed based on a hierarchical agglomerative clustering algorithm to hierarchically group similar queries and generate cluster hierarchies using a novel cluster partition technique. The search processes of real-world search engines are combined to obtain highly ranked Web documents as the feature source for each query term. Preliminary experiments show that the proposed approach is effective for obtaining thesaurus information for query terms, and is also feasible for constructing a query taxonomy which provides a basis for in-depth analysis of users' search interests and domain-specific vocabulary on a larger scale.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126264873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative clustering of high dimensional text data augmented by local search","authors":"I. Dhillon, Yuqiang Guan, J. Kogan","doi":"10.1109/ICDM.2002.1183895","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183895","url":null,"abstract":"The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However spherical k-means can often yield qualitatively poor results, especially when cluster sizes are small, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal solution. In this paper, we present a local search procedure, which we call 'first-variation\" that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. An enhancement of first variation allows a chain of such moves in a Kernighan-Lin fashion and leads to a better local maximum. Combining the enhanced first-variation with spherical k-means yields a powerful \"ping-pong\" strategy that often qualitatively improves k-means clustering and is computationally efficient. We present several experimental results to highlight the improvement achieved by our proposed algorithm in clustering high-dimensional and sparse text data.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116823965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining association rules from stars","authors":"Eric Ka Ka Ng, A. Fu, Ke Wang","doi":"10.1109/ICDM.2002.1183919","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1183919","url":null,"abstract":"Association rule mining is an important data mining problem. It is found to be useful for conventional relational data. However, previous work has mostly targeted on mining a single table. In real life, a database is typically made up of multiple tables and one important case is where some of the tables form a star schema. The tables typically correspond to entity sets and joining the tables in a star schema gives relationships among entity sets which can be very interesting information. Hence mining on the join result is an important problem. Based on characteristics of the star schema we propose an efficient algorithm for mining association rules on the join result but without actually performing the join operation. We show that this approach can significantly out-perform the join-then-mine approach even when the latter adopts a fastest known mining algorithm.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117033876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the parameter state space of stacking","authors":"A. Seewald","doi":"10.1109/ICDM.2002.1184029","DOIUrl":"https://doi.org/10.1109/ICDM.2002.1184029","url":null,"abstract":"Ensemble learning schemes are a new field in data mining. While current research concentrates mainly on improving the performance of single learning algorithms, an alternative is to combine learners with different biases. Stacking is the best-known such scheme which tries to combine learners' predictions or confidences via another learning algorithm. However, the adoption of stacking into the data mining community is hampered by its large parameter space, consisting mainly of other learning algorithms: (1) the set of learning algorithms to combine, (2) the meta-learner responsible for the combining; and (3) the type of meta-data to use - confidences or predictions. None of these parameters are obvious choices. Furthermore, little is known about the relation between the parameter settings and performance of stacking. By exploring all of stacking's parameter settings and their interdependencies, we attempt to make stacking a suitable choice for mainstream data mining applications.","PeriodicalId":405340,"journal":{"name":"2002 IEEE International Conference on Data Mining, 2002. Proceedings.","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121562793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}