{"title":"Incorporating User Provided Constraints into Document Clustering","authors":"Yanhua Chen, M. Rege, Ming Dong, Jing Hua","doi":"10.1109/ICDM.2007.67","DOIUrl":"https://doi.org/10.1109/ICDM.2007.67","url":null,"abstract":"Document clustering without any prior knowledge or background information is a challenging problem. In this paper, we propose SS-NMF: a semi-supervised non-negative matrix factorization framework for document clustering. In SS-NMF, users are able to provide supervision for document clustering in terms of pairwise constraints on a few documents specifying whether they \"must\" or \"cannot\" be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the document-document similarity matrix to infer the document clusters. Theoretically, we show that SS-NMF provides a general framework for semi-supervised clustering and that existing approaches can be considered as special cases of SS-NMF. Through extensive experiments conducted on publicly available data sets, we demonstrate the superior performance of SS-NMF for clustering documents.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124766578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training Conditional Random Fields by Periodic Step Size Adaptation for Large-Scale Text Mining","authors":"Han-Shen Huang, Yu-Ming Chang, Chun-Nan Hsu","doi":"10.1109/ICDM.2007.39","DOIUrl":"https://doi.org/10.1109/ICDM.2007.39","url":null,"abstract":"For applications with consecutive incoming training examples, on-line learning has the potential to achieve a likelihood as high as off-line learning without scanning all available training examples and usually has a much smaller memory footprint. To train CRFs on-line, this paper presents the Periodic Step size Adaptation (PSA) method to dynamically adjust the learning rates in stochastic gradient descent. We applied our method to three large scale text mining tasks. Experimental results show that PSA outperforms the best off-line algorithm, L-BFGS, by many hundred times, and outperforms the best on-line algorithm, SMD, by an order of magnitude in terms of the number of passes required to scan the training data set.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124978381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sampling for Sequential Pattern Mining: From Static Databases to Data Streams","authors":"Chedy Raïssi, P. Poncelet","doi":"10.1109/ICDM.2007.82","DOIUrl":"https://doi.org/10.1109/ICDM.2007.82","url":null,"abstract":"Sequential pattern mining is an active field in the domain of knowledge discovery. Recently, with the constant progress in hardware technologies, real-world databases tend to grow larger and the hypothesis that a database can be loaded into main-memory for sequential pattern mining purposes is no longer valid. Furthermore, the new model of data as a continuous and potentially infinite flow, known as the data stream model, calls for a pre-processing step to ease the mining operations. Since the database size is the most influential factor for mining algorithms, we examine the use of sampling over static databases to get approximate mining results with an upper bound on the error rate. Moreover, we extend this sampling analysis and present an algorithm based on reservoir sampling to cope with sequential pattern mining over data streams. We demonstrate with empirical results that our sampling methods are efficient and that sequence mining remains accurate over static databases and data streams.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116681543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Chosen Few: On Identifying Valuable Patterns","authors":"Björn Bringmann, Albrecht Zimmermann","doi":"10.1109/ICDM.2007.85","DOIUrl":"https://doi.org/10.1109/ICDM.2007.85","url":null,"abstract":"Constrained pattern mining extracts patterns based on their individual merit. Usually this results in far more patterns than a human expert or a machine learning technique could make use of. Often different patterns or combinations of patterns cover a similar subset of the examples, thus being redundant and not carrying any new information. To remove the redundant information contained in such pattern sets, we propose a general heuristic approach for selecting a small subset of patterns. We identify several selection techniques for use in this general algorithm and evaluate those on several data sets. The results show that the technique succeeds in severely reducing the number of patterns, while at the same time apparently retaining much of the original information. Additionally the experiments show that reducing the pattern set indeed improves the quality of classification results. Both results show that the approach is very well suited for the goals we aim at.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126959275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ORIGAMI: Mining Representative Orthogonal Graph Patterns","authors":"M. Hasan, V. Chaoji, Saeed Salem, J. Besson, Mohammed J. Zaki","doi":"10.1109/ICDM.2007.45","DOIUrl":"https://doi.org/10.1109/ICDM.2007.45","url":null,"abstract":"In this paper, we introduce the concept of alpha-orthogonal patterns to mine a representative set of graph patterns. Intuitively, two graph patterns are alpha-orthogonal if their similarity is bounded above by alpha. Each alpha-orthogonal pattern is also a representative for those patterns that are at least beta similar to it. Given user defined alpha, beta ∈ [0,1], the goal is to mine an alpha-orthogonal, beta-representative set that minimizes the set of unrepresented patterns. We present ORIGAMI, an effective algorithm for mining the set of representative orthogonal patterns. ORIGAMI first uses a randomized algorithm to randomly traverse the pattern space, seeking previously unexplored regions, to return a set of maximal patterns. ORIGAMI then extracts an alpha-orthogonal, beta-representative set from the mined maximal patterns. We show the effectiveness of our algorithm on a number of real and synthetic datasets. In particular, we show that our method is able to extract high quality patterns even in cases where existing enumerative graph mining methods fail to do so.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"04 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129040680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Support Vector Approach to Censored Targets","authors":"Pannagadatta K. Shivaswamy, Wei Chu, Martin Jansche","doi":"10.1109/ICDM.2007.93","DOIUrl":"https://doi.org/10.1109/ICDM.2007.93","url":null,"abstract":"Censored targets, such as the time to events in survival analysis, can generally be represented by intervals on the real line. In this paper, we propose a novel support vector technique (named SVCR) for regression on censored targets. SVCR inherits the strengths of support vector methods, such as a globally optimal solution by convex programming, fast training speed and strong generalization capacity. In contrast to ranking approaches to survival analysis, our approach is able not only to achieve superior ordering performance, but also to predict the survival time very well. Experiments show a significant performance improvement when the majority of the training data is censored. Experimental results on several survival analysis datasets demonstrate that SVCR is very competitive against classical survival analysis models.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"53 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124199399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilevel Belief Propagation for Fast Inference on Markov Random Fields","authors":"L. Xiong, Fei Wang, Changshui Zhang","doi":"10.1109/ICDM.2007.9","DOIUrl":"https://doi.org/10.1109/ICDM.2007.9","url":null,"abstract":"Graph-based inference plays an important role in many mining and learning tasks. Among all the solvers for this problem, belief propagation (BP) provides a general and efficient way to derive approximate solutions. However, for large scale graphs the computational cost of BP is still demanding. In this paper, we propose a multilevel algorithm to accelerate belief propagation on Markov Random Fields (MRF). First, we coarsen the original graph to get a smaller one. Then, BP is applied on the new graph to get a coarse result. Finally the coarse solution is efficiently refined back to derive the original solution. Unlike traditional multi-resolution approaches, our method features adaptive coarsening and efficient refinement. The above process can be recursively applied to reduce the computational cost remarkably. We theoretically justify the feasibility of our method on Gaussian MRFs, and empirically show that it is also effectual on discrete MRFs. The effectiveness of our method is verified in experiments on various inference tasks.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"15 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113970731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Failure Prediction in IBM BlueGene/L Event Logs","authors":"Yinglung Liang, Yanyong Zhang, Hui Xiong, R. Sahoo","doi":"10.1109/ICDM.2007.46","DOIUrl":"https://doi.org/10.1109/ICDM.2007.46","url":null,"abstract":"Frequent failures are becoming a serious concern to the community of high-end computing, especially when the applications and the underlying systems rapidly grow in size and complexity. In order to develop effective fault-tolerant strategies, there is a critical need to predict failure events. To this end, we have collected detailed event logs from IBM BlueGene/L, which has 128 K processors, and is currently the fastest supercomputer in the world. In this study, we first show how the event records can be converted into a data set that is appropriate for running classification techniques. Then we apply classifiers on the data, including RIPPER (a rule-based classifier), Support Vector Machines (SVMs), a traditional Nearest Neighbor method, and a customized Nearest Neighbor method. We show that the customized nearest neighbor approach can outperform RIPPER and SVMs in terms of both coverage and precision. The results suggest that the customized nearest neighbor approach can be used to alleviate the impact of failures.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131287479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cocktail Ensemble for Regression","authors":"Yang Yu, Zhi-Hua Zhou, K. Ting","doi":"10.1109/ICDM.2007.60","DOIUrl":"https://doi.org/10.1109/ICDM.2007.60","url":null,"abstract":"This paper is motivated to improve the performance of individual ensembles using a hybrid mechanism in the regression setting. Based on an error-ambiguity decomposition, we formally analyze the optimal linear combination of two base ensembles, which is then extended to multiple individual ensembles via pairwise combinations. The Cocktail ensemble approach is proposed based on this analysis. Experiments over a broad range of data sets show that the proposed approach outperforms the individual ensembles, two other methods of ensemble combination, and two state-of-the-art regression approaches.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125200734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prism: A Primal-Encoding Approach for Frequent Sequence Mining","authors":"K. Gouda, M. Hassaan, Mohammed J. Zaki","doi":"10.1109/ICDM.2007.33","DOIUrl":"https://doi.org/10.1109/ICDM.2007.33","url":null,"abstract":"Sequence mining is one of the fundamental data mining tasks. In this paper we present a novel approach called Prism, for mining frequent sequences. Prism utilizes a vertical approach for enumeration and support counting, based on the novel notion of prime block encoding, which in turn is based on prime factorization theory. Via an extensive evaluation on both synthetic and real datasets, we show that Prism outperforms popular sequence mining methods like SPADE [10], PrefixSpan [6] and SPAM [2], by an order of magnitude or more.","PeriodicalId":233758,"journal":{"name":"Seventh IEEE International Conference on Data Mining (ICDM 2007)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134074155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}