2008 Eighth IEEE International Conference on Data Mining: Latest Papers

On Locally Linear Classification by Pairwise Coupling
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.137
F. Chen, Chang-Tien Lu, Arnold P. Boedihardjo
Abstract: Locally linear classification by pairwise coupling addresses a nonlinear classification problem in three basic phases: decompose the classes of complex concepts into linearly separable subclasses, learn a linear classifier for each pair, and combine the pairwise classifiers into a single classifier. A number of methods have been proposed in this framework. However, these methods have two major deficiencies: 1) lack of systematic evaluation of this framework; 2) naive application of clustering algorithms to generate subclasses. This paper proves the equivalence between three popular combination schemas under general settings, defines several global criterion functions for measuring the goodness of subclasses, and presents a supervised greedy clustering algorithm to optimize the proposed criterion functions. Extensive experiments were conducted to validate the effectiveness of the proposed techniques.
Citations: 11
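The third phase described in the abstract above, combining pairwise classifiers into a single classifier, can be illustrated with a minimal sketch. The abstract does not name its three combination schemas, so majority voting is used here purely as an assumption; the subclass labels, weights, and the one-dimensional toy data are hypothetical and not taken from the paper.

```python
from collections import Counter

def linear_decision(w, bias, x):
    """Return True when x lies on the positive side of the hyperplane w.x + bias = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias > 0

def predict_by_voting(pairwise, subclass_to_class, x):
    """Combine pairwise linear classifiers by majority vote (assumed schema).

    pairwise maps a pair (a, b) of subclass labels to (w, bias); a positive
    decision votes for subclass a, a negative one votes for subclass b.
    """
    votes = Counter()
    for (a, b), (w, bias) in pairwise.items():
        votes[a if linear_decision(w, bias, x) else b] += 1
    best_subclass, _ = votes.most_common(1)[0]
    # Map the winning subclass back to its original class.
    return subclass_to_class[best_subclass]

# Hypothetical toy problem on a line: class A is split into subclasses A1 (left)
# and A2 (right), with class B (subclass B1) in between. No single linear
# classifier separates A from B, but each pair of subclasses is separable.
SUBCLASS_TO_CLASS = {"A1": "A", "A2": "A", "B1": "B"}
PAIRWISE = {
    ("A1", "B1"): ([-1.0], -1.0),  # votes A1 when x < -1
    ("B1", "A2"): ([-1.0], 1.0),   # votes B1 when x < 1
    ("A1", "A2"): ([-1.0], 0.0),   # votes A1 when x < 0
}

for point in ([-2.0], [0.2], [2.5]):
    print(point, predict_by_voting(PAIRWISE, SUBCLASS_TO_CLASS, point))
```

In this toy setting the outer points are assigned to class A and the middle point to class B, even though every individual classifier is linear.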
Graph-Based Rare Category Detection
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.122
Jingrui He, Yan Liu, Richard D. Lawrence
Abstract: Rare category detection is the task of identifying examples from rare classes in an unlabeled data set. It is an open challenge in machine learning and plays key roles in real applications such as financial fraud detection, network intrusion detection, astronomy, spam image detection, etc. In this paper, we develop a new graph-based method for rare category detection named GRADE. It makes use of the global similarity matrix motivated by the manifold ranking algorithm, which results in more compact clusters for the minority classes; by selecting examples from the regions where probability density changes the most, it relaxes the assumption that the majority classes and the minority classes are separable. Furthermore, when detailed information about the data set is not available, we develop a modified version of GRADE named GRADE-LI, which only needs an upper bound on the proportion of each minority class as input. Besides working with data with structured features, both GRADE and GRADE-LI can also work with graph data, which cannot be handled by existing rare category detection methods. Experimental results on both synthetic and real data sets demonstrate the effectiveness of the GRADE and GRADE-LI algorithms.
Citations: 48
Classifying High-Dimensional Text and Web Data Using Very Short Patterns
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.139
Hassan H. Malik, J. Kender
Abstract: In this paper, we propose the "democratic classifier", a simple pattern-based classification algorithm that uses very short patterns for classification, and does not rely on the minimum support threshold. Borrowing ideas from democracy, our training phase allows each training instance to vote for an equal number of candidate size-2 patterns. The training instances select patterns by effectively balancing between local, class, and global significance of patterns. The selected patterns are simultaneously added to the model for all applicable classes, and a novel power-law-based weighting scheme adjusts their weights with respect to each class. Results of experiments performed on 121 common text and Web datasets show that our algorithm almost always outperforms state-of-the-art classification algorithms, without any parameter tuning. On 100 real-life Web datasets, the average absolute classification accuracy improvement was as great as 9.4% over SVM, Harmony, C4.5 and KNN. Also, our algorithm ran about 3.5 times faster than the fastest existing pattern-based classification algorithm.
Citations: 14
Organic Pie Charts
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.64
F. Mörchen
Abstract: We present a new visualization of the distance and cluster structure of high dimensional data. It is particularly well suited for analysis tasks of users unfamiliar with complex data analysis techniques, as it builds on the well-known concept of pie charts. The non-linear projection capabilities of Emergent Self-Organizing Maps (ESOM) are used to generate a topology-preserving ordering of the data points on a circle. The distance structure within the high dimensional space is visualized on the circle analogously to the U-Matrix method for two-dimensional SOM. The resulting display resembles pie charts but has an organic structure that naturally emerges from the data. Pie segments correspond to groups of similar data points. Boundaries between segments represent low density regions with larger distances among neighboring points in the high dimensional space. The representation of distances in the form of a periodic sequence of values makes time series segmentation applicable to automated clustering of the data that is in sync with the visualization. We discuss the usefulness of the method on a variety of data sets to demonstrate the applicability in applications such as document analysis or customer segmentation.
Citations: 0
Paired Learners for Concept Drift
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.119
Stephen H. Bach, M. Maloof
Abstract: To cope with concept drift, we paired a stable online learner with a reactive one. A stable learner predicts based on all of its experience, whereas a reactive learner predicts based on its experience over a short, recent window of time. The method of paired learning uses differences in accuracy between the two learners over this window to determine when to replace the current stable learner, since the stable learner performs worse than does the reactive learner when the target concept changes. While the method uses the reactive learner as an indicator of drift, it uses the stable learner to predict, since the stable learner performs better than does the reactive learner when acquiring a target concept. Experimental results support these assertions. We evaluated the method by making direct comparisons to dynamic weighted majority, accuracy weighted ensemble, and streaming ensemble algorithm (SEA) using two synthetic problems, the Stagger concepts and the SEA concepts, and three real-world data sets: meeting scheduling, electricity prediction, and malware detection. Results suggest that, on these problems, paired learners outperformed or performed comparably to methods more costly in time and space.
Citations: 141
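The pairing of a stable and a reactive learner described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the base learner (`MajorityLearner`), the window size, the replacement threshold, and the exact replacement rule (count the recent steps on which the stable learner was wrong while the reactive learner was right) are all hypothetical choices made for the example.

```python
from collections import Counter, deque

class MajorityLearner:
    """Toy base learner (an assumption): predicts the most frequent label seen so far."""
    def __init__(self, examples=()):
        self.counts = Counter(label for _, label in examples)

    def learn(self, x, label):
        self.counts[label] += 1

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

class PairedLearner:
    """Stable learner for prediction, reactive learner as a drift indicator."""
    def __init__(self, window=6, threshold=0.5):
        self.window = deque(maxlen=window)         # most recent (x, label) examples
        self.disagreements = deque(maxlen=window)  # 1 = stable wrong and reactive right
        self.threshold = threshold
        self.stable = MajorityLearner()
        self.replacements = 0

    def predict(self, x):
        # Predictions always come from the stable learner.
        return self.stable.predict(x)

    def update(self, x, label):
        # The reactive learner only knows the recent window.
        reactive = MajorityLearner(self.window)
        stable_wrong = self.stable.predict(x) != label
        reactive_right = reactive.predict(x) == label
        self.disagreements.append(1 if stable_wrong and reactive_right else 0)
        self.window.append((x, label))
        self.stable.learn(x, label)
        # Replace the stable learner when the reactive one has been clearly
        # better over the window, which is taken as a sign of concept drift.
        if sum(self.disagreements) / self.window.maxlen > self.threshold:
            self.stable = MajorityLearner(self.window)
            self.disagreements.clear()
            self.replacements += 1

# Hypothetical stream: the target label switches from "a" to "b" halfway through.
learner = PairedLearner(window=6, threshold=0.5)
for label in ["a"] * 20 + ["b"] * 20:
    learner.update(None, label)
print(learner.replacements, learner.predict(None))
```

On this toy stream a single stable learner trained on everything would still be tied between the two labels at the end, whereas the paired scheme swaps in a learner built from the recent window shortly after the switch.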
Space Efficient String Mining under Frequency Constraints
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.32
J. Fischer, V. Mäkinen, Niko Välimäki
Abstract: Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in the optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as item-sets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| ≪ n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
Citations: 29
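To make the problem statement in the abstract above concrete, the following is a brute-force baseline for one variant: patterns that are frequent in D1 but rare in D2. It is deliberately naive and is not the suffix-tree, suffix-array, or space-efficient method the abstract refers to; it enumerates every substring and so uses far more time and space. The function names, the thresholds, and the choice to measure frequency as the number of strings containing a pattern are assumptions for illustration.

```python
from collections import Counter

def substring_support(database):
    """Count, for every substring, how many strings in the database contain it."""
    support = Counter()
    for s in database:
        # A set ensures each string contributes at most once per substring.
        seen = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        support.update(seen)
    return support

def discriminative_patterns(d1, d2, min_support_d1, max_support_d2):
    """Return substrings found in at least min_support_d1 strings of d1
    and in at most max_support_d2 strings of d2, in sorted order."""
    s1, s2 = substring_support(d1), substring_support(d2)
    return sorted(
        p for p, count in s1.items()
        if count >= min_support_d1 and s2[p] <= max_support_d2
    )

# Hypothetical toy databases.
D1 = ["abab", "abc", "cab"]
D2 = ["bcb", "cc", "ba"]
print(discriminative_patterns(D1, D2, min_support_d1=3, max_support_d2=0))
print(discriminative_patterns(D1, D2, min_support_d1=3, max_support_d2=1))
```

With these toy inputs, "ab" occurs in all three strings of D1 and in none of D2; relaxing the D2 bound to one occurrence also admits "a".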
Nonnegative Matrix Factorization for Combinatorial Optimization: Spectral Clustering, Graph Matching, and Clique Finding
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.130
C. Ding, Tao Li, Michael I. Jordan
Abstract: Nonnegative matrix factorization (NMF) is a versatile model for data clustering. In this paper, we propose several NMF-inspired algorithms to solve different data mining problems. They include (1) multi-way normalized cut spectral clustering, (2) graph matching of both undirected and directed graphs, and (3) maximal clique finding on both graphs and bipartite graphs. Key features of these algorithms are (a) they are extremely simple to implement; and (b) they are provably convergent. We conduct experiments to demonstrate the effectiveness of these new algorithms. We also derive a new spectral bound for the size of maximal edge bicliques as a byproduct of our approach.
Citations: 139
Direct Zero-Norm Optimization for Feature Selection
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.60
Kaizhu Huang, Irwin King, Michael R. Lyu
Abstract: Zero-norm, defined as the number of non-zero elements in a vector, is an ideal quantity for feature selection. However, minimization of zero-norm is generally regarded as a combinatorially difficult optimization problem. In contrast to previous methods that usually optimize a surrogate of zero-norm, we propose a direct optimization method to achieve zero-norm for feature selection in this paper. Based on Expectation Maximization (EM), this method boils down to solving a sequence of Quadratic Programming problems and hence can be practically optimized in polynomial time. We show that the proposed optimization technique has a nice Bayesian interpretation and converges to the true zero norm asymptotically, provided that a good starting point is given. Following the scheme of our proposed zero-norm, we even show that an arbitrary-norm based Support Vector Machine can be achieved in polynomial time. A series of experiments demonstrate that our proposed EM-based zero-norm outperforms other state-of-the-art methods for feature selection on biological microarray data and UCI data, in terms of both the accuracy and the learning efficiency.
Citations: 23
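The definition of the zero-norm given in the abstract above is simple to state in code. The sketch below only illustrates the definition and one reason the quantity is awkward to optimize directly: it ignores the magnitude of the weights. The comparison with the 1-norm is an assumption about what a typical surrogate looks like; the abstract does not name the surrogates it refers to, and the weight vector is made up.

```python
def zero_norm(w, tol=0.0):
    """Number of entries of w whose magnitude exceeds tol (the zero-norm)."""
    return sum(1 for wi in w if abs(wi) > tol)

def one_norm(w):
    """Sum of absolute values, shown here as an example of a continuous surrogate."""
    return sum(abs(wi) for wi in w)

# Hypothetical weight vector of a linear model over four features.
weights = [0.0, -2.5, 0.0, 0.1]
scaled = [10 * wi for wi in weights]

# The zero-norm counts selected features and is unchanged by rescaling,
# whereas the 1-norm grows with the scale of the weights.
print(zero_norm(weights), zero_norm(scaled))
print(one_norm(weights), one_norm(scaled))
```

In a feature selection setting, the features with non-zero weight (here the second and fourth) are the ones retained.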
TOFA: Trace Oriented Feature Analysis in Text Categorization
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.67
Jun Yan, Ning Liu, Qiang Yang, Weiguo Fan, Zheng Chen
Abstract: Dimension reduction for large-scale text data has been attracting much attention lately due to the rapid growth of the World Wide Web. Dimension reduction algorithms fall into two categories: feature extraction and feature selection. An important problem remains: it has been difficult to integrate these two algorithm categories into a single framework, making it difficult to reap the benefit of both. In this paper, we formulate the two algorithm categories through a unified optimization framework. Under this framework, we develop a novel feature selection algorithm called Trace Oriented Feature Analysis (TOFA). The objective function of TOFA is a unified framework that integrates many prominent feature extraction algorithms: unsupervised Principal Component Analysis and supervised Maximum Margin Criterion, for example, are special cases of it. Thus TOFA can handle not only supervised problems but also unsupervised and semi-supervised problems. Experimental results on real text datasets demonstrate the effectiveness and efficiency of TOFA.
Citations: 2
Text Mining in Radiology Reports
2008 Eighth IEEE International Conference on Data Mining Pub Date : 2008-12-15 DOI: 10.1109/ICDM.2008.150
Tianxia Gong, Chew Lim Tan, T. Leong, C. Lee, B. Pang, C. C. Tchoyoson Lim, Qi Tian, Suisheng Tang, Zhuo Zhang
Abstract: Medical text mining has gained increasing interest in recent years. Radiology reports contain rich information describing the radiologist's observations on the patient's medical conditions in the associated medical images. However, as most reports are in free text format, the valuable information contained in those reports cannot be easily accessed and used unless proper text mining has been applied. In this paper, we propose a text mining system to extract and use the information in radiology reports. The system consists of three main modules: a medical finding extractor, a report and image retriever, and a text-assisted image feature extractor. In evaluation, the overall precision and recall for medical finding extraction are 95.5% and 87.9% respectively, and for all modifiers of the medical findings 88.2% and 82.8% respectively. The overall result of the report and image retrieval module and the text-assisted image feature extraction module is satisfactory to radiologists.
Citations: 35