Seventh IEEE International Conference on Data Mining (ICDM 2007): Latest Publications

Succinct Matrix Approximation and Efficient k-NN Classification
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.41
Rong Liu, Yong Shi
Abstract: This work reveals that, in place of the polynomial bounds in the previous literature, there exists a sharper bound of exponential form for the L2 norm of an arbitrarily shaped random matrix. Based on the newly elaborated bound, a nonuniform sampling method is presented to succinctly approximate a matrix with a sparse binary one, relieving the computational load of the k-NN classifier in both time and storage. The method is also pass-efficient because sampling and quantizing are combined in a single step, and the whole process can be completed within one pass over the input matrix. In evaluations of compression ratio and reconstruction error, the sampling method provides succinct and tight approximations of the input matrices. The most significant finding in the classification experiment is that a k-NN classifier based on the approximation can even outperform the standard one, providing further evidence that the method is especially capable of capturing intrinsic characteristics of the data.
Citations: 0
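The sampling-and-quantizing idea can be illustrated with a toy sketch (an illustrative stand-in, not the paper's actual estimator or bound): keep each entry with probability proportional to its magnitude and quantize the survivors to ±1, producing a sparse binary approximation in a single pass.

```python
import numpy as np

def sparsify_binary(A, keep_fraction=0.3, seed=0):
    """Toy nonuniform sampling: retain each entry with probability
    proportional to |A_ij|, quantizing survivors to sign(A_ij).
    Illustrative only -- not the paper's scheme."""
    rng = np.random.default_rng(seed)
    mags = np.abs(A)
    # Scale probabilities so the expected number of kept entries
    # is roughly keep_fraction * A.size (clipped at 1).
    p = np.minimum(1.0, keep_fraction * A.size * mags / mags.sum())
    mask = rng.random(A.shape) < p
    return np.where(mask, np.sign(A), 0.0)

A = np.random.default_rng(1).normal(size=(50, 20))
S = sparsify_binary(A)
sparsity = (S == 0).mean()  # fraction of zeroed entries
```

A k-NN classifier could then operate on `S` instead of `A`, trading a controlled amount of reconstruction error for much cheaper storage and distance computations.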
Document Transformation for Multi-label Feature Selection in Text Categorization
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.18
Weizhu Chen, Jun Yan, Benyu Zhang, Zheng Chen, Qiang Yang
Abstract: Feature selection on multi-label documents for automatic text categorization is an under-explored research area. This paper presents a systematic document transformation framework in which multi-label documents are transformed into single-label documents before standard feature selection algorithms are applied, thereby solving the multi-label feature selection problem. Under this framework, we undertake a comparative study of four intuitive document transformation approaches and propose a novel approach called entropy-based label assignment (ELA), which assigns label weights to a multi-label document based on label entropy. Three standard feature selection algorithms are used to evaluate the document transformation approaches and verify their impact on multi-class text categorization. Using an SVM classifier and two multi-label benchmark text collections, we show that the choice of document transformation approach can significantly influence multi-class categorization performance, and that our proposed approach ELA outperforms all the other approaches.
Citations: 127
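As a rough illustration of the ELA idea (the exact weighting formula below is an assumption, not taken from the paper), one might weight each of a document's labels by that label's corpus-level binary entropy and normalize per document:

```python
import math
from collections import Counter

def ela_weights(doc_labels, corpus_label_sets):
    """Hypothetical sketch of entropy-based label assignment (ELA):
    weight each label of a multi-label document by that label's
    binary entropy over the corpus, then normalize per document.
    The paper's actual weighting may differ."""
    n = len(corpus_label_sets)
    freq = Counter(l for s in corpus_label_sets for l in set(s))

    def h(p):  # binary entropy of a label's occurrence probability
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    raw = {l: h(freq[l] / n) for l in doc_labels}
    z = sum(raw.values()) or 1.0
    return {l: w / z for l, w in raw.items()}

corpus = [{"sports"}, {"sports", "politics"}, {"tech"}, {"politics"}]
w = ela_weights({"sports", "politics"}, corpus)
```

The resulting weights turn one multi-label document into several weighted single-label instances, after which any standard feature selection algorithm applies unchanged.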
How Much Noise Is Too Much: A Study in Automatic Text Classification
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.21
Sumeet Agarwal, S. Godbole, Diwakar Punjani, Shourya Roy
Abstract: Noise is a stark reality in real-life data. It has a particularly significant impact in text analytics, where data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as online chat, SMS, email, newsgroups, and blogs, in text automatically transcribed from speech, and in text automatically recognized from printed or handwritten material. Gigabytes of such data are generated every day on the Internet, in contact centers, and on mobile phones. Researchers have studied various text mining issues for noisy text, including pre-processing and cleaning, information extraction, rule learning, and classification. This paper focuses on the issues automatic text classifiers face when analyzing noisy documents from various sources. Its goal is to bring out and study the effect of different kinds of noise on automatic text classification: does the nature of such text warrant moving beyond traditional classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets, along with results on real-life noisy datasets from various CRM domains.
Citations: 99
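Simulated noise of the kind studied here can be generated with a simple character-substitution routine (a generic sketch; the paper's actual noise models may differ):

```python
import random
import string

def add_noise(text, rate, seed=42):
    """Simulate spelling noise: with probability `rate`, replace each
    alphabetic character with a random lowercase letter. A simple
    stand-in for simulated noise in text classification experiments."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)

clean = "interest rates rose sharply"
noisy = add_noise(clean, rate=0.2)
```

Sweeping `rate` from 0 upward and measuring classifier accuracy at each level is one way to answer the paper's titular question empirically.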
A Computational Approach to Style in American Poetry
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.76
D. M. Kaplan, D. Blei
Abstract: We develop a quantitative method to assess the style of American poems and to visualize a collection of poems in relation to one another. Qualitative poetry criticism guided our development of metrics that analyze various orthographic, syntactic, and phonemic features. These features are used to discover comprehensive stylistic information from a poem's multi-layered latent structure and to compute distances between poems in this space. Visualizations provide ready access to the analytical components. We demonstrate the method on several collections of poetry, showing that it delineates poetic style better than the traditional word-occurrence features used in typical text analysis algorithms. The method has potential applications to academic research on texts, to research on intuitive personal responses to poetry, and to recommending poems to readers based on their favorites.
Citations: 52
Preserving Privacy through Data Generation
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.25
Jilles Vreeken, M. Leeuwen, A. Siebes
Abstract: Many databases will not or cannot be disclosed without strong guarantees that no sensitive information can be extracted. Several data perturbation techniques have been proposed to address this concern. However, it has been shown that either sensitive information can still be extracted from the perturbed data with little prior knowledge, or many patterns are lost. In this paper we show that generating new data is an inherently safer alternative. We present a data generator based on the models obtained by the MDL-based KRIMP algorithm (Siebes et al., 2006). These models are accurate representations of the data distribution and can thus be used to generate data with the same characteristics as the original. Experimental results show very high pattern similarity between the generated and the original data, ensuring that viable conclusions can be drawn from the anonymized data. Furthermore, anonymity is guaranteed for suitable databases, and the quality-privacy trade-off can be balanced explicitly.
Citations: 38
estMax: Tracing Maximal Frequent Itemsets over Online Data Streams
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.70
Ho Jin Woo, W. Lee
Abstract: In general, the number of frequent itemsets in a data set is very large, so closed or maximal frequent itemsets (MFIs) are used as a more compact representation. The characteristics of a data stream, however, make this task more difficult. This paper proposes a method called estMax that can trace the set of MFIs over a data stream. The proposed method maintains the set of frequent itemsets in a prefix tree and extracts all MFIs without any additional superset/subset checking. Upon processing a newly generated transaction, its longest matched frequent itemsets are marked in the prefix tree as candidate MFIs; at the same time, any subset of these newly marked itemsets that was already marked as a candidate MFI is cleared. With this additional step, the set of MFIs can be extracted at any moment. The performance of the proposed method is analyzed through a series of experiments to identify its various characteristics.
Citations: 18
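The marking step can be sketched in a few lines (simplified: this version uses explicit subset checks, which estMax's prefix tree is specifically designed to avoid):

```python
def update_mfi_candidates(candidates, frequent_itemsets, transaction):
    """Illustrative sketch of estMax's marking step, not the actual
    prefix-tree algorithm: on a new transaction, its maximal matched
    frequent itemsets become candidate MFIs, and previously marked
    candidates subsumed by a new candidate are cleared."""
    t = frozenset(transaction)
    matched = [s for s in frequent_itemsets if s <= t]
    # The longest (maximal) matched frequent itemsets become candidates.
    maximal = {s for s in matched if not any(s < o for o in matched)}
    # Clear any previously marked candidate that is a proper subset
    # of a newly marked candidate.
    candidates = {c for c in candidates if not any(c < m for m in maximal)}
    return candidates | maximal

freq = [frozenset("a"), frozenset("ab"), frozenset("c")]
mfis = update_mfi_candidates({frozenset("a")}, freq, "ab")
```

After the transaction {a, b} arrives, the old candidate {a} is subsumed by the newly marked {a, b}, so the candidate set shrinks to the maximal itemset alone.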
Detecting Fractures in Classifier Performance
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.106
David A. Cieslak, N. Chawla
Abstract: A fundamental tenet assumed by many classification algorithms is that training and testing samples are drawn from the same distribution of data: the stationary distribution assumption, which entails that the past is strongly indicative of the future. In real-world applications, however, many factors may alter, both significantly and subtly, the one true model responsible for generating the data distribution. When the stationary distribution assumption is violated, traditional validation schemes such as ten-fold cross-validation and hold-out become poor performance predictors and classifier rankers. It therefore becomes critical to discover the fracture points in classifier performance by discovering the divergence between populations. In this paper, we implement a comprehensive evaluation framework to identify bias, enabling selection of a "correct" classifier given the sample bias. To evaluate classifier performance under biased distributions thoroughly, we consider three scenarios: missing completely at random (akin to stationarity), missing at random, and missing not at random; the last reflects the canonical sample selection bias problem.
Citations: 18
Consensus Clusterings
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.73
Nam Nguyen, R. Caruana
Abstract: In this paper we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final clustering that is in agreement with multiple clusterings. We find that an iterative EM-like method is remarkably effective for this problem, and we present an iterative algorithm and its variations for finding a clustering consensus. An extensive empirical study compares our proposed algorithms with eleven other consensus clustering methods on four data sets using three different clustering performance metrics. The experimental results show that the new ensemble clustering methods produce clusterings that are as good as, and often better than, these other methods.
Citations: 255
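One common EM-like consensus scheme, which may well differ from the authors' exact algorithm, is to one-hot encode the base clusterings and run k-means on the concatenated encoding, so that points which co-cluster in many base clusterings end up grouped together:

```python
import numpy as np

def consensus_clustering(labelings, k, iters=20, seed=0):
    """Sketch of an EM-like consensus step (illustrative, not
    necessarily the paper's algorithm): one-hot encode each base
    clustering, concatenate, and run plain k-means on the result."""
    rng = np.random.default_rng(seed)
    L = np.asarray(labelings)                 # (n_clusterings, n_points)
    n = L.shape[1]
    # One-hot encode every base clustering and concatenate features.
    X = np.concatenate([np.eye(l.max() + 1)[l] for l in L], axis=1)
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # E-step: assign each point to its nearest center.
        assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # M-step: recompute centers from current assignments.
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

# Two base clusterings that agree up to a label permutation.
base = [np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])]
cons = consensus_clustering(base, k=2)
```

Because distances are computed in the concatenated one-hot space, the consensus is invariant to how each base clustering happens to number its clusters.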
On Appropriate Assumptions to Mine Data Streams: Analysis and Practice
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.96
Jing Gao, W. Fan, Jiawei Han
Abstract: Recent years have witnessed an increasing number of studies in stream mining, which aim at building an accurate model for continuously arriving data. Most existing work implicitly assumes that the training data and the yet-to-come testing data are always sampled from the "same distribution", and that this "same distribution" evolves over time. We demonstrate that this may not be true, and that one may never know either "how" or "when" the distribution changes. Thus, a model that fits the observed distribution well can have unsatisfactory accuracy on the incoming data. Practically, one can assume only the bare minimum: that learning from observed data is better than both random guessing and always predicting exactly the same class label. Importantly, we formally and experimentally demonstrate the robustness of a model-averaging and simple voting-based framework for data streams, particularly when incoming data continuously follow significantly different distributions. On real streaming data, this framework reduces the expected error of baseline models by 60% and remains the most accurate compared to those baseline models.
Citations: 176
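The simple voting framework can be sketched as unweighted majority voting over models trained on earlier stream chunks (a generic sketch, not the paper's precise setup):

```python
import numpy as np

def ensemble_vote(models, X):
    """Unweighted majority vote over models trained on earlier chunks
    of a stream -- the kind of averaging framework argued to be robust
    when the data distribution keeps shifting."""
    votes = np.stack([m(X) for m in models])   # (n_models, n_samples)
    # Majority vote per sample.
    return np.array([np.bincount(votes[:, i]).argmax()
                     for i in range(votes.shape[1])])

# Three hypothetical chunk models: two mostly agree, one is stale.
m1 = lambda X: np.array([1, 0, 1])
m2 = lambda X: np.array([1, 0, 0])
m3 = lambda X: np.array([1, 1, 0])
pred = ensemble_vote([m1, m2, m3], None)
```

No single chunk model is trusted to track the current distribution; the vote hedges across them, which is exactly the "bare minimum" assumption the abstract advocates.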
Predicting Blogging Behavior Using Temporal and Social Networks
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.97
Bi Chen, Qiankun Zhao, Bingjun Sun, P. Mitra
Abstract: Modeling the behavior of bloggers is an important problem with applications in recommender systems, targeted advertising, and event detection. In this paper, we propose three models combining content, temporal, and social dimensions: the general blogging-behavior model, the profile-based blogging-behavior model, and the social-network and profile-based blogging-behavior model. The models are based on two regression techniques: the Extreme Learning Machine (ELM) and the Modified General Regression Neural Network (MGRNN). For our empirical evaluation we choose one of the largest blogs, the political blog DailyKos. Experiments show that the social-network and profile-based blogging-behavior model with the ELM regression technique produces good results for the most active bloggers and can be used to predict blogging behavior.
Citations: 16
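The generic ELM recipe (a random, fixed hidden layer followed by a least-squares output layer) can be sketched as below; the features and targets here are synthetic placeholders, and the paper's actual configuration surely differs:

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, seed=0):
    """Minimal Extreme Learning Machine regression: random fixed
    hidden weights, tanh activations, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # hidden weights, never trained
    b = rng.normal(size=n_hidden)                # hidden biases
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None) # closed-form output layer
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Synthetic regression task standing in for blogger-activity features.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
W, b, beta = elm_fit(X, y)
mse = float(np.mean((elm_predict(X, W, b, beta) - y) ** 2))
```

Because only the output layer is solved (in closed form), ELM training is a single least-squares problem, which is what makes it attractive for fitting many per-blogger models cheaply.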