Seventh IEEE International Conference on Data Mining (ICDM 2007): Latest Publications

Succinct Matrix Approximation and Efficient k-NN Classification
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.41
Rong Liu, Yong Shi
Abstract: This work reveals that, in place of the polynomial bounds in the previous literature, there exists a sharper bound of exponential form for the L2 norm of an arbitrarily shaped random matrix. Based on the newly elaborated bound, a nonuniform sampling method is presented to succinctly approximate a matrix with a sparse binary one, relieving the computational load of the k-NN classifier in both time and storage. The method is also pass-efficient because sampling and quantizing are combined in a single step, and the whole process can be completed within one pass over the input matrix. In evaluations of compression ratio and reconstruction error, the sampling method provides succinct and tight approximations of the input matrices. The most significant finding in the classification experiment is that a k-NN classifier based on the approximation can even outperform the standard one, providing further evidence that the method is especially capable of capturing intrinsic characteristics of the data.
Citations: 0
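The sampling-and-quantizing idea can be illustrated with a toy sketch (an illustrative stand-in, not the paper's actual estimator or bound): keep each entry with probability proportional to its magnitude and quantize the survivors to ±1, producing a sparse binary approximation in a single pass.

```python
import numpy as np

def sparsify_binary(A, keep_fraction=0.3, seed=0):
    """Toy nonuniform sampling: retain each entry with probability
    proportional to |A_ij|, quantizing survivors to sign(A_ij).
    Illustrative only -- not the paper's scheme."""
    rng = np.random.default_rng(seed)
    mags = np.abs(A)
    # Scale probabilities so the expected number of kept entries
    # is roughly keep_fraction * A.size (clipped at 1).
    p = np.minimum(1.0, keep_fraction * A.size * mags / mags.sum())
    mask = rng.random(A.shape) < p
    return np.where(mask, np.sign(A), 0.0)

A = np.random.default_rng(1).normal(size=(50, 20))
S = sparsify_binary(A)
sparsity = (S == 0).mean()  # fraction of zeroed entries
```

A k-NN classifier could then operate on `S` instead of `A`, trading a controlled amount of reconstruction error for much cheaper storage and distance computations.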
Document Transformation for Multi-label Feature Selection in Text Categorization
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.18
Weizhu Chen, Jun Yan, Benyu Zhang, Zheng Chen, Qiang Yang
Abstract: Feature selection on multi-label documents for automatic text categorization is an under-explored research area. This paper presents a systematic document transformation framework in which multi-label documents are transformed into single-label documents before standard feature selection algorithms are applied, thereby solving the multi-label feature selection problem. Under this framework, we undertake a comparative study of four intuitive document transformation approaches and propose a novel approach called entropy-based label assignment (ELA), which assigns label weights to a multi-label document based on label entropy. Three standard feature selection algorithms are used to evaluate the document transformation approaches and verify their impact on multi-class text categorization. Using an SVM classifier and two multi-label benchmark text collections, we show that the choice of document transformation approach can significantly influence multi-class categorization performance, and that our proposed approach ELA outperforms all the other approaches.
Citations: 127
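As a rough illustration of the ELA idea (the exact weighting formula below is an assumption, not taken from the paper), one might weight each of a document's labels by that label's corpus-level binary entropy and normalize per document:

```python
import math
from collections import Counter

def ela_weights(doc_labels, corpus_label_sets):
    """Hypothetical sketch of entropy-based label assignment (ELA):
    weight each label of a multi-label document by that label's
    binary entropy over the corpus, then normalize per document.
    The paper's actual weighting may differ."""
    n = len(corpus_label_sets)
    freq = Counter(l for s in corpus_label_sets for l in set(s))

    def h(p):  # binary entropy of a label's occurrence probability
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    raw = {l: h(freq[l] / n) for l in doc_labels}
    z = sum(raw.values()) or 1.0
    return {l: w / z for l, w in raw.items()}

corpus = [{"sports"}, {"sports", "politics"}, {"tech"}, {"politics"}]
w = ela_weights({"sports", "politics"}, corpus)
```

The resulting weights turn one multi-label document into several weighted single-label instances, after which any standard feature selection algorithm applies unchanged.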
How Much Noise Is Too Much: A Study in Automatic Text Classification
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.21
Sumeet Agarwal, S. Godbole, Diwakar Punjani, Shourya Roy
Abstract: Noise is a stark reality in real-life data. It has a particularly significant impact in text analytics, where data cleaning forms a very large part of the data processing cycle. Noisy unstructured text is common in informal settings such as online chat, SMS, email, newsgroups, and blogs, in text automatically transcribed from speech, and in text automatically recognized from printed or handwritten material. Gigabytes of such data are generated every day on the Internet, in contact centers, and on mobile phones. Researchers have studied various text mining issues for noisy text, including pre-processing and cleaning, information extraction, rule learning, and classification. This paper focuses on the issues automatic text classifiers face when analyzing noisy documents from various sources. Its goal is to bring out and study the effect of different kinds of noise on automatic text classification: does the nature of such text warrant moving beyond traditional classification techniques? We present detailed experimental results with simulated noise on the Reuters-21578 and 20-newsgroups benchmark datasets, along with results on real-life noisy datasets from various CRM domains.
Citations: 99
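Simulated noise of the kind studied here can be generated with a simple character-substitution routine (a generic sketch; the paper's actual noise models may differ):

```python
import random
import string

def add_noise(text, rate, seed=42):
    """Simulate spelling noise: with probability `rate`, replace each
    alphabetic character with a random lowercase letter. A simple
    stand-in for simulated noise in text classification experiments."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)

clean = "interest rates rose sharply"
noisy = add_noise(clean, rate=0.2)
```

Sweeping `rate` from 0 upward and measuring classifier accuracy at each level is one way to answer the paper's titular question empirically.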
A Computational Approach to Style in American Poetry
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.76
D. M. Kaplan, D. Blei
Abstract: We develop a quantitative method to assess the style of American poems and to visualize a collection of poems in relation to one another. Qualitative poetry criticism guided our development of metrics that analyze various orthographic, syntactic, and phonemic features. These features are used to discover comprehensive stylistic information from a poem's multi-layered latent structure and to compute distances between poems in this space. Visualizations provide ready access to the analytical components. We demonstrate the method on several collections of poetry, showing that it delineates poetic style better than the traditional word-occurrence features used in typical text analysis algorithms. The method has potential applications to academic research on texts, to research on intuitive personal responses to poetry, and to recommending poems to readers based on their favorites.
Citations: 52
Preserving Privacy through Data Generation
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.25
Jilles Vreeken, M. Leeuwen, A. Siebes
Abstract: Many databases will not or cannot be disclosed without strong guarantees that no sensitive information can be extracted. Several data perturbation techniques have been proposed to address this concern. However, it has been shown that either sensitive information can still be extracted from the perturbed data with little prior knowledge, or many patterns are lost. In this paper we show that generating new data is an inherently safer alternative. We present a data generator based on the models obtained by the MDL-based KRIMP algorithm (Siebes et al., 2006). These models are accurate representations of the data distribution and can thus be used to generate data with the same characteristics as the original. Experimental results show very high pattern similarity between the generated and the original data, ensuring that viable conclusions can be drawn from the anonymized data. Furthermore, anonymity is guaranteed for suitable databases, and the quality-privacy trade-off can be balanced explicitly.
Citations: 38
estMax: Tracing Maximal Frequent Itemsets over Online Data Streams
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.70
Ho Jin Woo, W. Lee
Abstract: In general, the number of frequent itemsets in a data set is very large, so closed or maximal frequent itemsets (MFIs) are used as a more compact representation. The characteristics of a data stream, however, make this task more difficult. This paper proposes a method called estMax that can trace the set of MFIs over a data stream. The proposed method maintains the set of frequent itemsets in a prefix tree and extracts all MFIs without any additional superset/subset checking. Upon processing a newly generated transaction, its longest matched frequent itemsets are marked in the prefix tree as candidate MFIs; at the same time, any subset of these newly marked itemsets that was already marked as a candidate MFI is cleared. With this additional step, the set of MFIs can be extracted at any moment. The performance of the proposed method is analyzed through a series of experiments to identify its various characteristics.
Citations: 18
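The marking step can be sketched in a few lines (simplified: this version uses explicit subset checks, which estMax's prefix tree is specifically designed to avoid):

```python
def update_mfi_candidates(candidates, frequent_itemsets, transaction):
    """Illustrative sketch of estMax's marking step, not the actual
    prefix-tree algorithm: on a new transaction, its maximal matched
    frequent itemsets become candidate MFIs, and previously marked
    candidates subsumed by a new candidate are cleared."""
    t = frozenset(transaction)
    matched = [s for s in frequent_itemsets if s <= t]
    # The longest (maximal) matched frequent itemsets become candidates.
    maximal = {s for s in matched if not any(s < o for o in matched)}
    # Clear any previously marked candidate that is a proper subset
    # of a newly marked candidate.
    candidates = {c for c in candidates if not any(c < m for m in maximal)}
    return candidates | maximal

freq = [frozenset("a"), frozenset("ab"), frozenset("c")]
mfis = update_mfi_candidates({frozenset("a")}, freq, "ab")
```

After the transaction {a, b} arrives, the old candidate {a} is subsumed by the newly marked {a, b}, so the candidate set shrinks to the maximal itemset alone.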
Detecting Fractures in Classifier Performance
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.106
David A. Cieslak, N. Chawla
Abstract: A fundamental tenet assumed by many classification algorithms is that training and testing samples are drawn from the same distribution of data: the stationary distribution assumption, which entails that the past is strongly indicative of the future. In real-world applications, however, many factors may alter, both significantly and subtly, the one true model responsible for generating the data distribution. When the stationary distribution assumption is violated, traditional validation schemes such as ten-fold cross-validation and hold-out become poor performance predictors and classifier rankers. It therefore becomes critical to discover the fracture points in classifier performance by discovering the divergence between populations. In this paper, we implement a comprehensive evaluation framework to identify bias, enabling selection of a "correct" classifier given the sample bias. To evaluate classifier performance under biased distributions thoroughly, we consider three scenarios: missing completely at random (akin to stationarity), missing at random, and missing not at random; the last reflects the canonical sample selection bias problem.
Citations: 18
Consensus Clusterings
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.73
Nam Nguyen, R. Caruana
Abstract: In this paper we address the problem of combining multiple clusterings without access to the underlying features of the data. This process is known in the literature as clustering ensembles, clustering aggregation, or consensus clustering. Consensus clustering yields a stable and robust final clustering that is in agreement with multiple clusterings. We find that an iterative EM-like method is remarkably effective for this problem, and we present an iterative algorithm and its variations for finding a clustering consensus. An extensive empirical study compares our proposed algorithms with eleven other consensus clustering methods on four data sets using three different clustering performance metrics. The experimental results show that the new ensemble clustering methods produce clusterings that are as good as, and often better than, these other methods.
Citations: 255
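One common EM-like consensus scheme, which may well differ from the authors' exact algorithm, is to one-hot encode the base clusterings and run k-means on the concatenated encoding, so that points which co-cluster in many base clusterings end up grouped together:

```python
import numpy as np

def consensus_clustering(labelings, k, iters=20, seed=0):
    """Sketch of an EM-like consensus step (illustrative, not
    necessarily the paper's algorithm): one-hot encode each base
    clustering, concatenate, and run plain k-means on the result."""
    rng = np.random.default_rng(seed)
    L = np.asarray(labelings)                 # (n_clusterings, n_points)
    n = L.shape[1]
    # One-hot encode every base clustering and concatenate features.
    X = np.concatenate([np.eye(l.max() + 1)[l] for l in L], axis=1)
    centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # E-step: assign each point to its nearest center.
        assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # M-step: recompute centers from current assignments.
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

# Two base clusterings that agree up to a label permutation.
base = [np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])]
cons = consensus_clustering(base, k=2)
```

Because distances are computed in the concatenated one-hot space, the consensus is invariant to how each base clustering happens to number its clusters.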
On Appropriate Assumptions to Mine Data Streams: Analysis and Practice
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.96
Jing Gao, W. Fan, Jiawei Han
Abstract: Recent years have witnessed an increasing number of studies in stream mining, which aim at building an accurate model for continuously arriving data. Most existing work implicitly assumes that the training data and the yet-to-come testing data are always sampled from the "same distribution", and that this "same distribution" evolves over time. We demonstrate that this may not be true, and that one may never know either "how" or "when" the distribution changes. Thus, a model that fits the observed distribution well can have unsatisfactory accuracy on the incoming data. Practically, one can assume only the bare minimum: that learning from observed data is better than both random guessing and always predicting exactly the same class label. Importantly, we formally and experimentally demonstrate the robustness of a model-averaging and simple voting-based framework for data streams, particularly when incoming data continuously follow significantly different distributions. On real streaming data, this framework reduces the expected error of baseline models by 60% and remains the most accurate compared to those baseline models.
Citations: 176
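The simple voting framework can be sketched as unweighted majority voting over models trained on earlier stream chunks (a generic sketch, not the paper's precise setup):

```python
import numpy as np

def ensemble_vote(models, X):
    """Unweighted majority vote over models trained on earlier chunks
    of a stream -- the kind of averaging framework argued to be robust
    when the data distribution keeps shifting."""
    votes = np.stack([m(X) for m in models])   # (n_models, n_samples)
    # Majority vote per sample.
    return np.array([np.bincount(votes[:, i]).argmax()
                     for i in range(votes.shape[1])])

# Three hypothetical chunk models: two mostly agree, one is stale.
m1 = lambda X: np.array([1, 0, 1])
m2 = lambda X: np.array([1, 0, 0])
m3 = lambda X: np.array([1, 1, 0])
pred = ensemble_vote([m1, m2, m3], None)
```

No single chunk model is trusted to track the current distribution; the vote hedges across them, which is exactly the "bare minimum" assumption the abstract advocates.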
Predicting Blogging Behavior Using Temporal and Social Networks
Seventh IEEE International Conference on Data Mining (ICDM 2007) Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.97
Bi Chen, Qiankun Zhao, Bingjun Sun, P. Mitra
Abstract: Modeling the behavior of bloggers is an important problem with applications in recommender systems, targeted advertising, and event detection. In this paper, we propose three models combining content, temporal, and social dimensions: the general blogging-behavior model, the profile-based blogging-behavior model, and the social-network and profile-based blogging-behavior model. The models are based on two regression techniques: the Extreme Learning Machine (ELM) and the Modified General Regression Neural Network (MGRNN). For our empirical evaluation we choose one of the largest blogs, the political blog DailyKos. Experiments show that the social-network and profile-based blogging-behavior model with the ELM regression technique produces good results for the most active bloggers and can be used to predict blogging behavior.
Citations: 16
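The generic ELM recipe (a random, fixed hidden layer followed by a least-squares output layer) can be sketched as below; the features and targets here are synthetic placeholders, and the paper's actual configuration surely differs:

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, seed=0):
    """Minimal Extreme Learning Machine regression: random fixed
    hidden weights, tanh activations, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # hidden weights, never trained
    b = rng.normal(size=n_hidden)                # hidden biases
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None) # closed-form output layer
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Synthetic regression task standing in for blogger-activity features.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
W, b, beta = elm_fit(X, y)
mse = float(np.mean((elm_predict(X, W, b, beta) - y) ** 2))
```

Because only the output layer is solved (in closed form), ELM training is a single least-squares problem, which is what makes it attractive for fitting many per-blogger models cheaply.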