{"title":"Enhancing Text Clustering Using Concept-based Mining Model","authors":"Shady Shehata, F. Karray, M. Kamel","doi":"10.1109/ICDM.2006.64","DOIUrl":"https://doi.org/10.1109/ICDM.2006.64","url":null,"abstract":"Most of text mining techniques are based on word and/or phrase analysis of the text. The statistical analysis of a term (word or phrase) frequency captures the importance of the term within a document. However, to achieve a more accurate analysis, the underlying mining technique should indicate terms that capture the semantics of the text from which the importance of a term in a sentence and in the document can be derived. A new concept-based mining model that relies on the analysis of both the sentence and the document, rather than, the traditional analysis of the document dataset only is introduced. The proposed mining model consists of a concept-based analysis of terms and a concept-based similarity measure. The term which contributes to the sentence semantics is analyzed with respect to its importance at the sentence and document levels. The model can efficiently find significant matching terms, either words or phrases, of the documents according to the semantics of the text. The similarity between documents relies on a new concept-based similarity measure which is applied to the matching terms between documents. Experiments using the proposed concept-based term analysis and similarity measure in text clustering are conducted. Experimental results demonstrate that the newly developed concept-based mining model enhances the clustering quality of sets of documents substantially.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131772239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meng Wang, Xiansheng Hua, Yan Song, Lirong Dai, HongJiang Zhang
{"title":"Semi-Supervised Kernel Regression","authors":"Meng Wang, Xiansheng Hua, Yan Song, Lirong Dai, HongJiang Zhang","doi":"10.1109/ICDM.2006.143","DOIUrl":"https://doi.org/10.1109/ICDM.2006.143","url":null,"abstract":"Insufficiency of training data is a major obstacle in machine learning and data mining applications. Many different semi-supervised learning algorithms have been proposed to tackle this difficulty by leveraging a large amount of unlabeled data. However, most of them focus on semi-supervised classification. In this paper we propose a semi-supervised regression algorithm named semi-supervised kernel regression (SSKR). While classical kernel regression is only based on labeled examples, our approach extends it to all observed examples using a weighting factor to modulate the effect of unlabeled examples. Experimental results prove that SSKR significantly outperforms traditional kernel regression and graph-based semi-supervised regression methods.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133135562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams","authors":"Jimeng Sun, S. Papadimitriou, Philip S. Yu","doi":"10.1109/ICDM.2006.169","DOIUrl":"https://doi.org/10.1109/ICDM.2006.169","url":null,"abstract":"Data stream values are often associated with multiple aspects. For example, each value from environmental sensors may have an associated type (e.g., temperature, humidity, etc) as well as location. Aside from timestamp, type and location are the two additional aspects. How to model such streams? How to simultaneously find patterns within and across the multiple aspects? How to do it incrementally in a streaming fashion? In this paper, all these problems are addressed through a general data model, tensor streams, and an effective algorithmic framework, window-based tensor analysis (WTA). Two variations of WTA, independent- window tensor analysis (IW) and moving-window tensor analysis (MW), are presented and evaluated extensively on real datasets. Finally, we illustrate one important application, multi-aspect correlation analysis (MACA), which uses WTA and we demonstrate its effectiveness on an environmental monitoring application.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115034517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching for Pattern Rules","authors":"Guichong Li, Howard J. Hamilton","doi":"10.1109/ICDM.2006.139","DOIUrl":"https://doi.org/10.1109/ICDM.2006.139","url":null,"abstract":"We address the problem of finding a set of pattern rules, from a transaction dataset given a statistical metric. A new data structure, called an incrementally counting suffix tree (ICST), is proposed for online computation of estimates of the support of any pattern or itemset. Using an ICST, our approach directly generates a set of pattern rules by a single scan of the whole dataset in partitions without the generation of frequent itemsets. Non-redundant rules can be found by removing redundancies from the pattern rules. The PPMCR algorithm first finds pattern rules and then non-redundant rules by generating valid candidates while traversing the ICST. Experimental results show that the PPMCR algorithm can be used for efficiently mining fewer non-redundant rules.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116027728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparisons of K-Anonymization and Randomization Schemes under Linking Attacks","authors":"Zhouxuan Teng, Wenliang Du","doi":"10.1109/ICDM.2006.40","DOIUrl":"https://doi.org/10.1109/ICDM.2006.40","url":null,"abstract":"Recently K-anonymity has gained popularity as a privacy quantification against linking attacks, in which attackers try to identify a record with values of some identifying attributes. If attacks succeed, the identity of the record will be revealed and potential confidential information contained in other attributes of the record will be disclosed. K-anonymity counters this attack by requiring that each record must be indistinguishable from at least K-1 other records with respect to the identifying attributes. Randomization can also be used for protection against linking attacks. In this paper, we compare the performance of K-anonymization and randomization schemes under linking attacks. We present a new privacy definition that can be applied to both k-anonymization and randomization. We compare these two schemes in terms of both utility and risks of privacy disclosure, and we promote to use R-U confidentiality map for such comparisons. We also compare various randomization schemes.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116128363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kelvin Sim, Jinyan Li, V. Gopalkrishnan, Guimei Liu
{"title":"Mining Maximal Quasi-Bicliques to Co-Cluster Stocks and Financial Ratios for Value Investment","authors":"Kelvin Sim, Jinyan Li, V. Gopalkrishnan, Guimei Liu","doi":"10.1109/ICDM.2006.111","DOIUrl":"https://doi.org/10.1109/ICDM.2006.111","url":null,"abstract":"We introduce an unsupervised process to co-cluster groups of stocks and financial ratios, so that investors can gain more insight on how they are correlated. Our idea for the co-clustering is based on a graph concept called maximal quasi-bicliques, which can tolerate erroneous or/and missing information that are common in the stock and financial ratio data. Compared to previous works, our maximal quasi-bicliques require the errors to be evenly distributed, which enable us to capture more meaningful co-clusters. We develop a new algorithm that can efficiently enumerate maximal quasi-bicliques from an undirected graph. The concept of maximal quasi-bicliques is domain-independent; it can be extended to perform co-clustering on any set of data that are modeled by graphs.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122759454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jun Yan, Ning Liu, Benyu Zhang, Qiang Yang, Shuicheng Yan, Zheng Chen
{"title":"A Novel Scalable Algorithm for Supervised Subspace Learning","authors":"Jun Yan, Ning Liu, Benyu Zhang, Qiang Yang, Shuicheng Yan, Zheng Chen","doi":"10.1109/ICDM.2006.7","DOIUrl":"https://doi.org/10.1109/ICDM.2006.7","url":null,"abstract":"Subspace learning approaches aim to discover important statistical distribution on lower dimensions for high dimensional data. Methods such as principal component analysis (PCA) do not make use of the class information, and linear discriminant analysis (LDA) could not be performed efficiently in a scalable way. In this paper, we propose a novel highly scalable supervised subspace learning algorithm called as supervised Kampong measure (SKM). It assigns data points as close as possible to their corresponding class mean, simultaneously assigns data points to be as far as possible from the other class means in the transformed lower dimensional subspace. Theoretical derivation shows that our algorithm is not limited by the number of classes or the singularity problem faced by LDA. Furthermore, our algorithm can be executed in an incremental manner in which learning is done in an online fashion as data streams are received. Experimental results on several datasets, including a very large text data set RCV1, show the outstanding performance of our proposed algorithm on classification problems as compared to PCA, LDA and a popular feature selection approach, information gain (IG).","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122893849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Corrective Classification: Classifier Ensembling with Corrective and Diverse Base Learners","authors":"Yan Zhang, Xingquan Zhu, Xindong Wu","doi":"10.1109/ICDM.2006.45","DOIUrl":"https://doi.org/10.1109/ICDM.2006.45","url":null,"abstract":"Empirical studies on supervised learning have shown that ensembling methods lead to a model superior to the one built from a single learner under many circumstances especially when learning from imperfect, such as biased or noise infected, information sources. In this paper, we provide a novel corrective classification (C2) design, which incorporates error detection, data cleansing and Bootstrap sampling to construct base learners that constitute the classifier ensemble. The essential goal is to reduce noise impacts and eventually enhance the learners built from noise corrupted data. We further analyze the importance of both the accuracy and diversity of base learners in ensembling, in order to shed some light on the mechanism under which C2 works. Experimental comparisons will demonstrate that C2 is not only superior to the learner built from the original noisy sources, but also more reliable than bagging or the aggressive classifier ensemble (ACE), which are two degenerate components/variants of C2.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125157118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deploying Approaches for Pattern Refinement in Text Mining","authors":"Sheng-Tang Wu, Yuefeng Li, Yue Xu","doi":"10.1109/ICDM.2006.50","DOIUrl":"https://doi.org/10.1109/ICDM.2006.50","url":null,"abstract":"Text mining is the technique that helps users find useful information from a large amount of digital text documents on the Web or databases. Instead of the keyword-based approach which is typically used in this field, the pattern-based model containing frequent sequential patterns is employed to perform the same concept of tasks. However, how to effectively use these discovered patterns is still a big challenge. In this study, we propose two approaches based on the use of pattern deploying strategies. The performance of the pattern deploying algorithms for text mining is investigated on the Reuters dataset RCVI and the results show that the effectiveness is improved by using our proposed pattern refinement approaches.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128261556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constructing Ensembles for Better Ranking","authors":"Jin Huang, C. Ling","doi":"10.1109/ICDM.2006.42","DOIUrl":"https://doi.org/10.1109/ICDM.2006.42","url":null,"abstract":"We propose a novel algorithm, RankDE, to build an ensemble using an extra artificial dataset. RankDE aims at improving the overall ranking performance, which is crucial in many machine learning applications. This algorithm constructs artificial datasets that are diverse with the current training dataset in terms of ranking. We conduct experiments with real-world data sets to compare RankDE with some traditional and state-of-the-art ensembling algorithms of Bagging, Adaboost, DECORATE and Rankboost in terms of ranking. The experiments show that RankDE outperforms Bagging, DECORATE, Adaboost, and Rankboost when limited data is available. When enough training data is available, it is competitive with DECORATE and Adaboost.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121673506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}