{"title":"Learning Bayesian Networks: A MAP Criterion for Joint Selection of Model Structure and Parameter","authors":"C. Riggelsen","doi":"10.1109/ICDM.2008.14","DOIUrl":"https://doi.org/10.1109/ICDM.2008.14","url":null,"abstract":"For learning Bayesian Network (BN) structures, it has become common practice to use the Bayesian Dirichlet (BD) scoring criterion. In contrast to most other scoring metrics that functionally can be interpreted as regularized maximum likelihood criteria, the BD metric cannot be considered as such. The functional dissimilarity of the BD metric compared to other metrics is an obstacle from an analytical point of view; this is for instance becomes clear in the context of the structural EM algorithm for learning BNs from incomplete data. Also, it is not easy to pin-point why exactly and to what extend regularization is taken care of by applying the BD metric. We introduce a Bayesian scoring criterion that is closely related to the BD metric, but solves the obvious disadvantages of the BD metric. We arrive at this result by using the same basic assumptions as for the BD metric, but in contrast to the BD metric, where focus is on learning the model structure only, we aim at learning the most probable BN pair jointly, i.e., model structure and the parameter are selected as a pair. This approach yields a scoring metric that has the functional form of a regularized maximum likelihood metric. We perform experiments, and show that this MAP BN metric also yields better results than the BIC and BD metrics on independent test data.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128079522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction of Skin Penetration Using Machine Learning Methods","authors":"Yi Sun, G. Moss, Maria Prapopoulou, R. Adams, Marc B. Brown, N. Davey","doi":"10.1109/ICDM.2008.97","DOIUrl":"https://doi.org/10.1109/ICDM.2008.97","url":null,"abstract":"Improving predictions of the skin permeability coefficient is a difficult problem. It is also an important issue with the increasing use of skin patches as a means of drug delivery. In this work, we apply K-nearest-neighbour regression, single layer networks, mixture of experts and Gaussian processes to predict the permeability coefficient. We obtain a considerable improvement over the quantitative structure-activity relationship (QSARs) predictors. We show that using five features, which are molecular weight, solubility parameter, lipophilicity, the number of hydrogen bonding acceptor and donor groups, can produce better predictions than the one using only lipophilicity and the molecular weight. The Gaussian process regression with five compound features gives the best performance in this work.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132936094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data","authors":"M. Masud, Jing Gao, L. Khan, Jiawei Han, B. Thuraisingham","doi":"10.1109/ICDM.2008.152","DOIUrl":"https://doi.org/10.1109/ICDM.2008.152","url":null,"abstract":"Recent approaches in classifying evolving data streams are based on supervised learning algorithms, which can be trained with labeled data only. Manual labeling of data is both costly and time consuming. Therefore, in a real streaming environment, where huge volumes of data appear at a high speed, labeled data may be very scarce. Thus, only a limited amount of training data may be available for building the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by building a classification model from a training set having both unlabeled and a small amount of labeled instances. This model is built as micro-clusters using semi-supervised clustering technique and classification is performed with kappa-nearest neighbor algorithm. An ensemble of these models is used to classify the unlabeled data. Empirical evaluation on both synthetic data and real botnet traffic reveals that our approach, using only a small amount of labeled data for training, outperforms state-of-the-art stream classification algorithms that use twenty times more labeled data than our approach.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122865404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collective Latent Dirichlet Allocation","authors":"Zhiyong Shen, Junyi Sun, Yi-Dong Shen","doi":"10.1109/ICDM.2008.75","DOIUrl":"https://doi.org/10.1109/ICDM.2008.75","url":null,"abstract":"In this paper, we propose a new variant of latent Dirichlet allocation (LDA): Collective LDA (C-LDA), for multiple corpora modeling. C-LDA combines multiple corpora during learning such that it can transfer knowledge from one corpus to another; meanwhile it keeps a discriminative node which represents the corpus ID to constrain the learned topics in each corpus. Compared with LDA locally applied to the target corpus, C-LDA results in refined topic-word distribution, while compared with applying LDA globally and straightforwardly to the combined corpus, C-LDA keeps each topic only for one corpus. We demonstrate that C-LDA has improved performance with these advantages by experiments on several benchmark document data sets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127944027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and Predicting the Helpfulness of Online Reviews","authors":"Yang Liu, Xiangji Huang, Aijun An, Xiaohui Yu","doi":"10.1109/ICDM.2008.94","DOIUrl":"https://doi.org/10.1109/ICDM.2008.94","url":null,"abstract":"Online reviews provide a valuable resource for potential customers to make purchase decisions. However, the sheer volume of available reviews as well as the large variations in the review quality present a big impediment to the effective use of the reviews, as the most helpful reviews may be buried in the large amount of low quality reviews. The goal of this paper is to develop models and algorithms for predicting the helpfulness of reviews, which provides the basis for discovering the most helpful reviews for given products. We first show that the helpfulness of a review depends on three important factors: the reviewerpsilas expertise, the writing style of the review, and the timeliness of the review. Based on the analysis of those factors, we present a nonlinear regression model for helpfulness prediction. Our empirical study on the IMDB movie reviews dataset demonstrates that the proposed approach is highly effective.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128836674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequence Mining Automata: A New Technique for Mining Frequent Sequences under Regular Expressions","authors":"R. Trasarti, F. Bonchi, Bart Goethals","doi":"10.1109/ICDM.2008.111","DOIUrl":"https://doi.org/10.1109/ICDM.2008.111","url":null,"abstract":"In this paper we study the problem of mining frequent sequences satisfying a given regular expression. Previous approaches to solve this problem were focusing on its search space, pushing (in some way) the given regular expression to prune unpromising candidate patterns. On the contrary, we focus completely on the given input data and regular expression. We introduce sequence mining automata (SMA), a specialized kind of Petri Net that while reading input sequences, it produces for each sequence all and only the patterns contained in the sequence and that satisfy the given regular expression. Based on this automaton, we develop a family of algorithms. Our thorough experimentation on different datasets and application domains confirms that in many cases our methods outperform the current state of the art of frequent sequence mining algorithms using regular expressions (in some cases of orders of magnitude).","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116316157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Set Expansion of Named Entities Using the Web","authors":"Richard C. Wang, William W. Cohen","doi":"10.1109/ICDM.2008.145","DOIUrl":"https://doi.org/10.1109/ICDM.2008.145","url":null,"abstract":"Set expansion refers to expanding a partial set of \"seed\" objects into a more complete set. One system that does set expansion is SEAL (set expander for any language), which expands entities automatically by utilizing resources from the Web in a language independent fashion. In a previous study, SEAL showed good set expansion performance using three seed entities; however, when given a larger set of seeds (e.g., ten), SEAL's expansion method performs poorly. In this paper, we present iterative SEAL (iSEAL), which allows a user to provide many seeds. Briefly, iSEAL makes several calls to SEAL, each call using a small number of seeds. We also show that iSEAL can be used in a \"bootstrapping\" manner, where each call to SEAL uses a mixture of user-provided and self-generated seeds. We show that the bootstrapping version of iSEAL obtains better results than SEAL even when using fewer user-provided seeds. In addition, we compare the performance of various ranking algorithms used in iSEAL, and show that the choice of ranking method has a small effect on performance when all seeds are user-provided, but a large effect when iSEAL is bootstrapped. In particular, we show that random walk with restart is nearly as good as Bayesian sets with user-provided seeds, and performs best with bootstrapped seeds.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116345411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Wikipedia for Co-clustering Based Cross-Domain Text Classification","authors":"Pu Wang, C. Domeniconi, Jian Hu","doi":"10.1109/ICDM.2008.136","DOIUrl":"https://doi.org/10.1109/ICDM.2008.136","url":null,"abstract":"Traditional approaches to document classification requires labeled data in order to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available, and often too expensive to obtain. Given a learning task for which training data are not available, abundant labeled data may exist for a different but related domain. One would like to use the related labeled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has been previously proposed to tackle cross-domain text classification. In this work, we extend the idea underlying this approach by making the latent semantic relationship between the two domains explicit. This goal is achieved with the use of Wikipedia. As a result, the pathway that allows to propagate labels between the two domains not only captures common words, but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114323157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Releasing the SVM Classifier with Privacy-Preservation","authors":"Keng-Pei Lin, Ming-Syan Chen","doi":"10.1109/ICDM.2008.19","DOIUrl":"https://doi.org/10.1109/ICDM.2008.19","url":null,"abstract":"Support vector machine (SVM) is a widely used tool in classification problem. SVM solves a quadratic optimization problem to decide which instances of training dataset are support vectors, i.e., the necessarily informative instances to form the classifier. The support vectors are intact tuples taken from the training dataset. Releasing the SVM classifier to public use or shipping the SVM classifier to clients will disclose the private content of support vectors, violating the privacy-preservation requirement in some legal or commercial reasons. To the best of our knowledge, there has not been work extending the notion of privacy-preservation to releasing the SVM classifier. In this paper, we propose an approximation approach which post-processes the SVM classifier to protect the private content of support vectors. This approach is designed for the commonly used Gaussian radial basis function kernel. By applying this post-processor on the SVM classifier, the resulted privacy-preserving SVM classifier can be publicly released without exposing the private content of support vectors and is able to provide comparable classification accuracy to the original SVM classifier.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125599655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Aggregates over Multiple Sets","authors":"E. Cohen, Haim Kaplan","doi":"10.1109/ICDM.2008.110","DOIUrl":"https://doi.org/10.1109/ICDM.2008.110","url":null,"abstract":"Many datasets, including market basket data, text or hypertext documents, and measurement data collected in different nodes or time periods, are modeled as a collection of sets over a ground set of (weighted) items. We consider the problem of estimating basic aggregates such as the weight or selectivity of a subpopulation of the items. We extend classic summarization techniques based on sampling to this scenario when we have multiple sets and selection predicates based on membership in particular sets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126678060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}