{"title":"Learning Bayesian Networks: A MAP Criterion for Joint Selection of Model Structure and Parameter","authors":"C. Riggelsen","doi":"10.1109/ICDM.2008.14","DOIUrl":"https://doi.org/10.1109/ICDM.2008.14","url":null,"abstract":"For learning Bayesian Network (BN) structures, it has become common practice to use the Bayesian Dirichlet (BD) scoring criterion. In contrast to most other scoring metrics that functionally can be interpreted as regularized maximum likelihood criteria, the BD metric cannot be considered as such. The functional dissimilarity of the BD metric compared to other metrics is an obstacle from an analytical point of view; this is for instance becomes clear in the context of the structural EM algorithm for learning BNs from incomplete data. Also, it is not easy to pin-point why exactly and to what extend regularization is taken care of by applying the BD metric. We introduce a Bayesian scoring criterion that is closely related to the BD metric, but solves the obvious disadvantages of the BD metric. We arrive at this result by using the same basic assumptions as for the BD metric, but in contrast to the BD metric, where focus is on learning the model structure only, we aim at learning the most probable BN pair jointly, i.e., model structure and the parameter are selected as a pair. This approach yields a scoring metric that has the functional form of a regularized maximum likelihood metric. We perform experiments, and show that this MAP BN metric also yields better results than the BIC and BD metrics on independent test data.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128079522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction of Skin Penetration Using Machine Learning Methods","authors":"Yi Sun, G. Moss, Maria Prapopoulou, R. Adams, Marc B. Brown, N. Davey","doi":"10.1109/ICDM.2008.97","DOIUrl":"https://doi.org/10.1109/ICDM.2008.97","url":null,"abstract":"Improving predictions of the skin permeability coefficient is a difficult problem. It is also an important issue with the increasing use of skin patches as a means of drug delivery. In this work, we apply K-nearest-neighbour regression, single layer networks, mixture of experts and Gaussian processes to predict the permeability coefficient. We obtain a considerable improvement over the quantitative structure-activity relationship (QSARs) predictors. We show that using five features, which are molecular weight, solubility parameter, lipophilicity, the number of hydrogen bonding acceptor and donor groups, can produce better predictions than the one using only lipophilicity and the molecular weight. The Gaussian process regression with five compound features gives the best performance in this work.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132936094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data","authors":"M. Masud, Jing Gao, L. Khan, Jiawei Han, B. Thuraisingham","doi":"10.1109/ICDM.2008.152","DOIUrl":"https://doi.org/10.1109/ICDM.2008.152","url":null,"abstract":"Recent approaches in classifying evolving data streams are based on supervised learning algorithms, which can be trained with labeled data only. Manual labeling of data is both costly and time consuming. Therefore, in a real streaming environment, where huge volumes of data appear at a high speed, labeled data may be very scarce. Thus, only a limited amount of training data may be available for building the classification models, leading to poorly trained classifiers. We apply a novel technique to overcome this problem by building a classification model from a training set having both unlabeled and a small amount of labeled instances. This model is built as micro-clusters using semi-supervised clustering technique and classification is performed with kappa-nearest neighbor algorithm. An ensemble of these models is used to classify the unlabeled data. Empirical evaluation on both synthetic data and real botnet traffic reveals that our approach, using only a small amount of labeled data for training, outperforms state-of-the-art stream classification algorithms that use twenty times more labeled data than our approach.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122865404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collective Latent Dirichlet Allocation","authors":"Zhiyong Shen, Junyi Sun, Yi-Dong Shen","doi":"10.1109/ICDM.2008.75","DOIUrl":"https://doi.org/10.1109/ICDM.2008.75","url":null,"abstract":"In this paper, we propose a new variant of latent Dirichlet allocation (LDA): Collective LDA (C-LDA), for multiple corpora modeling. C-LDA combines multiple corpora during learning such that it can transfer knowledge from one corpus to another; meanwhile it keeps a discriminative node which represents the corpus ID to constrain the learned topics in each corpus. Compared with LDA locally applied to the target corpus, C-LDA results in refined topic-word distribution, while compared with applying LDA globally and straightforwardly to the combined corpus, C-LDA keeps each topic only for one corpus. We demonstrate that C-LDA has improved performance with these advantages by experiments on several benchmark document data sets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127944027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and Predicting the Helpfulness of Online Reviews","authors":"Yang Liu, Xiangji Huang, Aijun An, Xiaohui Yu","doi":"10.1109/ICDM.2008.94","DOIUrl":"https://doi.org/10.1109/ICDM.2008.94","url":null,"abstract":"Online reviews provide a valuable resource for potential customers to make purchase decisions. However, the sheer volume of available reviews as well as the large variations in the review quality present a big impediment to the effective use of the reviews, as the most helpful reviews may be buried in the large amount of low quality reviews. The goal of this paper is to develop models and algorithms for predicting the helpfulness of reviews, which provides the basis for discovering the most helpful reviews for given products. We first show that the helpfulness of a review depends on three important factors: the reviewerpsilas expertise, the writing style of the review, and the timeliness of the review. Based on the analysis of those factors, we present a nonlinear regression model for helpfulness prediction. Our empirical study on the IMDB movie reviews dataset demonstrates that the proposed approach is highly effective.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128836674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequence Mining Automata: A New Technique for Mining Frequent Sequences under Regular Expressions","authors":"R. Trasarti, F. Bonchi, Bart Goethals","doi":"10.1109/ICDM.2008.111","DOIUrl":"https://doi.org/10.1109/ICDM.2008.111","url":null,"abstract":"In this paper we study the problem of mining frequent sequences satisfying a given regular expression. Previous approaches to solve this problem were focusing on its search space, pushing (in some way) the given regular expression to prune unpromising candidate patterns. On the contrary, we focus completely on the given input data and regular expression. We introduce sequence mining automata (SMA), a specialized kind of Petri Net that while reading input sequences, it produces for each sequence all and only the patterns contained in the sequence and that satisfy the given regular expression. Based on this automaton, we develop a family of algorithms. Our thorough experimentation on different datasets and application domains confirms that in many cases our methods outperform the current state of the art of frequent sequence mining algorithms using regular expressions (in some cases of orders of magnitude).","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116316157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Set Expansion of Named Entities Using the Web","authors":"Richard C. Wang, William W. Cohen","doi":"10.1109/ICDM.2008.145","DOIUrl":"https://doi.org/10.1109/ICDM.2008.145","url":null,"abstract":"Set expansion refers to expanding a partial set of \"seed\" objects into a more complete set. One system that does set expansion is SEAL (set expander for any language), which expands entities automatically by utilizing resources from the Web in a language independent fashion. In a previous study, SEAL showed good set expansion performance using three seed entities; however, when given a larger set of seeds (e.g., ten), SEAL's expansion method performs poorly. In this paper, we present iterative SEAL (iSEAL), which allows a user to provide many seeds. Briefly, iSEAL makes several calls to SEAL, each call using a small number of seeds. We also show that iSEAL can be used in a \"bootstrapping\" manner, where each call to SEAL uses a mixture of user-provided and self-generated seeds. We show that the bootstrapping version of iSEAL obtains better results than SEAL even when using fewer user-provided seeds. In addition, we compare the performance of various ranking algorithms used in iSEAL, and show that the choice of ranking method has a small effect on performance when all seeds are user-provided, but a large effect when iSEAL is bootstrapped. In particular, we show that random walk with restart is nearly as good as Bayesian sets with user-provided seeds, and performs best with bootstrapped seeds.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116345411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Wikipedia for Co-clustering Based Cross-Domain Text Classification","authors":"Pu Wang, C. Domeniconi, Jian Hu","doi":"10.1109/ICDM.2008.136","DOIUrl":"https://doi.org/10.1109/ICDM.2008.136","url":null,"abstract":"Traditional approaches to document classification requires labeled data in order to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available, and often too expensive to obtain. Given a learning task for which training data are not available, abundant labeled data may exist for a different but related domain. One would like to use the related labeled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has been previously proposed to tackle cross-domain text classification. In this work, we extend the idea underlying this approach by making the latent semantic relationship between the two domains explicit. This goal is achieved with the use of Wikipedia. As a result, the pathway that allows to propagate labels between the two domains not only captures common words, but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114323157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Releasing the SVM Classifier with Privacy-Preservation","authors":"Keng-Pei Lin, Ming-Syan Chen","doi":"10.1109/ICDM.2008.19","DOIUrl":"https://doi.org/10.1109/ICDM.2008.19","url":null,"abstract":"Support vector machine (SVM) is a widely used tool in classification problem. SVM solves a quadratic optimization problem to decide which instances of training dataset are support vectors, i.e., the necessarily informative instances to form the classifier. The support vectors are intact tuples taken from the training dataset. Releasing the SVM classifier to public use or shipping the SVM classifier to clients will disclose the private content of support vectors, violating the privacy-preservation requirement in some legal or commercial reasons. To the best of our knowledge, there has not been work extending the notion of privacy-preservation to releasing the SVM classifier. In this paper, we propose an approximation approach which post-processes the SVM classifier to protect the private content of support vectors. This approach is designed for the commonly used Gaussian radial basis function kernel. By applying this post-processor on the SVM classifier, the resulted privacy-preserving SVM classifier can be publicly released without exposing the private content of support vectors and is able to provide comparable classification accuracy to the original SVM classifier.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125599655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Aggregates over Multiple Sets","authors":"E. Cohen, Haim Kaplan","doi":"10.1109/ICDM.2008.110","DOIUrl":"https://doi.org/10.1109/ICDM.2008.110","url":null,"abstract":"Many datasets, including market basket data, text or hypertext documents, and measurement data collected in different nodes or time periods, are modeled as a collection of sets over a ground set of (weighted) items. We consider the problem of estimating basic aggregates such as the weight or selectivity of a subpopulation of the items. We extend classic summarization techniques based on sampling to this scenario when we have multiple sets and selection predicates based on membership in particular sets.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126678060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}