{"title":"Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm","authors":"Zuobing Xu, Christopher Hogan, Robert S. Bauer","doi":"10.1109/ICDMW.2009.38","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.38","url":null,"abstract":"Active learning algorithms actively select training examples to acquire labels from domain experts, which are very effective to reduce human labeling effort in the context of supervised learning. To reduce computational time in training, as well as provide more convenient user interaction environment, it is necessary to select batches of new training examples instead of a single example. Batch mode active learning algorithms incorporate a diversity measure to construct a batch of diversified candidate examples. Existing approaches use greedy algorithms to make it feasible to the scale of thousands of data. Greedy algorithms, however, are not efficient enough to scale to even larger real world classification applications, which contain millions of data. In this paper, we present an extremely efficient active learning algorithm. This new active learning algorithm achieves the same results as the traditional greedy algorithm, while the run time is reduced by a factor of several hundred times. We prove that the objective function of the algorithm is submodular, which guarantees to find the same solution as the greedy algorithm. 
We evaluate our approach on several large-scale real-world text classification problems, and show that our new approach achieves substantial speedups, while obtaining the same classification accuracy.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125244229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Minimally Supervised Learning Method for Semantic Term Classification - Experimental Results on Classifying Ratable Aspects Discussed in Customer Reviews","authors":"T. Nguyen, Takahiro Hayashi, R. Onai, Yuhei Nishioka, Takamasa Takenaka, Masaya Mori","doi":"10.1109/ICDMW.2009.58","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.58","url":null,"abstract":"We present Bautext, a new minimally supervised approach for automatically extracting ratable aspects from customer reviews and classifying them to some previously defined categories. Bautext requires a small amount of seed words as supervised data and uses a bootstrapping mechanism to progressively collect new members for each category. Learning new category members and the category-specific terms for each category at the same time is the unique and featured classification mechanism of Bautext. Category-specific terms are terms that play important roles for properly extracting new category members. Furthermore, we proposed to use an additional Trash category to filter non-purpose aspects, thus led to a significant improvement in precision score but could constrain the trade-off in decreasing recall score. Experimental results, conducted on a Japanese hotel review dataset, showed that Bautext outperforms the alternative techniques in all terms of precision, recall score and significantly in running time. 
And in the further comparison to Adaboost (as the state-of-the-art machine learning technique for semantic term classification task), we found that Adaboost requires about 50% training data to deliver a similar performance as Bautext does with less than ten selective seed words for each category.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131303488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"K-BestMatch Reconstruction and Comparison of Trajectory Data","authors":"M. Nanni, R. Trasarti","doi":"10.1109/ICDMW.2009.62","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.62","url":null,"abstract":"In this paper we propose a map matching method to overcome the limitations of standard best-match reconstruction strategies. We use a more flexible approach which considers the k-optimal alternative paths to reconstruct the trajectories from the GPS raw data. The preliminary results, obtained on a real dataset of car users in Milan area, suggest that our method leads to beneficial effects on the successive analysis to be performed such as KNN and clustering.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"76 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120891606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Paradigm Shift: Combined Literature and Ontology-Driven Data Mining for Discovering Novel Relations in Biomedical Domain","authors":"Y. Sebastian, B. C. Loh, P. Then","doi":"10.1109/ICDMW.2009.56","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.56","url":null,"abstract":"We introduce a novel domain-driven rule discovery and evaluation algorithm based on Swanson’s logical relation approach. Over more than a decade, rules have been mined from large biomedical datasets and been evaluated solely based on statistical properties of the rules or user-belief specifications. This approach faces tremendous challenges to determine novel, actionable and interesting rules. In this paper, we introduce a new paradigm in addressing rule interestingness problem using domain knowledge. We demonstrate that novel and interesting association rules can be discovered from large medical datasets based on its ability to infer previously unknown relations in biomedical domain. Our data mining algorithm shows that we can effectively achieve this task by incorporating biomedical domain knowledge by combining both literatures and ontology. We outline the conceptual-architectural framework for future implementation of this methodology.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129464641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A WordNet-Based Semantic Model for Enhancing Text Clustering","authors":"Shady Shehata","doi":"10.1109/ICDMW.2009.86","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.86","url":null,"abstract":"Most of text mining techniques are based on word and/or phrase analysis of the text. The statistical analysis of a term (word or phrase) frequency captures the importance of the term within a document. However, to achieve a more accurate analysis, the underlying mining technique should indicate terms that capture the semantics of the text from which the importance of a term in a sentence and in the document can be derived. Incorporating semantic features from the WordNet lexical database is one of many approaches that have been tried to improve the accuracy of text clustering techniques. A new semantic-based model that analyzes documents based on their meaning is introduced. The proposed model analyzes terms and their corresponding synonyms and/or hypernyms on the sentence and document levels. In this model, if two documents contain different words and these words are semantically related, the proposed model can measure the semantic-based similarity between the two documents. The similarity between documents relies on a new semantic-based similarity measure which is applied to the matching concepts between documents. Experiments using the proposed semantic-based model in text clustering are conducted. 
Experimental results demonstrate that the newly developed semantic-based model enhances the clustering quality of sets of documents substantially.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116280151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Motivating Complex Dependence Structures in Data Mining: A Case Study with Anomaly Detection in Climate","authors":"S. Kao, A. Ganguly, K. Steinhaeuser","doi":"10.1109/ICDMW.2009.37","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.37","url":null,"abstract":"While data mining aims to identify hidden knowledge from massive and high dimensional datasets, the importance of dependence structure among time, space, and between different variables is less emphasized. Analogous to the use of probability density functions in modeling individual variables, it is now possible to characterize the complete dependence space mathematically through the application of copulas. By adopting copulas, the multivariate joint probability distribution can be constructed without constraint to specific types of marginal distributions. Some common assumptions, like normality and independence between variables, can also be relieved. This study provides fundamental introduction and illustration of dependence structure, aimed at the potential applicability of copulas in general data mining. The case study in hydro-climatic anomaly detection shows that the frequency of multivariate anomalies is affected by the dependence level between variables. The appropriate multivariate thresholds can be determined through a copula-based approach.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131866765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting and Interpreting Variable Interactions in Observational Ornithology Data","authors":"Daria Sorokina, R. Caruana, Mirek Riedewald, W. Hochachka, S. Kelling","doi":"10.1109/ICDMW.2009.84","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.84","url":null,"abstract":"In this paper we demonstrate a practical approach to interaction detection on real data describing the abundance of different species of birds in the prairies east of the southern Rocky Mountains. This data is very noisy---predictive models built from it perform only slightly better than baseline. Previous approaches for interaction detection, including a recently proposed algorithm based on Additive Groves, often do not work well on such noisy data for a number of reasons. We describe the issues that appear when working with such data sets and suggest solutions to them. In the end, we discuss results of our analysis for several bird species.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132870986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatiotemporal Modeling and Monitoring of Atmospheric Hazardous Emissions Using Sensor Networks","authors":"G. Cervone, A. Stefanidis, P. Franzese, P. Agouris","doi":"10.1109/ICDMW.2009.67","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.67","url":null,"abstract":"A spatiotemporal methodology is presented for the analysis and visualization of atmospheric emissions in a metropolitan area. Numerical transport and dispersion models are used to build a library of time-dependent emissions of hazardous gases under various atmospheric conditions and from multiple potential sources in Washington DC. This library comprises representative emergency events that may involve natural or man-made hazardous emissions. To represent and analyze the events of this library we use the model of the spatiotemporal helix, which provides concise summaries of complex spatiotemporal events. We demonstrate the ability to compare emerging situations to library entries in order to predict their future evolution, thus recognizing potentially hazardous conditions early in their development.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124890269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Universal Text Classifier: Transfer Learning Using Encyclopedic Knowledge","authors":"Pu Wang, C. Domeniconi","doi":"10.1109/ICDMW.2009.101","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.101","url":null,"abstract":"Document classification is a key task for many text mining applications. However, traditional text classification requires labeled data to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available. In this work, we propose a universal text classifier, which does not require any labeled document. Our approach simulates the capability of people to classify documents based on background knowledge. As such, we build a classifier that can effectively group documents based on their content, under the guidance of few words describing the classes of interest. Background knowledge is modeled using encyclopedic knowledge, namely Wikipedia. The universal text classifier can also be used to perform document retrieval. In our experiments with real data we test the feasibility of our approach for both the classification and retrieval tasks.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129616788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Instance Transfer Learning","authors":"Dan Zhang, Luo Si","doi":"10.1109/ICDMW.2009.72","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.72","url":null,"abstract":"Transfer Learning is a very important branch in both Machine Learning and Data Mining. Its main objective is to transfer knowledge across domains, tasks and distributions that are similar but not the same. Currently, almost all of the transfer learning methods are designed to deal with the traditional single instance learning problems. However, in many real-world applications, such as drug design, Localized Content Based Image Retrieval (LCBIR), Text Categorization, we have to deal with multiple instance problems, where training patterns are given as bags and each bag consists of some instances. This paper formulates a novel Multiple Instance Transfer Learning (MITL) problem and suggests a method to solve it. An extensive set of empirical results demonstrate the advantages of the proposed method against several existing ones.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117065422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}