{"title":"Tree-Based Approach to Missing Data Imputation","authors":"P. Vateekul, Kanoksri Sarinnapakorn","doi":"10.1109/ICDMW.2009.92","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.92","url":null,"abstract":"Missing data is a well-recognized issue in data mining, and imputation is one way to handle the problem. In this paper, we propose a novel tree-based imputation algorithm called “Imputation Tree” (ITree). It first studies the predictability of missingness using all observations by constructing a binary classification tree called “Missing Pattern Tree” (MPT). Then, missing values in each cluster or terminal node are estimated by a regression tree of observations at that node. We present empirical results using both synthetic and real data. Almost all experiments demonstrate that ITree is superior to other commonly used methods in estimating missing values. The algorithm not only produces an impressive accuracy, but also provides information on the nature of missingness.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128437939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic-Rich Markov Models for Web Prefetching","authors":"Nizar R. Mabroukeh, C. Ezeife","doi":"10.1109/ICDMW.2009.18","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.18","url":null,"abstract":"Domain knowledge for web applications is currently being made available as domain ontology with the advent of the semantic web, in which semantics govern relationships among objects of interest (e.g., commercial items to be purchased in an e-Commerce web site). Our earlier work proposed to integrate semantic information into all phases of the web usage mining process, for an intelligent semantics-aware web usage mining framework. There are ways to integrate semantic information into Markov models used in the third phase for next page request prediction. Semantic information is combined with the transition probability matrix of a Markov model. This way, it provides a low order Markov model with intelligent, accurate predictions and less complexity than higher order models, also solving the problem of contradicting prediction. This paper proposes to use semantic information to prune states in Selective Markov Models (SMM); semantic information can lead to context-aware higher order Markov models with about 16% less space complexity.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128643539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Effective Network Partitioning Algorithm Based on Two-Point Diffusing Strategy","authors":"Chengying Mao","doi":"10.1109/ICDMW.2009.26","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.26","url":null,"abstract":"Network modeling and analysis have played important roles in the fields of physics, sociology, biology, and computer science. Recently, community structure has been considered an important characteristic of complex networks, and its detection can bring great benefit in real world affairs. In this paper, a new heuristic algorithm based on a two-point diffusing strategy is proposed. At first, two pseudo-core points are identified according to the clue of the longest path in a network. Then, two embryonic communities and an undecided node set are generated by performing a diffusing operation on these two points. Subsequently, an experience rule is used to classify the undecided nodes to form the final community structure. In addition, the effectiveness and efficiency are validated by comparison experiments with four real-world networks. The experiment results show that our TPD algorithm can yield better community partition results and shorter computing time than the existing classical community detecting algorithms.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128656654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why Naive Ensembles Do Not Work in Cloud Computing","authors":"Wenxuan Gao, R. Grossman, Philip S. Yu, Yunhong Gu","doi":"10.1109/ICDMW.2009.85","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.85","url":null,"abstract":"One of the greatest challenges of data mining is dealing with very large datasets. Cloud computing has demonstrated great advantages in processing very large datasets. When considering taking advantage of the high performance data cloud to do data mining, there are different approaches to make an existing data mining algorithm parallelizable in a cloud computing environment. One concern is how to achieve better performance by making use of the data in a more intelligent way. In this paper, we describe two different approaches to parallelize the existing random decision tree mining algorithm, which we have built on the Sector/Sphere cloud computing environment. We compare the cost and accuracy between those two different implementations and analyze the result of this experimental study.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130718516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TagLearner: A P2P Classifier Learning System from Collaboratively Tagged Text Documents","authors":"Haimonti Dutta, Xianshu Zhu, Tushar Mahule, H. Kargupta, K. Borne, Codrina Lauth, Florian Holz, Gerhard Heyer","doi":"10.1109/ICDMW.2009.90","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.90","url":null,"abstract":"The amount of text data on the Internet is growing at a very fast rate. Online text repositories for news agencies, digital libraries and other organizations currently store gigabytes and terabytes of data. Large amounts of unstructured text pose a serious challenge for data mining and knowledge extraction. End user participation coupled with distributed computation can play a crucial role in meeting these challenges. In many applications involving classification of text documents, web users often participate in the tagging process. This collaborative tagging results in the formation of large scale Peer-to-Peer (P2P) systems which can function, scale and self-organize in the presence of a highly transient population of nodes and do not need a central server for co-ordination. In this paper, we describe TagLearner, a P2P classifier learning system for extracting patterns from text data where the end users can participate both in the task of labeling the data and building a distributed classifier on it. We present a novel distributed linear programming based classification algorithm which is asynchronous in nature. The paper also provides extensive empirical results on text data obtained from an online repository - the NSF Abstracts Data.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134274862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Attack on the Privacy of Sanitized Data that Fuses the Outputs of Multiple Data Miners","authors":"Michal Sramka, R. Safavi-Naini, J. Denzinger","doi":"10.1109/ICDMW.2009.28","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.28","url":null,"abstract":"Data sanitization has been used to restrict re-identification of individuals and disclosure of sensitive information from published data. We propose an attack on the privacy of the published sanitized data that simply fuses outputs of multiple data miners that are applied to the sanitized data. That attack is practical and does not require any background or additional information. We use a number of experiments to show scenarios where an adversary can combine outputs of multiple miners using a simple fusion strategy to increase their success chance of breaching privacy of individuals whose data is stored in the database. The fusion attack provides a powerful method of breaching privacy in the form of partial disclosure, for both anonymized and perturbed data. It also provides an effective way of approximating predictions of the best miner (a miner that provides the best results among all considered miners) when this miner cannot be determined.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114581056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Semi-supervised Framework for Simultaneous Classification and Regression of Zero-Inflated Time Series Data with Application to Precipitation Prediction","authors":"Zubin Abraham, P. Tan","doi":"10.1109/ICDMW.2009.80","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.80","url":null,"abstract":"Time series data with an abundant number of zeros are common in many applications, including climate and ecological modeling, disease monitoring, manufacturing defect detection, and traffic accident monitoring. Classical regression models are inappropriate to handle data with such a skewed distribution because they tend to underestimate the frequency of zeros and the magnitude of non-zero values in the data. This paper presents a hybrid framework that simultaneously performs classification and regression to accurately predict future values of a zero-inflated time series. A classifier is initially used to determine whether the value at a given time step is zero while a regression model is invoked to estimate its magnitude only if the predicted value has been classified as nonzero. The proposed framework is extended to a semi-supervised learning setting via graph regularization. The effectiveness of the framework is demonstrated via its application to the precipitation prediction problem for climate impact assessment studies.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116223883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature Selection for Maximizing the Area Under the ROC Curve","authors":"Rui Wang, K. Tang","doi":"10.1109/ICDMW.2009.25","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.25","url":null,"abstract":"Feature selection is an important pre-processing step for solving classification problems. A good feature selection method may not only improve the performance of the final classifier, but also reduce the computational complexity of it. Traditionally, feature selection methods were developed to maximize the classification accuracy of a classifier. Recently, both theoretical and experimental studies revealed that a classifier with the highest accuracy might not be ideal in real-world problems. Instead, the Area Under the ROC Curve (AUC) has been suggested as the alternative metric, and many existing learning algorithms have been modified in order to seek the classifier with maximum AUC. However, little work was done to develop new feature selection methods to suit the requirement of AUC maximization. To fill this gap in the literature, we propose in this paper a novel algorithm, called AUC and Rank Correlation coefficient Optimization (ARCO) algorithm. ARCO adopts the general framework of a well-known method, namely minimal-redundancy-maximal-relevance (mRMR) criterion, but defines the terms “relevance” and “redundancy” in totally different ways. Such a modification looks trivial from the perspective of algorithmic design. Nevertheless, experimental study on four gene expression data sets showed that feature subsets obtained by ARCO resulted in classifiers with significantly larger AUC than the feature subsets obtained by mRMR. Moreover, ARCO also outperformed the Feature Assessment by Sliding Thresholds algorithm, which was recently proposed for AUC maximization, and thus the efficacy of ARCO was validated.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122927520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Domain Specific Concepts within User-Generated Taxonomies","authors":"Jonathan Klinginsmith, M. Mahoui, Yuqing Wu, Josette F. Jones","doi":"10.1109/ICDMW.2009.57","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.57","url":null,"abstract":"Collaborative tagging of resources on the Web has become a commonplace occurrence. Web sites allowing resources to be tagged provide a tremendous amount of user-generated taxonomic information. However, information seekers are hindered by the lack of organization within these tags as well as the multitude of domains encompassed within these sites. To address these issues, we propose a multi-step approach for creating domain specific concept hierarchies from collaborative tags. Each concept hierarchy is based on domain specific subject matters, which may span more than one tag, as opposed to related work which are only concerned with the relationships between single tags.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124545289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theoretically Optimal Distributed Anomaly Detection","authors":"A. Lazarevic, Nisheeth Srivastava, Ashutosh Tiwari, Joshua D. Isom, N. Oza, J. Srivastava","doi":"10.1109/ICDMW.2009.40","DOIUrl":"https://doi.org/10.1109/ICDMW.2009.40","url":null,"abstract":"A novel general framework for distributed anomaly detection with theoretical performance guarantees is proposed. Our algorithmic approach combines existing anomaly detection procedures with a novel method for computing global statistics using local sufficient statistics. Under a Gaussian assumption, our distributed algorithm is guaranteed to perform as well as its centralized counterpart, a condition we call ‘zero information loss’. We further report experimental results on synthetic as well as real-world data to demonstrate the viability of our approach.","PeriodicalId":351078,"journal":{"name":"2009 IEEE International Conference on Data Mining Workshops","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122998603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}