{"title":"Inference Analysis in Privacy-Preserving Data Re-publishing","authors":"Guan Wang, Zutao Zhu, Wenliang Du, Zhouxuan Teng","doi":"10.1109/ICDM.2008.118","DOIUrl":"https://doi.org/10.1109/ICDM.2008.118","url":null,"abstract":"Privacy-Preserving Data Re-publishing (PPDR) deals with publishing microdata in dynamic scenarios. Due to privacy concerns, data must be disguised before being published. Research in privacy-preserving data publishing (PPDP) has proposed many such methods on static data. In PPDR, multiple appeared records can be used to infer private information of other records. Therefore, inference channels exist among different releases. To understand the privacy property of data re-publishing, we need to analyze the impact of these inference channels. Previous studies show such analysis when data are updated or disguised in special ways, however, no general method has been proposed. Using the Maximum Entropy Modeling method, we have developed a general solution. Our method can conduct inference analysis when data are arbitrarily updated or arbitrarily disguised using either generalization or bucketization, two most common data disguise methods in PPDR. Through analysis and experiments, we demonstrate the advantage and the effectiveness of our method.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126472788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Discovery of Statistically Significant Association Rules","authors":"W. Hämäläinen, M. Nykänen","doi":"10.1109/ICDM.2008.144","DOIUrl":"https://doi.org/10.1109/ICDM.2008.144","url":null,"abstract":"Searching statistically significant association rules is an important but neglected problem. Traditional association rules do not capture the idea of statistical dependence and the resulting rules can be spurious, while the most significant rules may be missing. This leads to erroneous models and predictions which often become expensive.The problem is computationally very difficult, because the significance is not a monotonic property. However, in this paper we prove several other properties, which can be used for pruning the search space. The properties are implemented in the StatApriori algorithm, which searches statistically significant, non-redundant association rules. Based on both theoretical and empirical observations, the resulting rules are very accurate compared to traditional association rules. In addition, StatApriori can work with extremely low frequencies, thus finding new interesting rules.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126002704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erheng Zhong, Sihong Xie, W. Fan, Jiangtao Ren, Jing Peng, Kun Zhang
{"title":"Graph-Based Iterative Hybrid Feature Selection","authors":"Erheng Zhong, Sihong Xie, W. Fan, Jiangtao Ren, Jing Peng, Kun Zhang","doi":"10.1109/ICDM.2008.63","DOIUrl":"https://doi.org/10.1109/ICDM.2008.63","url":null,"abstract":"When the number of labeled examples is limited, traditional supervised feature selection techniques often fail due to sample selection bias or unrepresentative sample problem. To solve this, semi-supervised feature selection techniques exploit the statistical information of both labeled and unlabeled examples in the same time. However, the results of semi-supervised feature selection can be at times unsatisfactory, and the culprit is on how to effectively use the unlabeled data. Quite different from both supervised and semi-supervised feature selection, we propose a ldquohybridrdquoframework based on graph models. We first apply supervised methods to select a small set of most critical features from the labeled data. Importantly, these initial features might otherwise be missed when selection is performed on the labeled and unlabeled examples simultaneously. Next,this initial feature set is expanded and corrected with the use of unlabeled data. We formally analyze why the expected performance of the hybrid framework is better than both supervised and semi-supervised feature selection. Experimental results demonstrate that the proposed method outperforms both traditional supervised and state-of-the-art semi-supervised feature selection algorithms by at least 10% inaccuracy on a number of text and biomedical problems with thousands of features to choose from. Software and dataset is available from the authors.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122330828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering Geospatial Objects via Hidden Markov Random Fields","authors":"Makoto Sato, S. Imahara","doi":"10.1109/ICDM.2008.70","DOIUrl":"https://doi.org/10.1109/ICDM.2008.70","url":null,"abstract":"This paper addresses the problem of clustering objects located and correlated geographically and containing multiple attributes. For the clustering problem, it is necessary to consider both the similarities of the attributes and the spatial dependencies of the objects. A new clustering framework using hidden Markov random fields (HMRFs) and Gaussian distributions and new potential models of HMRFs for irregularly located geospatial objects are proposed in this paper. Experimental results for systematic data and two real-world data showed the availability of the proposed algorithms.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"40 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128167148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring Proximity on Graphs with Side Information","authors":"Hanghang Tong, Huiming Qu, H. Jamjoom","doi":"10.1109/ICDM.2008.42","DOIUrl":"https://doi.org/10.1109/ICDM.2008.42","url":null,"abstract":"This paper studies how to incorporate side information (such as users' feedback) in measuring node proximity on large graphs. Our method (ProSIN) is motivated by the well-studied random walk with restart (RWR). The basic idea behind ProSIN is to leverage side information to refine the graph structure so that the random walk is biased towards/away from some specific zones on the graph. Our case studies demonstrate that ProSIN is well-suited in a variety of applications, including neighborhood search, center-piece subgraphs, and image caption. Given the potential computational complexity of ProSIN, we also propose a fast algorithm (Fast-ProSIN) that exploits the smoothness of the graph structures with/without side information. Our experimental evaluation shows that fast-ProSIN achieves significant speedups (up to 49x) over straightforward implementations.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130368752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing the Stability of Spectral Ordering with Sparsification and Partial Supervision: Application to Paleontological Data","authors":"Dimitrios Mavroeidis, Ella Bingham","doi":"10.1109/ICDM.2008.120","DOIUrl":"https://doi.org/10.1109/ICDM.2008.120","url":null,"abstract":"Recent studies have demonstrated the prospects of data mining algorithms for addressing the task of seriation in paleontological data (i.e. the age-based ordering of the sites of excavation). A prominent approach is spectral ordering that computes a similarity measure between the sites and orders them such that similar sites become adjacent and dissimilar sites are placed far apart. In the paleontological domain, the similarity measure is based on the mammal genera whose remains are retrieved at each site of excavation. Although spectral ordering achieves good performance in the seriation task, it ignores the background knowledge that is naturally present in the domain, as paleontologists can derive the ages of the sites of excavation within some accuracy. On the other hand, the age information is uncertain, so the best approach would be to combine the background knowledge with the information on mammal co-occurrences. Motivated by this kind of partial supervision we propose a novel semi-supervised spectral ordering algorithm. Our algorithm modifies the Laplacian matrix used in spectral ordering, such that domain knowledge of the ordering is taken into account. Also, it performs feature selection (sparsification) by discarding features that contribute most to the unwanted variability of the data in bootstrap sampling. The theoretical properties of the proposed algorithm are thoroughly analyzed and it is demonstrated that the proposed framework enhances the stability of the spectral ordering output and induces computational gains.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"304 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132916505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Significant Patterns in Multi-stream Sequences","authors":"Robert Gwadera, F. Crestani","doi":"10.1109/ICDM.2008.146","DOIUrl":"https://doi.org/10.1109/ICDM.2008.146","url":null,"abstract":"Discovering significant patterns in synchronized multi-stream sequences also known as multi-attribute event sequences (multi-sequences), is an important problem in many domains, including monitoring systems and information retrieval. In this paper we propose a new approach for assessing significance of multi-stream patterns in multi-attribute event sequences. In experiments on physiological multi-stream data we show applicability of our method.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129335712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining","authors":"S. Papadimitriou, Jimeng Sun","doi":"10.1109/ICDM.2008.142","DOIUrl":"https://doi.org/10.1109/ICDM.2008.142","url":null,"abstract":"Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, graph mining. We propose the distributed co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing, and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122340564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Chaoji, M. Hasan, Saeed Salem, Mohammed J. Zaki
{"title":"SPARCL: Efficient and Effective Shape-Based Clustering","authors":"V. Chaoji, M. Hasan, Saeed Salem, Mohammed J. Zaki","doi":"10.1109/ICDM.2008.73","DOIUrl":"https://doi.org/10.1109/ICDM.2008.73","url":null,"abstract":"Clustering is one of the fundamental data mining tasks. Many different clustering paradigms have been developed over the years, which include partitional, hierarchical, mixture model based, density-based, spectral, subspace, and so on. The focus of this paper is on full-dimensional, arbitrary shaped clusters. Existing methods for this problem suffer either in terms of the memory or time complexity (quadratic or even cubic). This shortcoming has restricted these algorithms to datasets of moderate sizes. In this paper we propose SPARCL, a simple and scalable algorithm for finding clusters with arbitrary shapes and sizes, and it has linear space and time complexity. SPARCL consists of two stages - the first stage runs a carefully initialized version of the K-means algorithm to generate many small seed clusters. The second stage iteratively merges the generated clusters to obtain the final shape-based clusters. Experiments were conducted on a variety of datasets to highlight the effectiveness, efficiency, and scalability of our approach. On the large datasets SPARCL is an order of magnitude faster than the best existing approaches.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122612892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HIREL: An Incremental Clustering Algorithm for Relational Datasets","authors":"Tao Li, S. Anand","doi":"10.1109/ICDM.2008.116","DOIUrl":"https://doi.org/10.1109/ICDM.2008.116","url":null,"abstract":"Traditional clustering approaches usually analyze static datasets in which objects are kept unchanged after being processed, but many practical datasets are dynamically modified which means some previously learned patterns have to be updated accordingly. Re-clustering the whole dataset from scratch is not a good choice due to the frequent data modifications and the limited out-of-service time, so the development of incremental clustering approaches is highly desirable. Besides that, propositional clustering algorithms are not suitable for relational datasets because of their quadratic computational complexity. In this paper, we propose an incremental clustering algorithm that requires only one pass of the relational dataset. The utilization of the Representative Objects and the balanced Search Tree greatly accelerate the learning procedure. Experimental results prove the effectiveness of our algorithm.","PeriodicalId":252958,"journal":{"name":"2008 Eighth IEEE International Conference on Data Mining","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127351255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}