{"title":"A scalable two-stage approach for a class of dimensionality reduction techniques","authors":"Liang Sun, Betul Ceran, Jieping Ye","doi":"10.1145/1835804.1835846","DOIUrl":"https://doi.org/10.1145/1835804.1835846","url":null,"abstract":"Dimensionality reduction plays an important role in many data mining applications involving high-dimensional data. Many existing dimensionality reduction techniques can be formulated as a generalized eigenvalue problem, which does not scale to large-size problems. Prior work transforms the generalized eigenvalue problem into an equivalent least squares formulation, which can then be solved efficiently. However, the equivalence relationship only holds under certain assumptions without regularization, which severely limits their applicability in practice. In this paper, an efficient two-stage approach is proposed to solve a class of dimensionality reduction techniques, including Canonical Correlation Analysis, Orthonormal Partial Least Squares, linear Discriminant Analysis, and Hypergraph Spectral Learning. The proposed two-stage approach scales linearly in terms of both the sample size and data dimensionality. The main contributions of this paper include (1) we rigorously establish the equivalence relationship between the proposed two-stage approach and the original formulation without any assumption; and (2) we show that the equivalence relationship still holds in the regularization setting. We have conducted extensive experiments using both synthetic and real-world data sets. Our experimental results confirm the equivalence relationship established in this paper. Results also demonstrate the scalability of the proposed two-stage approach.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82384231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Document clustering via dirichlet process mixture model with feature selection","authors":"Guan Yu, Rui-zhang Huang, Zhaojun Wang","doi":"10.1145/1835804.1835901","DOIUrl":"https://doi.org/10.1145/1835804.1835901","url":null,"abstract":"One essential issue of document clustering is to estimate the appropriate number of clusters for a document collection to which documents should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters while the number of document clusters is determined by the Dirichlet process mixture model automatically; 2) to identify the discriminative words and separate them from irrelevant noise words via stochastic search variable selection technique. We explore the performance of our proposed approach on both a synthetic dataset and several realistic document datasets. The comparison between our proposed approach and stage-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79895791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GLS-SOD: a generalized local statistical approach for spatial outlier detection","authors":"F. Chen, Chang-Tien Lu, Arnold P. Boedihardjo","doi":"10.1145/1835804.1835939","DOIUrl":"https://doi.org/10.1145/1835804.1835939","url":null,"abstract":"Local based approach is a major category of methods for spatial outlier detection (SOD). Currently, there is a lack of systematic analysis on the statistical properties of this framework. For example, most methods assume identical and independent normal distributions (i.i.d. normal) for the calculated local differences, but no justifications for this critical assumption have been presented. The methods' detection performance on geostatistic data with linear or nonlinear trend is also not well studied. In addition, there is a lack of theoretical connections and empirical comparisons between local and global based SOD approaches. This paper discusses all these fundamental issues under the proposed Generalized Local Statistical (GLS) framework. Furthermore, robust estimation and outlier detection methods are designed for the new GLS model. Extensive simulations demonstrated that the SOD method based on the GLS model significantly outperformed all existing approaches when the spatial data exhibits a linear or nonlinear trend.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91381913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data winnowing","authors":"Y. Freund","doi":"10.1145/1835804.1835806","DOIUrl":"https://doi.org/10.1145/1835804.1835806","url":null,"abstract":"Massive quantities of digital data are being collected in every aspect of modern life. Examples include Personal photos and videos, biological and medical images and recordings from sensor arrays. To transform these massive data streams into useful information we use a sequence of \"winnowing\" stages. Each step reduces the size of the data by an order of magnitude; extracting the wheat form the chaff. In this talk I will describe this approach in a variety of contexts, ranging from the analysis of genetic pathways in fruit-fly embryos and C-Elegans worms to counting birds and helping elderly people living alone keep in touch with their family and caregivers.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81860181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cold start link prediction","authors":"V. Leroy, B. B. Cambazoglu, F. Bonchi","doi":"10.1145/1835804.1835855","DOIUrl":"https://doi.org/10.1145/1835804.1835855","url":null,"abstract":"In the traditional link prediction problem, a snapshot of a social network is used as a starting point to predict, by means of graph-theoretic measures, the links that are likely to appear in the future. In this paper, we introduce cold start link prediction as the problem of predicting the structure of a social network when the network itself is totally missing while some other information regarding the nodes is available. We propose a two-phase method based on the bootstrap probabilistic graph. The first phase generates an implicit social network under the form of a probabilistic graph. The second phase applies probabilistic graph-based measures to produce the final prediction. We assess our method empirically over a large data collection obtained from Flickr, using interest groups as the initial information. The experiments confirm the effectiveness of our approach.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78193228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Yan, Glenn Fung, Jennifer G. Dy, Rómer Rosales
{"title":"Medical coding classification by leveraging inter-code relationships","authors":"Yan Yan, Glenn Fung, Jennifer G. Dy, Rómer Rosales","doi":"10.1145/1835804.1835831","DOIUrl":"https://doi.org/10.1145/1835804.1835831","url":null,"abstract":"Medical coding or classification is the process of transforming information contained in patient medical records into standard predefined medical codes. There are several worldwide accepted medical coding conventions associated with diagnoses and medical procedures; however, in the United States the Ninth Revision of ICD(ICD-9) provides the standard for coding clinical records. Accurate medical coding is important since it is used by hospitals for insurance billing purposes. Since after discharge a patient can be assigned or classified to several ICD-9 codes, the coding problem can be seen as a multi-label classification problem. In this paper, we introduce a multi-label large-margin classifier that automatically learns the underlying inter-code structure and allows the controlled incorporation of prior knowledge about medical code relationships. In addition to refining and learning the code relationships, our classifier can also utilize this shared information to improve its performance. Experiments on a publicly available dataset containing clinical free text and their associated medical codes showed that our proposed multi-label classifier outperforms related multi-label models in this problem.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"331 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77141746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Indexing and mining time sequences","authors":"Lei Li, C. Faloutsos","doi":"10.1145/1835804.1866295","DOIUrl":"https://doi.org/10.1145/1835804.1866295","url":null,"abstract":"","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77896578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data mining with differential privacy","authors":"Arik Friedman, A. Schuster","doi":"10.1145/1835804.1835868","DOIUrl":"https://doi.org/10.1145/1835804.1835868","url":null,"abstract":"We consider the problem of data mining with formal privacy guarantees, given a data access interface based on the differential privacy framework. Differential privacy requires that computations be insensitive to changes in any particular individual's record, thereby restricting data leaks through the results. The privacy preserving interface ensures unconditionally safe access to the data and does not require from the data miner any expertise in privacy. However, as we show in the paper, a naive utilization of the interface to construct privacy preserving data mining algorithms could lead to inferior data mining results. We address this problem by considering the privacy and the algorithmic requirements simultaneously, focusing on decision tree induction as a sample application. The privacy mechanism has a profound effect on the performance of the methods chosen by the data miner. We demonstrate that this choice could make the difference between an accurate classifier and a completely useless one. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as the naive implementation but with an order of magnitude fewer learning samples.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74292963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online discovery and maintenance of time series motifs","authors":"A. Mueen, Eamonn J. Keogh","doi":"10.1145/1835804.1835941","DOIUrl":"https://doi.org/10.1145/1835804.1835941","url":null,"abstract":"The detection of repeated subsequences, time series motifs, is a problem which has been shown to have great utility for several higher-level data mining algorithms, including classification, clustering, segmentation, forecasting, and rule discovery. In recent years there has been significant research effort spent on efficiently discovering these motifs in static offline databases. However, for many domains, the inherent streaming nature of time series demands online discovery and maintenance of time series motifs. In this paper, we develop the first online motif discovery algorithm which monitors and maintains motifs exactly in real time over the most recent history of a stream. Our algorithm has a worst-case update time which is linear to the window size and is extendible to maintain more complex pattern structures. In contrast, the current offline algorithms either need significant update time or require very costly pre-processing steps which online algorithms simply cannot afford. Our core ideas allow useful extensions of our algorithm to deal with arbitrary data rates and discovering multidimensional motifs. We demonstrate the utility of our algorithms with a variety of case studies in the domains of robotics, acoustic monitoring and online compression.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73669293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Research track 20: evolving and spatial data","authors":"Hui Xiong","doi":"10.1145/3248800","DOIUrl":"https://doi.org/10.1145/3248800","url":null,"abstract":"","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"328 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76365689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}