{"title":"Quantiles over data streams: an experimental study","authors":"Lu Wang, Ge Luo, K. Yi, Graham Cormode","doi":"10.1145/2463676.2465312","DOIUrl":"https://doi.org/10.1145/2463676.2465312","url":null,"abstract":"A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data is described incrementally, and we must compute the quantiles in an online, streaming fashion. Yet while such algorithms have proved to be tremendously useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods, and describe efficient implementations. In doing so, we propose and analyze variations that have not been explicitly studied before, yet which turn out to perform the best. To illustrate this, we provide detailed experimental comparisons demonstrating the tradeoffs between space, time, and accuracy for quantile computation.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"96 1","pages":"737-748"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75928874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized scale independence through incremental precomputation","authors":"Michael Armbrust, Eric Liang, Tim Kraska, A. Fox, M. Franklin, D. Patterson","doi":"10.1145/2463676.2465333","DOIUrl":"https://doi.org/10.1145/2463676.2465333","url":null,"abstract":"Developers of rapidly growing applications must be able to anticipate potential scalability problems before they cause performance issues in production environments. A new type of data independence, called scale independence, seeks to address this challenge by guaranteeing a bounded amount of work is required to execute all queries in an application, independent of the size of the underlying data. While optimization strategies have been developed to provide these guarantees for the class of queries that are scale-independent when executed using simple indexes, there are important queries for which such techniques are insufficient. Executing these more complex queries scale-independently requires precomputation using incrementally-maintained materialized views. However, since this precomputation effectively shifts some of the query processing burden from execution time to insertion time, a scale-independent system must be careful to ensure that storage and maintenance costs do not threaten scalability. In this paper, we describe a scale-independent view selection and maintenance system, which uses novel static analysis techniques that ensure that created views do not themselves become scaling bottlenecks. Finally, we present an empirical analysis that includes all the queries from the TPC-W benchmark and validates our implementation's ability to maintain nearly constant high-quantile query and update latency even as an application scales to hundreds of machines.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"195 1","pages":"625-636"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78060049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PrivGene: differentially private model fitting using genetic algorithms","authors":"Jun Zhang, Xiaokui Xiao, Y. Yang, Zhenjie Zhang, M. Winslett","doi":"10.1145/2463676.2465330","DOIUrl":"https://doi.org/10.1145/2463676.2465330","url":null,"abstract":"epsilon-differential privacy is rapidly emerging as the state-of-the-art scheme for protecting individuals' privacy in published analysis results over sensitive data. The main idea is to perform random perturbations on the analysis results, such that any individual's presence in the data has negligible impact on the randomized results. This paper focuses on analysis tasks that involve model fitting, i.e., finding the parameters of a statistical model that best fit the dataset. For such tasks, the quality of the differentially private results depends upon both the effectiveness of the model fitting algorithm, and the amount of perturbations required to satisfy the privacy guarantees. Most previous studies start from a state-of-the-art, non-private model fitting algorithm, and develop a differentially private version. Unfortunately, many model fitting algorithms require intensive perturbations to satisfy epsilon-differential privacy, leading to poor overall result quality. Motivated by this, we propose PrivGene, a general-purpose differentially private model fitting solution based on genetic algorithms (GA). PrivGene needs significantly less perturbations than previous methods, and it achieves higher overall result quality, even for model fitting tasks where GA is not the first choice without privacy considerations. Further, PrivGene performs the random perturbations using a novel technique called the enhanced exponential mechanism, which improves over the exponential mechanism by exploiting the special properties of model fitting tasks. As case studies, we apply PrivGene to three common analysis tasks involving model fitting: logistic regression, SVM classification, and k-means clustering. Extensive experiments using real data confirm the high result quality of PrivGene, and its superiority over existing methods.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"33 1","pages":"665-676"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82028188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STEM: a spatio-temporal miner for bursty activity","authors":"Theodoros Lappas, Marcos R. Vieira, D. Gunopulos, V. Tsotras","doi":"10.1145/2463676.2463688","DOIUrl":"https://doi.org/10.1145/2463676.2463688","url":null,"abstract":"Burst identification has been extensively studied in the context of document streams, where a burst is generally exhibited when an unusually high frequency is observed for a term t. Previous works have focused exclusively on either temporal or spatial burstiness patterns. The former represents bursty timeframes within a single stream, while the latter characterizes sets of streams that simultaneously exhibited a bursty behavior for a user-specified timeframe. Our previous work was the first to study the spatiotemporal burstiness of terms. In this context, a burstiness pattern consists of both a timeframe and a set of streams, both of which need to be identified automatically. In this paper we describe STEM (Spatio-TEmporal Miner), a system for finding spatiotemporal burstiness patterns in a collection of spatially distributed frequency streams. STEM implements the full functionality required to mine spatiotemporal burstiness patterns from virtually any collection of geostamped streams. Examples of such collections include document streams (e.g. online newspapers), geo-aware microblogging platforms (e.g. Twitter). This paper describes the STEM system and discusses how its features can be accessed via a user-friendly interface.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"40 1","pages":"1021-1024"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80608131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Query containment in entity SQL","authors":"Guillem Rull, P. Bernstein, I. Santos, Yannis Katsis, S. Melnik, Ernest Teniente","doi":"10.1145/2463676.2463711","DOIUrl":"https://doi.org/10.1145/2463676.2463711","url":null,"abstract":"We describe a software architecture we have developed for a constructive containment checker of Entity SQL queries defined over extended ER schemas expressed in Microsoft's Entity Data Model. Our application of interest is compilation of object-to-relational mappings for Microsoft's ADO.NET Entity Framework, which has been shipping since 2007. The supported language includes several features which have been individually addressed in the past but, to the best of our knowledge, they have not been addressed all at once before. Moreover, when embarking on an implementation, we found no guidance in the literature on how to modularize the software or apply published algorithms to a commercially-supported language. This paper reports on our experience in addressing these real-world challenges.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"115 1","pages":"1169-1172"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80764528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Query processing on smart SSDs: opportunities and challenges","authors":"Jaeyoung Do, Yang-Suk Kee, J. Patel, Chanik Park, Kwanghyun Park, D. DeWitt","doi":"10.1145/2463676.2465295","DOIUrl":"https://doi.org/10.1145/2463676.2465295","url":null,"abstract":"Data storage devices are getting \"smarter.\" Smart Flash storage devices (a.k.a. \"Smart SSD\") are on the horizon and will package CPU processing and DRAM storage inside a Smart SSD, and make that available to run user programs inside a Smart SSD. The focus of this paper is on exploring the opportunities and challenges associated with exploiting this functionality of Smart SSDs for relational analytic query processing. We have implemented an initial prototype of Microsoft SQL Server running on a Samsung Smart SSD. Our results demonstrate that significant performance and energy gains can be achieved by pushing selected query processing components inside the Smart SSDs. We also identify various changes that SSD device manufacturers can make to increase the benefits of using Smart SSDs for data processing applications, and also suggest possible research opportunities for the database community.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"20 3 1","pages":"1221-1230"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78311602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CTrace: semantic comparison of multi-granularity process traces","authors":"Qing Liu, K. Taylor, Xiang Zhao, G. Squire, Xuemin Lin, C. Kloppers, Richard Miller","doi":"10.1145/2463676.2465268","DOIUrl":"https://doi.org/10.1145/2463676.2465268","url":null,"abstract":"A process trace describes the processes taken in a workflow to generate a particular result. Given many process traces, each with a large amount of very low level information, it is a challenge to make process traces meaningful to different users. It is more challenging to compare two complex process traces generated by heterogenous systems and have different levels of granularity. We present CTrace, a system that (1) lets users explore the conceptual abstraction of large process traces with different levels of granularity, and (2) provides semantic comparison among traces in which both the structural and the semantic similarity are considered. The above functions are underpinned by a novel notion of multi-granularity process trace and efficient multi-granularity similarity comparison algorithms.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"8 1","pages":"1121-1124"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78572665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Managing database technology at enterprise scale","authors":"P. Yaron","doi":"10.1145/2463676.2486083","DOIUrl":"https://doi.org/10.1145/2463676.2486083","url":null,"abstract":"Paul Yaron is responsible for Non-Mainframe, Relational Database Architecture, Engineering and Strategy for JPMC globally. JP Morgan is a leading financial services firm with assets over $2 trillion, operates 40 major datacenters around the globe, servicing over 60 countries with over 250,000 employees. It partners with 170 regulators and manages 230 Petabytes of data, JPMC depends on over 23,000 database instances to service multiple business units. With a deployment of such scope, JPMC leverages solutions from most major database, security and operating system vendors. This talk will discuss the challenges and strategies of managing the evolving ecosystem of \"all data\", from information security, to internal virtualization strategies. Engineering reliable globally scalable and compliant data management solutions demands a model for proactively measuring the risk complexity of an ecosystem for expert focus and potential proactive remediation. The research for quantitative measurement of database (or other) ecosystem entropy appears sparse. JPMC is looking to share its ideas in this space with the academic community as the need for such quantitative measures are increasingly important as ecosystems move from islands of single tenant risk into multi-tenant risk clusters.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"13 1","pages":"919-920"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77855682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quality and efficiency for kernel density estimates in large data","authors":"Yan Zheng, Jeffrey Jestes, J. M. Phillips, Feifei Li","doi":"10.1145/2463676.2465319","DOIUrl":"https://doi.org/10.1145/2463676.2465319","url":null,"abstract":"Kernel density estimates are important for a broad variety of applications. Their construction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"55 1","pages":"433-444"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86921925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EBM: an entropy-based model to infer social strength from spatiotemporal data","authors":"Huy Pham, C. Shahabi, Yan Liu","doi":"10.1145/2463676.2465301","DOIUrl":"https://doi.org/10.1145/2463676.2465301","url":null,"abstract":"The ubiquity of mobile devices and the popularity of location-based-services have generated, for the first time, rich datasets of people's location information at a very high fidelity. These location datasets can be used to study people's behavior - for example, social studies have shown that people, who are seen together frequently at the same place and at the same time, are most probably socially related. In this paper, we are interested in inferring these social connections by analyzing people's location information, which is useful in a variety of application domains from sales and marketing to intelligence analysis. In particular, we propose an entropy-based model (EBM) that not only infers social connections but also estimates the strength of social connections by analyzing people's co-occurrences in space and time. We examine two independent ways: diversity and weighted frequency, through which co-occurrences contribute to social strength. In addition, we take the characteristics of each location into consideration in order to compensate for cases where only limited location information is available. We conducted extensive sets of experiments with real-world datasets including both people's location data and their social connections, where we used the latter as the ground-truth to verify the results of applying our approach to the former. We show that our approach outperforms the competitors.","PeriodicalId":87344,"journal":{"name":"Proceedings. ACM-SIGMOD International Conference on Management of Data","volume":"14 1","pages":"265-276"},"PeriodicalIF":0.0,"publicationDate":"2013-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86860599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}