{"title":"Continuous query processing in data streams using duality of data and queries","authors":"Hyo-Sang Lim, Jae-Gil Lee, Min-Jae Lee, K. Whang, I. Song","doi":"10.1145/1142473.1142509","DOIUrl":"https://doi.org/10.1145/1142473.1142509","url":null,"abstract":"Recent data stream systems such as TelegraphCQ have employed the well-known property of duality between data and queries. In these systems, query processing methods are classified into two dual categories -- data-initiative and query-initiative -- depending on whether query processing is initiated by selecting a data element or a query. Although the duality property has been widely recognized, previous data stream systems do not take full advantage of this property since they use the two dual methods independently: data-initiative methods only for continuous queries and query-initiative methods only for ad-hoc queries. We contend that continuous query processing can be better optimized by adopting an approach that integrates the two dual methods. Our primary contribution is based on the observation that spatial join is a powerful tool for achieving this objective. In this paper, we first present a new viewpoint of transforming the continuous query processing problem to a multi-dimensional spatial join problem. We then present a continuous query processing algorithm based on spatial join, which we name Spatial Join CQ. This algorithm processes continuous queries by finding the pairs of overlapping regions from a set of data elements and a set of queries, both defined as regions in the multi-dimensional space. The algorithm achieves the advantages of the two dual methods simultaneously. 
Experimental results show that the proposed algorithm outperforms earlier algorithms by up to 36 times for simple selection continuous queries and by up to 7 times for sliding window join queries.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128323633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication-efficient distributed monitoring of thresholded counts","authors":"Ram Keralapura, Graham Cormode, J. Ramamirtham","doi":"10.1145/1142473.1142507","DOIUrl":"https://doi.org/10.1145/1142473.1142507","url":null,"abstract":"Monitoring is an issue of primary concern in current and next generation networked systems. For example, the objective of sensor networks is to monitor their surroundings for a variety of different applications like atmospheric conditions, wildlife behavior, and troop movements among others. Similarly, monitoring in data networks is critical not only for accounting and management, but also for detecting anomalies and attacks. Such monitoring applications are inherently continuous and distributed, and must be designed to minimize the communication overhead that they introduce. In this context we introduce and study a fundamental class of problems called \"thresholded counts\" where we must return the aggregate frequency count of an event that is continuously monitored by distributed nodes with a user-specified accuracy whenever the actual count exceeds a given threshold value. In this paper we propose to address the problem of thresholded counts by setting local thresholds at each monitoring node and initiating communication only when the locally observed data exceeds these local thresholds. We explore algorithms in two categories: static and adaptive thresholds. In the static case, we consider thresholds based on a linear combination of two alternate strategies, and show that there exists an optimal blend of the two strategies that results in minimum communication overhead. We further show that this optimal blend can be found using a steepest descent search. In the adaptive case, we propose algorithms that adjust the local thresholds based on the observed distributions of updated information. 
We use extensive simulations not only to verify the accuracy of our algorithms and validate our theoretical results, but also to evaluate the performance of our algorithms. We find that both approaches yield significant savings over the naive approach of centralized processing.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"182 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124383508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trends in high performance analytics","authors":"Yossi Matias","doi":"10.1145/1142473.1142559","DOIUrl":"https://doi.org/10.1145/1142473.1142559","url":null,"abstract":"With the proliferation of analytic and business intelligence applications, and with the persistent growth in data sizes, there is an ever increasing need to support high performance analytics. This talk will present recent technological trends in addressing this need, and will particularly highlight the approach of facilitating high performance analytics in a relational database via a novel dichotomous combination with a non-relational aggregation-server.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131306901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast range-summable random variables for efficient aggregate estimation","authors":"Florin Rusu, A. Dobra","doi":"10.1145/1142473.1142496","DOIUrl":"https://doi.org/10.1145/1142473.1142496","url":null,"abstract":"Exact computation for aggregate queries usually requires large amounts of memory - constrained in data-streaming - or communication - constrained in distributed computation - and large processing times. In this situation, approximation techniques with provable guarantees, like sketches, are the only viable solution. The performance of sketches crucially depends on the ability to efficiently generate particular pseudo-random numbers. In this paper we investigate both theoretically and empirically the problem of generating k-wise independent pseudo-random numbers and, in particular, that of generating 3 and 4-wise independent pseudo-random numbers that are fast range-summable (i.e., they can be summed up in sub-linear time). Our specific contributions are: (a) we provide an empirical comparison of the various pseudo-random number generating schemes, (b) we study both theoretically and empirically the fast range-summation practicality for the 3 and 4-wise independent generating schemes and we provide efficient implementations for the 3-wise independent schemes, (c) we show convincing theoretical and empirical evidence that the extended Hamming scheme performs as well as any 4-wise independent scheme for estimating the size of join using AMS-sketches, even though it is only 3-wise independent. 
We use this generating scheme to produce estimators that significantly out-perform the state-of-the-art solutions for two problems - size of spatial joins and selectivity estimation.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"145 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132098153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient reverse k-nearest neighbor search in arbitrary metric spaces","authors":"Elke Achtert, C. Böhm, Peer Kröger, Peter Kunath, A. Pryakhin, M. Renz","doi":"10.1145/1142473.1142531","DOIUrl":"https://doi.org/10.1145/1142473.1142531","url":null,"abstract":"The reverse k-nearest neighbor (RkNN) problem, i.e. finding all objects in a data set the k-nearest neighbors of which include a specified query object, is a generalization of the reverse 1-nearest neighbor problem which has received increasing attention recently. Many industrial and scientific applications call for solutions of the RkNN problem in arbitrary metric spaces where the data objects are not Euclidean and only a metric distance function is given for specifying object similarity. Usually, these applications need a solution for the generalized problem where the value of k is not known in advance and may change from query to query. However, existing approaches, except one, are designed for the specific R1NN problem. In addition - to the best of our knowledge - all previously proposed methods, especially the one for generalized RkNN search, are only applicable to Euclidean vector data but not for general metric objects. In this paper, we propose the first approach for efficient RkNN search in arbitrary metric spaces where the value of k is specified at query time. Our approach uses the advantages of existing metric index structures but proposes to use conservative and progressive distance approximations in order to filter out true drops and true hits. In particular, we approximate the k-nearest neighbor distance for each data object by upper and lower bounds using two functions of only two parameters each. Thus, our method does not generate any considerable storage overhead. 
We show in a broad experimental evaluation on real-world data the scalability and the usability of our novel approach.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133000502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating compression and execution in column-oriented database systems","authors":"D. Abadi, S. Madden, Miguel Ferreira","doi":"10.1145/1142473.1142548","DOIUrl":"https://doi.org/10.1145/1142473.1142548","url":null,"abstract":"Column-oriented database system architectures invite a re-evaluation of how and when data in databases is compressed. Storing data in a column-oriented fashion greatly increases the similarity of adjacent records on disk and thus opportunities for compression. The ability to compress many adjacent tuples at once lowers the per-tuple cost of compression, both in terms of CPU and space overheads. In this paper, we discuss how we extended C-Store (a column-oriented DBMS) with a compression sub-system. We show how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems. We then evaluate a set of compression schemes and show that the best scheme depends not only on the properties of the data but also on the nature of the query workload.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134421203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COLT: continuous on-line tuning","authors":"Karl Schnaitter, S. Abiteboul, T. Milo, N. Polyzotis","doi":"10.1145/1142473.1142592","DOIUrl":"https://doi.org/10.1145/1142473.1142592","url":null,"abstract":"The physical schema of a database plays a critical role in performance. Self-tuning is a cost-effective and elegant solution to optimize the physical configuration for the characteristics of the query load. Existing techniques operate in an off-line fashion, by choosing a fixed configuration that is tailored to a subset of the query load. The generated configurations therefore ignore any temporal patterns that may exist in the actual load submitted to the system. This demonstration introduces COLT (Continuous On-Line Tuning), a novel self-tuning framework that continuously monitors the incoming queries and adjusts the system configuration in order to maximize query performance. The key idea behind COLT is to gather performance statistics at different levels of detail and to carefully allocate profiling resources to the most promising candidate configurations. Moreover, COLT uses effective heuristics to regulate its own performance, lowering its overhead when the system is well-tuned, and being more aggressive when the workload shifts and it becomes necessary to re-tune the system. We present a specialization of COLT to the important problem of selecting an effective set of relational indices for the current query load. 
Our demonstration will use an implementation of our proposed framework in the PostgreSQL database system, showing the internal operation of COLT and the adaptive selection of indices as we vary the query load of the server.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133761318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identity resolution: 23 years of practical experience and observations at scale","authors":"Jeff Jonas","doi":"10.1145/1142473.1142556","DOIUrl":"https://doi.org/10.1145/1142473.1142556","url":null,"abstract":"Identity Resolution is a semantic reconciliation activity as applied to people and organizations. Identity resolution is most frequently quantified in terms of accuracy (false positives and false negatives); however, there are additional metrics by which to evaluate identity resolution algorithms including: methodology, persistence, streaming versus batch, data survivorship, operationalizing historical data, transaction/window size, ingestion speed, end-to-end latency, sequence neutrality, handling of ambiguous conditions, reconcilability, scalability, sustainability, and operational characteristics at scale. In addition, a technique for \"analytics in the anonymized data space\" will be presented that makes it possible to resolve identities in a more privacy-preserving manner.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130737362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Injecting utility into anonymized datasets","authors":"Daniel Kifer, J. Gehrke","doi":"10.1145/1142473.1142499","DOIUrl":"https://doi.org/10.1145/1142473.1142499","url":null,"abstract":"Limiting disclosure in data publishing requires a careful balance between privacy and utility. Information about individuals must not be revealed, but a dataset should still be useful for studying the characteristics of a population. Privacy requirements such as k-anonymity and l-diversity are designed to thwart attacks that attempt to identify individuals in the data and to discover their sensitive information. On the other hand, the utility of such data has not been well-studied. In this paper we will discuss the shortcomings of current heuristic approaches to measuring utility and we will introduce a formal approach to measuring utility. Armed with this utility metric, we will show how to inject additional information into k-anonymous and l-diverse tables. This information has an intuitive semantic meaning, it increases the utility beyond what is possible in the original k-anonymity and l-diversity frameworks, and it maintains the privacy guarantees of k-anonymity and l-diversity.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115715043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DADA: a data cube for dominant relationship analysis","authors":"Cuiping Li, B. Ooi, A. Tung, Shan Wang","doi":"10.1145/1142473.1142547","DOIUrl":"https://doi.org/10.1145/1142473.1142547","url":null,"abstract":"The concept of dominance has recently attracted much interest in the context of skyline computation. Given an N-dimensional data set S, a point p is said to dominate q if p is better than q in at least one dimension and equal to or better than it in the remaining dimensions. In this paper, we propose extending the concept of dominance for business analysis from a microeconomic perspective. More specifically, we propose a new form of analysis, called Dominant Relationship Analysis (DRA), which aims to provide insight into the dominant relationships between products and potential buyers. By analyzing such relationships, companies can position their products more effectively while remaining profitable. To support DRA, we propose a novel data cube called DADA (Data Cube for Dominant Relationship Analysis), which captures the dominant relationships between products and customers. Three types of queries called Dominant Relationship Queries (DRQs) are consequently proposed for analysis purposes: 1) Linear Optimization Queries (LOQ), 2) Subspace Analysis Queries (SAQ), and 3) Comparative Dominant Queries (CDQ). Algorithms are designed for efficient computation of DADA and answering the DRQs using DADA. 
Results of our comprehensive experiments show the effectiveness and efficiency of DADA and its associated query processing strategies.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114246028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}