Query processing techniques for solid state drives
Dimitris Tsirogiannis, S. Harizopoulos, Mehul A. Shah, J. Wiener, G. Graefe
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09). DOI: 10.1145/1559845.1559854

Abstract: Solid state drives perform random reads more than 100x faster than traditional magnetic hard disks, while offering comparable sequential read and write bandwidth. Because of their potential to speed up applications, as well as their reduced power consumption, these new drives are expected to gradually replace hard disks as the primary permanent storage media in large data centers. However, although they may immediately benefit applications that stress random reads, they may not improve database applications, especially those running long data analysis queries. Database query processing engines have been designed around the speed mismatch between random and sequential I/O on hard disks, and their algorithms currently emphasize sequential accesses for disk-resident data. In this paper, we investigate data structures and algorithms that leverage fast random reads to speed up selection, projection, and join operations in relational query processing. We first demonstrate how a column-based layout within each page reduces the amount of data read during selections and projections. We then introduce FlashJoin, a general pipelined join algorithm that minimizes accesses to base and intermediate relational data. FlashJoin's binary join kernel accesses only the join attributes, producing partial results in the form of a join index. Subsequently, its fetch kernel retrieves the attributes for later nodes in the query plan as they are needed. FlashJoin significantly reduces memory and I/O requirements for each join in the query. We implemented these techniques inside Postgres and experimented with an enterprise SSD. Our techniques improved query runtimes by up to 6x for queries ranging from simple relational scans and joins to full TPC-H queries.

{"title":"Monitoring path nearest neighbor in road networks","authors":"Zaiben Chen, Heng Tao Shen, Xiaofang Zhou, J. Yu","doi":"10.1145/1559845.1559907","DOIUrl":"https://doi.org/10.1145/1559845.1559907","url":null,"abstract":"This paper addresses the problem of monitoring the k nearest neighbors to a dynamically changing path in road networks. Given a destination where a user is going to, this new query returns the k-NN with respect to the shortest path connecting the destination and the user's current location, and thus provides a list of nearest candidates for reference by considering the whole coming journey. We name this query the k-Path Nearest Neighbor query (k-PNN). As the user is moving and may not always follow the shortest path, the query path keeps changing. The challenge of monitoring the k-PNN for an arbitrarily moving user is to dynamically determine the update locations and then refresh the k-PNN efficiently. We propose a three-phase Best-first Network Expansion (BNE) algorithm for monitoring the k-PNN and the corresponding shortest path. In the searching phase, the BNE finds the shortest path to the destination, during which a candidate set that guarantees to include the k-PNN is generated at the same time. Then in the verification phase, a heuristic algorithm runs for examining candidates' exact distances to the query path, and it achieves significant reduction in the number of visited nodes. The monitoring phase deals with computing update locations as well as refreshing the k-PNN in different user movements. Since determining the network distance is a costly process, an expansion tree and the candidate set are carefully maintained by the BNE algorithm, which can provide efficient update on the shortest path and the k-PNN results. Finally, we conduct extensive experiments on real road networks and show that our methods achieve satisfactory performance.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"40 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121010066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stream warehousing with DataDepot
Lukasz Golab, T. Johnson, J. Spencer Seidel, Vladislav Shkapenyuk
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09). DOI: 10.1145/1559845.1559934

Abstract: We describe DataDepot, a tool for generating warehouses from streaming data feeds, such as network-traffic traces, router alerts, financial tickers, and transaction logs. DataDepot is a streaming data warehouse designed to automate the ingestion of streaming data from a wide variety of sources and to maintain complex materialized views over these sources. As a streaming warehouse, DataDepot is similar to Data Stream Management Systems (DSMSs) in its emphasis on temporal data, best-effort consistency, and real-time response. However, as a data warehouse, DataDepot is designed to store tens to hundreds of terabytes of historical data, to allow time windows measured in years or decades, and to support both real-time queries on recent data and deep analyses on historical data. In this paper we discuss the DataDepot architecture, with an emphasis on several of its novel and critical features. DataDepot is currently being used for five very large warehousing projects within AT&T; one of these warehouses ingests 500 Mbytes per minute (and is growing). We use these installations to illustrate streaming warehouse use and behavior, and the design choices made in developing DataDepot. We conclude with a discussion of DataDepot applications and the efficacy of some optimizations.

{"title":"Session details: Research session 3: information extraction","authors":"Mirek Riedewal","doi":"10.1145/3257451","DOIUrl":"https://doi.org/10.1145/3257451","url":null,"abstract":"","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134094062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Database research in computer games
A. Demers, J. Gehrke, Christoph E. Koch, B. Sowell, Walker M. White
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09). DOI: 10.1145/1559845.1559967

Abstract: This tutorial presents an overview of the data management issues faced by computer games today. While many games do not use databases directly, they still have to process large amounts of data, and could benefit from the application of database technology. Other games, such as massively multiplayer online games (MMOs), must communicate with commercial databases and have their own unique challenges. In this tutorial we will present the state-of-the-art of data management in games that we learned from our interaction with various game studios. We will show how the issues involved motivate current research, and illustrate several possibilities for future work. Our tutorial will start with a description of data-driven design, which is the source of many of the data management issues that games face. We will show some of the tools that game developers use to create and manage content. We will discuss how this type of design can affect performance, and the data structures and techniques that developers use to ensure that the game is responsive. We will discuss the problem of consistency in games, and how games ensure that players all share the same view of the world. Finally, we will examine some of the engineering issues that game developers have to deal with when interacting with traditional databases. This tutorial is intended to be self-contained, and provides the background necessary for understanding how databases and database technology are relevant to computer games. This tutorial is accessible to students and researchers who, while perhaps not hardcore gamers themselves, are interested in ways in which they can use their expertise to solve problems in computer games.

Answering web queries using structured data sources
Stelios Paparizos, A. Ntoulas, J. Shafer, R. Agrawal
In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09). DOI: 10.1145/1559845.1560000

Abstract: In web search today, a user types a few keywords, which are then matched against a large collection of unstructured web pages. This leaves much to be desired when the best answer to a query is contained in structured data stores, or when the user includes structural semantics in the query. In our work, we include information from structured data sources in web results. Such sources range from fully relational DBs to flat tables and XML files. In addition, we take advantage of information in such sources to automatically extract the corresponding semantics from the query and use them to improve the overall relevance of results. For this demonstration, we show how we effectively capture, annotate, and translate web user queries such as 'popular digital camera around $425', returning results from a shopping-like DB.

{"title":"Cost based plan selection for xpath","authors":"H. Georgiadis, M. Charalambides, V. Vassalos","doi":"10.1145/1559845.1559909","DOIUrl":"https://doi.org/10.1145/1559845.1559909","url":null,"abstract":"We present a complete XPath cost-based optimization and execution framework and demonstrate its effectiveness and efficiency for a variety of queries and datasets. The framework is based on a logical XPath algebra with novel features and operators and a comprehensive set of rewriting rules that together enable us to algebraically capture many existing and novel processing strategies for XPath queries. An important part of the framework is PSA, a very efficient cost-based plan selection algorithm for XPath queries. In the presented experimental evaluation, PSA picked the cheapest estimated query plan in 100% of the cases. Our cost-based query optimizer independent of the underlying physical data model and storage system and of the available logical operator implementations, depending on a set of well-defined APIs. We also present an implementation of those APIs, including primitive access methods, a large pool of physical operators, statistics estimators and cost models, and experimentally demonstrate the effectiveness of our end-to-end query optimization system.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132315809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extreme visualisation of query optimizer search space","authors":"A. Nica, D. Brotherston, David William Hillis","doi":"10.1145/1559845.1559983","DOIUrl":"https://doi.org/10.1145/1559845.1559983","url":null,"abstract":"This demonstration showcases a system for visualizing and analyzing search spaces generated by the SQL Anywhere optimizer during the optimization process of a SQL statement. SQL Anywhere dynamically optimizes each statement every time it is executed. The decisions made by the optimizer during the optimization process are both cost-based and heuristics adapted to the current state of the server and the database instance. Many performance issues can be understood and resolved by analyzing the search space generated when optimizing a certain request. In our experience, there are two main classes of performance issues related to the decisions made by a query optimizer:(1) a request is very slow due to a suboptimal access plan; and (2) a request has a different, less optimal access plan than a previous execution. We have enhanced SQL Anywhere to log, in a very compact format, its search space during the optimization process when tracing mode is on. These search space logs can be used for performance analysis in the absence of the database instances or of extra information about the SQL Anywhere server state at the time the logs were generated. This demonstration introduces the SearchSpaceAnalyzer System, a research prototype used to analyze the search spaces of the SQL Anywhere optimizer. The system visualizes and analyzes (1) a single search space and (2) the differences between two search spaces generated for the same query by two different optimization processes. The SearchSpaceAnalyze System can be used for the analysis of any query optimizer search spaces as long as the logged data is recorded using the syntax understood by the system.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121462293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-organizing tuple reconstruction in column-stores","authors":"Stratos Idreos, M. Kersten, S. Manegold","doi":"10.1145/1559845.1559878","DOIUrl":"https://doi.org/10.1145/1559845.1559878","url":null,"abstract":"Column-stores gained popularity as a promising physical design alternative. Each attribute of a relation is physically stored as a separate column allowing queries to load only the required attributes. The overhead incurred is on-the-fly tuple reconstruction for multi-attribute queries. Each tuple reconstruction is a join of two columns based on tuple IDs, making it a significant cost component. The ultimate physical design is to have multiple presorted copies of each base table such that tuples are already appropriately organized in multiple different orders across the various columns. This requires the ability to predict the workload, idle time to prepare, and infrequent updates. In this paper, we propose a novel design, partial sideways cracking, that minimizes the tuple reconstruction cost in a self-organizing way. It achieves performance similar to using presorted data, but without requiring the heavy initial presorting step itself. Instead, it handles dynamic, unpredictable workloads with no idle time and frequent updates. Auxiliary dynamic data structures, called cracker maps, provide a direct mapping between pairs of attributes used together in queries for tuple reconstruction. A map is continuously physically reorganized as an integral part of query evaluation, providing faster and reduced data access for future queries. To enable flexible and self-organizing behavior in storage-limited environments, maps are materialized only partially as demanded by the workload. Each map is a collection of separate chunks that are individually reorganized, dropped or recreated as needed. We implemented partial sideways cracking in an open-source column-store. A detailed experimental analysis demonstrates that it brings significant performance benefits for multi-attribute queries.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121476561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Research session 10: probabilistic databases I","authors":"Jun Yang","doi":"10.1145/3257458","DOIUrl":"https://doi.org/10.1145/3257458","url":null,"abstract":"","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"152 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116775212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}