{"title":"MonetDB/XQuery: a fast XQuery processor powered by a relational engine","authors":"P. Boncz, Torsten Grust, M. V. Keulen, S. Manegold, J. Rittinger, J. Teubner","doi":"10.1145/1142473.1142527","DOIUrl":"https://doi.org/10.1145/1142473.1142527","url":null,"abstract":"Relational XQuery systems try to re-use mature relational data management infrastructures to create fast and scalable XML database technology. This paper describes the main features, key contributions, and lessons learned while implementing such a system. Its architecture consists of (i) a range-based encoding of XML documents into relational tables, (ii) a compilation technique that translates XQuery into a basic relational algebra, (iii) a restricted (order) property-aware peephole relational query optimization strategy, and (iv) a mapping from XML update statements into relational updates. Thus, this system implements all essential XML database functionalities (rather than a single feature) such that we can learn from the full consequences of our architectural decisions. While implementing this system, we had to extend the state-of-the-art with a number of new technical contributions, such as loop-lifted staircase join and efficient relational query evaluation strategies for XQuery theta-joins with existential semantics. These contributions as well as the architectural lessons learned are also deemed valuable for other relational back-end engines. The performance and scalability of the resulting system is evaluated on the XMark benchmark up to data sizes of 11GB. 
The performance section also provides an extensive benchmark comparison of all major XMark results published previously, which confirm that the goal of purely relational XQuery processing, namely speed and scalability, was met.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132685878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAXENT: consistent cardinality estimation in action","authors":"V. Markl, M. Kutsch, T. Tran, P. Haas, N. Megiddo","doi":"10.1145/1142473.1142586","DOIUrl":"https://doi.org/10.1145/1142473.1142586","url":null,"abstract":"When comparing alternative query execution plans (QEPs), a cost-based query optimizer in a relational database management system needs to estimate the selectivity of conjunctive predicates. To avoid inaccurate independence assumptions, modern optimizers try to exploit multivariate statistics (MVS) that provide knowledge about joint frequencies in a table of a relation. Because the complete joint distribution is almost always too large to store, optimizers are given only partial knowledge about this distribution. As a result, there exist multiple, non-equivalent ways to estimate the selectivity of a conjunctive predicate. To consistently combine the partial knowledge during the estimation process, existing optimizers employ cumbersome ad hoc heuristics. These methods unjustifiably ignore valuable information, and the optimizer tends to favor QEPs for which the least information is available. This bias problem yields poor QEP quality and performance. We demonstrate MAXENT, a novel approach based on the maximum entropy principle, prototyped in IBM DB2 LUW. We illustrate MAXENT's ability to consistently estimate the selectivity of conjunctive predicates on a per-table basis. In contrast to the DB2 optimizer's current ad hoc methods, we show how MAXENT exploits all available information about the joint column distribution and thus avoids the bias problem. 
For some complex queries against a real-world database, we show that MAXENT improves selectivity estimates by orders of magnitude relative to the current DB2 optimizer, and also show how these improved estimates influence plan choices as well as query execution times.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114255315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching in time","authors":"Christian Plattner, Andreas Wapf, G. Alonso","doi":"10.1145/1142473.1142578","DOIUrl":"https://doi.org/10.1145/1142473.1142578","url":null,"abstract":"This demonstration shows how to use external databases to provide an efficient implementation of a timetravel service. The timetravel semantics are defined using snapshot isolation. The system presented allows users not only to retrieve older snapshots but also to identify snapshots of interest.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"207 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114841632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data management in the CarTel mobile sensor computing system","authors":"V. Bychkovsky, Kevin Chen, M. Goraczko, Hongyi Hu, Bret Hull, Allen K. L. Miu, E. Shih, Yang Zhang, H. Balakrishnan, S. Madden","doi":"10.1145/1142473.1142569","DOIUrl":"https://doi.org/10.1145/1142473.1142569","url":null,"abstract":"We propose a reusable data management system, called CarTel, for querying and collecting data from intermittently connected devices. CarTel provides a simple, incrementally-deployable platform for developing automobile-based sensor applications. Our platform provides a dynamic query system that allows both continuous (standing) and one-shot geo-spatial queries over car position, speed, and sensory data, as well as a low-cost/high-bandwidth substrate for communicating with a large network of mobile devices.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127294027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personalized privacy preservation","authors":"Yufei Tao, Xiaokui Xiao","doi":"10.1145/1142473.1142500","DOIUrl":"https://doi.org/10.1145/1142473.1142500","url":null,"abstract":"We study generalization for preserving privacy in publication of sensitive data. The existing methods focus on a universal approach that exerts the same amount of preservation for all persons, without catering for their concrete needs. The consequence is that we may be offering insufficient protection to a subset of people, while applying excessive privacy control to another subset. Motivated by this, we present a new generalization framework based on the concept of personalized anonymity. Our technique performs the minimum generalization for satisfying everybody's requirements, and thus, retains the largest amount of information from the microdata. We carry out a careful theoretical study that leads to valuable insight into the behavior of alternative solutions. In particular, our analysis mathematically reveals the circumstances where the previous work fails to protect privacy, and establishes the superiority of the proposed solutions. The theoretical findings are verified with extensive experiments.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"2008 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125621881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generic similarity detection in ontologies with the SOQA-SimPack toolkit","authors":"P. Ziegler, Christoph Kiefer, Christoph Sturm, K. Dittrich, A. Bernstein","doi":"10.1145/1142473.1142577","DOIUrl":"https://doi.org/10.1145/1142473.1142577","url":null,"abstract":"Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different data sources often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. In this demo, we present the SOQA-SimPack Toolkit (SST), an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser that allows users to graphically perform similarity calculations in ontologies.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124056935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast approximate computation of statistics on views","authors":"C. Zuzarte, Xiaohui Yu","doi":"10.1145/1142473.1142564","DOIUrl":"https://doi.org/10.1145/1142473.1142564","url":null,"abstract":"Accurate estimation of the sizes of intermediate query results (cardinality estimation) is of critical importance to plan costing in query optimization. The common practice in current commercial database systems such as IBM DB2 Universal Database (DB2 UDB) is to derive the cardinality estimates from base-table statistics. However, this approach often suffers from simplifying yet unrealistic assumptions that have to be made about the underlying data (for example, different attributes are independently distributed). Ways for exploiting statistics on query expressions (or, statistics on views, or SITs) have been proposed to improve the accuracy of cardinality estimation. We propose a novel method for efficient computation of SITs for joins. In particular, we are concerned with statistics on join queries involving large fact tables and relatively small dimension tables. Rather than materializing the views, we make use of the frequency statistics that are available on the fact tables to obtain an approximate estimate of the statistics on various attributes in the join results. The dimension tables are generally much smaller than the fact table, and therefore we can afford to closely examine the dimension table, while at the same time avoid accessing the fact table. By closely examining the dimension table, we are able to capture the correlations between the attributes in the dimension table as well as the skew and domain range of the fact table join column values. This leads to reasonably accurate statistics on the join result. 
We prototyped this idea as a module on top of DB2 UDB, and our experience shows that employment of this technique results in a very significant speed-up in the computation of SITs, at the expense of only slight degradation in accuracy compared with the full-materialization method.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130661195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speeding up search in peer-to-peer networks with a multi-way tree structure","authors":"H. Jagadish, B. Ooi, K. Tan, Q. Vu, Rong-Juan Zhang","doi":"10.1145/1142473.1142475","DOIUrl":"https://doi.org/10.1145/1142473.1142475","url":null,"abstract":"Peer-to-Peer systems have recently become a popular means to share resources. Effective search is a critical requirement in such systems, and a number of distributed search structures have been proposed in the literature. Most of these structures provide \"log time search\" capability, where the logarithm is taken base 2. That is, in a system with N nodes, the cost of the search is O(log_2 N). In database systems, the importance of large fanout index structures has been well recognized. In P2P search too, the cost could be reduced considerably if this logarithm were taken to a larger base. In this paper, we propose a multi-way tree search structure, which reduces the cost of search to O(log_m N), where m is the fanout. The penalty paid is a larger update cost, but we show how to keep this penalty to be no worse than linear in m. We experimentally explore this tradeoff between search and update cost as a function of m, and suggest how to find a good trade-off point. The multi-way tree structure we propose, BATON*, is derived from the BATON structure that has recently been suggested. 
In addition to multi-way fanout, BATON* also adds support for multi-attribute queries to BATON.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131275742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Programming for XML","authors":"D. Florescu, Donald Kossmann","doi":"10.1145/1142473.1142597","DOIUrl":"https://doi.org/10.1145/1142473.1142597","url":null,"abstract":"There are many emerging applications for XML. Although there are many tools available, an open question is the right programming paradigm to process XML data. Today, the most popular solutions are based on extensions to existing programming languages (e.g., Java, Python or PHP) with XML-specific libraries and APIs. Such libraries either represent the XML data as a virtual tree, or they read the XML data in a streaming (push or pull) fashion. This approach has the obvious problems that arise from the impedance mismatch between the XML type system and the type system of the host language. Moreover, the code written in such programming languages cannot be (easily) optimized using traditional techniques; good performance, scalability, and service-level guarantees are difficult to achieve for such programs on large datasets. Recently, several proposals for new programming languages have been made in both industry and the research community. One prominent example is Microsoft's XLinQ language. Another prominent example of XML processing in Web-based applications is AJAX (Asynchronous JavaScript and XML). In academia, XL, XStatic, Links, and several other languages have been proposed. All these solutions follow different philosophies and address critical design questions in different ways. This tutorial gives an overview of the current generation of programming languages for data-intensive XML applications. Furthermore, this tutorial compares the possible solutions based on a few comparative practical criteria. 
The tutorial shows how each solution addresses the design questions in different ways and describes the tradeoffs among these languages in terms of capabilities and optimizability.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122359287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VisTrails: visualization meets data management","authors":"Steven P. Callahan, J. Freire, E. Santos, C. Scheidegger, Cláudio T. Silva, H. Vo","doi":"10.1145/1142473.1142574","DOIUrl":"https://doi.org/10.1145/1142473.1142574","url":null,"abstract":"Scientists are now faced with an incredible volume of data to analyze. To successfully analyze and validate various hypotheses, it is necessary to pose several queries, correlate disparate data, and create insightful visualizations of both the simulated processes and observed phenomena. Often, insight comes from comparing the results of multiple visualizations. Unfortunately, today this process is far from interactive and contains many error-prone and time-consuming tasks. As a result, the generation and maintenance of visualizations is a major bottleneck in the scientific process, hindering both the ability to mine scientific data and the actual use of the data. The VisTrails system represents our initial attempt to improve the scientific discovery process and reduce the time to insight. In VisTrails, we address the problem of visualization from a data management perspective: VisTrails manages the data and metadata of a visualization product. 
In this demonstration, we show the power and flexibility of our system by presenting actual scenarios in which scientific visualization is used and showing how our system improves usability, enables reproducibility, and greatly reduces the time required to create scientific visualizations.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124589396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}