{"title":"Extending database task schedulers for multi-threaded application code","authors":"Florian Wolf, Iraklis Psaroudakis, Norman May, A. Ailamaki, K. Sattler","doi":"10.1145/2791347.2791379","DOIUrl":"https://doi.org/10.1145/2791347.2791379","url":null,"abstract":"Modern databases can run application logic defined in stored procedures inside the database server to improve application speed. The SQL standard specifies how to call external stored routines implemented in programming languages, such as C, C++, or JAVA, to complement declarative SQL-based application logic. This is beneficial for scientific and analytical algorithms because they are usually too complex to be implemented entirely in SQL. At the same time, database applications like matrix calculations or data mining algorithms benefit from multi-threading to parallelize compute-intensive operations. Multi-threaded application code, however, introduces a resource competition between the threads of applications and the threads of the database task scheduler. In this paper, we show that multi-threaded application code can render the database's workload scheduling ineffective and decrease the core throughput of the database by up to 50%. We present a general approach to address this issue by integrating shared memory programming solutions into the task schedulers of databases. In particular, we describe the integration of OpenMP into databases. We implement and evaluate our approach using SAP HANA. 
Our experiments show that our integration does not introduce overhead, and can improve the throughput of core database operations by up to 15%.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133774289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
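The integration described above is HANA- and OpenMP-specific, but the underlying idea — routing application parallelism through the database's own task scheduler instead of letting application code spawn competing threads — can be sketched in a few lines. This is a minimal conceptual sketch, not the paper's implementation; the names `DB_SCHEDULER` and `parallel_for` are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# A single shared pool stands in for the database's task scheduler.
# Application ("stored procedure") code submits its parallel work here
# instead of creating its own threads, so database tasks and application
# tasks never oversubscribe the available cores.
DB_SCHEDULER = ThreadPoolExecutor(max_workers=4)

def parallel_for(chunks, body):
    """OpenMP-style parallel-for, routed through the shared scheduler."""
    futures = [DB_SCHEDULER.submit(body, chunk) for chunk in chunks]
    return [f.result() for f in futures]

# e.g. a multi-threaded stored procedure summing partitions of a column
partials = parallel_for([range(0, 5), range(5, 10)], sum)
total = sum(partials)
```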
{"title":"Top-k representative queries with binary constraints","authors":"Arijit Khan, Vishwakarma Singh","doi":"10.1145/2791347.2791367","DOIUrl":"https://doi.org/10.1145/2791347.2791367","url":null,"abstract":"Given a collection of binary constraints that categorize whether a data object is relevant or not, we consider the problem of online retrieval of the top-k objects that best represent all other relevant objects in the underlying dataset. Such top-k representative queries naturally arise in a wide range of complex data analytic applications including advertisement, search, and recommendation. In this paper, we aim to identify the top-k representative objects that are high-scoring, satisfy diverse subsets of the given binary constraints, and are representative of various other relevant objects in the dataset. We formulate our problem with the well-established notion of the top-k representative skylines, and we show that the problem is NP-hard. Hence, we design efficient techniques to solve our problem with theoretical performance guarantees. As a by-product of our algorithm, we also improve the asymptotic time-complexity of skyline computation to log-linear time in the number of data points when all dimensions except one are binary in nature. 
Our empirical results attest that the proposed method efficiently finds high-quality top-k representative objects, while our technique is one order of magnitude faster than state-of-the-art methods for finding the top-k skylines with binary constraints.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126060069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
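The notion of a skyline — the set of objects not Pareto-dominated by any other object — underlies this paper and the probabilistic skyline join paper later in these proceedings. A minimal quadratic-time sketch (not the authors' log-linear algorithm), assuming larger values are preferred in every dimension:

```python
def dominates(a, b):
    """True if a is at least as good as b in every dimension
    and strictly better in at least one (larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (1, 3), (3, 1), (2, 2) are mutually incomparable; (0, 0) is dominated.
result = skyline([(1, 3), (3, 1), (2, 2), (0, 0)])
```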
{"title":"Transparent inclusion, utilization, and validation of main memory domain indexes","authors":"T. Truong, T. Risch","doi":"10.1145/2791347.2791375","DOIUrl":"https://doi.org/10.1145/2791347.2791375","url":null,"abstract":"Main-memory database systems (MMDBs) are viable solutions for many scientific applications. Scientific and engineering data often require special indexing methods, and a large number of domain-specific main-memory indexing implementations have been developed. However, adding an index structure into a database system can be challenging. Mexima (Main-memory External Index Manager) provides an MMDB where new main-memory index structures can be plugged in without modifying the index implementations. This has allowed complex and highly optimized index structures implemented in C/C++ to be plugged into Mexima without code changes. To utilize new user-defined indexes in queries transparently, Mexima automatically transforms query fragments into index operations based on index property tables containing index meta-data. For scalable processing of complex numerical query expressions, Mexima includes an algebraic query transformation mechanism that reasons on numerical expressions to expose potential utilization of indexes. The index property tables furthermore enable validating the correctness of an index implementation by executing automatically generated test queries based on index meta-data. Experiments show that the performance penalty of using an index plugged into Mexima is low compared to using the corresponding stand-alone C/C++ implementation. 
Substantial performance gains are shown by the index-exposing rewrite mechanisms.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127180959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vertical partitioning for query processing over raw data","authors":"Weijie Zhao, Yu Cheng, Florin Rusu","doi":"10.1145/2791347.2791369","DOIUrl":"https://doi.org/10.1145/2791347.2791369","url":null,"abstract":"Traditional databases are not equipped with the adequate functionality to handle the volume and variety of \"Big Data\". Strict schema definition and data loading are prerequisites even for the most primitive query session. Raw data processing has been proposed as a schema-on-demand alternative that provides instant access to the data. When loading is an option, it is driven exclusively by the currently running query, resulting in sub-optimal performance across a query workload. In this paper, we investigate the problem of workload-driven raw data processing with partial loading. We model loading as fully-replicated binary vertical partitioning. We provide a linear mixed integer programming optimization formulation that we prove to be NP-hard. We design a two-stage heuristic that comes within close range of the optimal solution in a fraction of the time. We extend the optimization formulation and the heuristic to pipelined raw data processing, a scenario in which data access and extraction are executed concurrently. 
We provide three case-studies over real data formats that confirm the accuracy of the model when implemented in a state-of-the-art pipelined operator for raw data processing.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117253378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
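The workload-driven partial-loading idea can be illustrated with a much simpler greedy sketch than the paper's MIP formulation or two-stage heuristic: load the columns that save the most raw-extraction work per unit of storage, under a budget. The input shape (`col_stats`, `budget`) is hypothetical, chosen only for illustration:

```python
def choose_columns_to_load(col_stats, budget):
    """Greedy sketch of workload-driven partial loading.

    col_stats maps column name -> (query frequency, per-query raw-extraction
    cost saved if loaded, storage size). Columns are ranked by benefit per
    unit of storage and loaded while the storage budget allows.
    """
    ranked = sorted(col_stats.items(),
                    key=lambda kv: kv[1][0] * kv[1][1] / kv[1][2],
                    reverse=True)
    loaded, used = [], 0
    for name, (freq, cost, size) in ranked:
        if used + size <= budget:
            loaded.append(name)
            used += size
    return loaded

# "a" saves 10*5 units for 2 units of storage, "c" saves 8*2 for 1,
# "b" saves 1*1 for 2; with budget 3 the greedy loads "a" then "c".
chosen = choose_columns_to_load({"a": (10, 5, 2), "b": (1, 1, 2), "c": (8, 2, 1)}, budget=3)
```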
{"title":"DrillBeyond: processing multi-result open world SQL queries","authors":"Julian Eberius, Maik Thiele, Katrin Braunschweig, Wolfgang Lehner","doi":"10.1145/2791347.2791370","DOIUrl":"https://doi.org/10.1145/2791347.2791370","url":null,"abstract":"In a traditional relational database management system, queries can only be defined over attributes defined in the schema, but are guaranteed to give a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result will be a top-k list of possible answers, with no guarantees about the structure or content of the retrieved documents. In this paper, we present DrillBeyond, a novel IR/RDBMS hybrid system, in which the user seamlessly queries a relational database together with a large corpus of tables extracted from a web crawl. The system allows full SQL queries over the relational database, but additionally allows the user to use arbitrary additional attributes in the query that need not be defined in the schema. The system then processes this semi-specified query by computing a top-k list of possible query evaluations, each based on different candidate web data sources, thus mixing properties of RDBMS and IR systems. We design a novel plan operator that encapsulates a web data retrieval and matching system and allows direct integration of such systems into relational query processing. We then present methods for efficiently processing multiple variants of a query, by producing plans that are optimized for large invariant intermediate results that can be reused between multiple query evaluations. 
We demonstrate the viability of the operator and our optimization strategies by implementing them in PostgreSQL and evaluating on a standard benchmark by adding arbitrary attributes to its queries.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114778023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic aggregate skyline join queries: skylines with aggregate operations over existentially uncertain relations","authors":"Arnab Bhattacharya, Shrikant Awate","doi":"10.1145/2791347.2791350","DOIUrl":"https://doi.org/10.1145/2791347.2791350","url":null,"abstract":"The multi-criteria decision making, made possible by the advent of skyline queries, has been successfully applied in many areas. Though most of the earlier work is concerned with only a single relation, several real world applications require finding the skyline set over multiple relations. Consequently, the join operation over skylines, where the preferences are local to each relation and/or on aggregated values of attributes from different relations, has been proposed. Meanwhile, uncertain datasets are finding increasing application in many scientific and real-life situations. The problem of skyline computation for such datasets becomes even more challenging as every object can be classified as a skyline object with some probability. In this paper, we introduce probabilistic aggregate skyline join queries (PASJQ) that ask for objects whose probability of being in the skyline of a join of two uncertain relations is over a query probability threshold. The skyline preferences are on both local and aggregate attributes. Since the naïve algorithm can be impractical, we propose three algorithms to efficiently process such queries. The algorithms process the skylines as much as possible locally before computing the join to reduce the computation burden of finding skylines from the larger joined relation. 
Experiments with real and synthetic data exhibit the practicality and scalability of these algorithms with respect to query probability threshold, cardinality, dimensionality and other parameters of the uncertain relations.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129844307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online template matching over a stream of digitized documents","authors":"M. Stockerl, Christoph Ringlstetter, Matthias Schubert, Eirini Ntoutsi, H. Kriegel","doi":"10.1145/2791347.2791354","DOIUrl":"https://doi.org/10.1145/2791347.2791354","url":null,"abstract":"Although we have been living in the information age for decades, paperwork is still a tedious part of everybody's life. Assistance systems that implement techniques of digitization and document understanding may offer considerable reductions in time and effort for the users. A large portion of paper documents like invoices, delivery receipts or admonitions are based on a fixed company-specific template and therefore exhibit a high degree of similarity. In this work, we propose a template extraction method over a stream of incoming documents and a template allocation method for assigning new instances from the stream to the most suitable templates. Our method employs text augmented by layout information to represent the digital image of the paper document. Document similarity is assessed with respect to both textual and layout parts of the document; the matching terms contribute according to their distance from the query terms. To be more robust against distortions on the documents due to the digitization process, the templates are not static; rather, they are maintained in an online fashion based on their newly assigned documents. 
Experiments on real data show that the combination of textual and layout information with continuous template adaptation through online updates improves template identification quality over earlier proposed methods.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125808968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggregation and multidimensional analysis of big data for large-scale scientific applications: models, issues, analytics, and beyond","authors":"A. Cuzzocrea","doi":"10.1145/2791347.2791377","DOIUrl":"https://doi.org/10.1145/2791347.2791377","url":null,"abstract":"Aggregation and multidimensional analysis are well-known powerful tools for extracting useful knowledge, shaped in a summarized manner, which are being successfully applied to the annoying problem of managing and mining big data produced by large-scale scientific applications. Indeed, in the context of big data analytics, aggregation approaches allow us to provide meaningful descriptions of these data, otherwise impossible for alternative data-intensive analysis tools. On the other hand, multidimensional analysis methodologies introduce fortunate metaphors that significantly emphasize the knowledge discovery phase from such huge amounts of data. Following this main trend, several big data aggregation and multidimensional analysis approaches have been proposed recently. The goal of this paper is to (i) provide a comprehensive overview of state-of-the-art techniques and (ii) depict open research challenges and future directions adhering to the reference scientific field.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128511562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing continuous queries using update propagation with varying granularities","authors":"Andreas Behrend, Ulrike Griefahn, H. Voigt, Philip Schmiegelt","doi":"10.1145/2791347.2791368","DOIUrl":"https://doi.org/10.1145/2791347.2791368","url":null,"abstract":"We investigate the possibility of using update propagation methods for optimizing the evaluation of continuous queries. Update propagation allows for the efficient determination of induced changes to derived relations resulting from an explicitly performed base table update. In order to simplify the computation process, we propose the propagation of updates with different degrees of granularity, which corresponds to an incremental query evaluation with different levels of accuracy. We show how propagation rules for different update granularities can be systematically derived, combined and further optimized by using Magic Sets. This way, the costly evaluation of certain subqueries within a continuous query can be systematically circumvented, considerably cutting down the number of pipelined tuples.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"43 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114196409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Orthogonal mechanism for answering batch queries with differential privacy","authors":"Dong Huang, Shuguo Han, X. Li, Philip S. Yu","doi":"10.1145/2791347.2791378","DOIUrl":"https://doi.org/10.1145/2791347.2791378","url":null,"abstract":"Differential privacy has recently become very promising in achieving data privacy guarantees. Typically, one can achieve ε-differential privacy by adding noise based on the Laplace distribution to a query result. To reduce the noise magnitude for higher accuracy, various techniques have been proposed. They generally require high computational complexity, making them inapplicable to large-scale datasets. In this paper, we propose a novel orthogonal mechanism (OM) to represent a query set Q with a linear combination of a new query set Q', where Q' consists of orthogonal queries and is derived by exploiting the correlations between queries in Q. As a result of the orthogonality of the derived queries, the proposed technique not only greatly reduces computational complexity, but also achieves better accuracy than the existing mechanisms. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed technique.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114499047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
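The ε-differential-privacy baseline this abstract builds on — adding Laplace noise scaled to a query's sensitivity — can be sketched as follows. This is the textbook Laplace mechanism for a single counting query, not the paper's orthogonal mechanism; the function names are illustrative:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(values, predicate, epsilon):
    """epsilon-DP counting query: a count has sensitivity 1 (one person
    changes it by at most 1), so Laplace(1/epsilon) noise suffices."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)
```

Answering a batch of correlated queries this way wastes budget, which is exactly the inefficiency the orthogonal mechanism targets by recombining the queries before adding noise.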