{"title":"Extending database task schedulers for multi-threaded application code","authors":"Florian Wolf, Iraklis Psaroudakis, Norman May, A. Ailamaki, K. Sattler","doi":"10.1145/2791347.2791379","DOIUrl":"https://doi.org/10.1145/2791347.2791379","url":null,"abstract":"Modern databases can run application logic defined in stored procedures inside the database server to improve application speed. The SQL standard specifies how to call external stored routines implemented in programming languages, such as C, C++, or JAVA, to complement declarative SQL-based application logic. This is beneficial for scientific and analytical algorithms because they are usually too complex to be implemented entirely in SQL. At the same time, database applications like matrix calculations or data mining algorithms benefit from multi-threading to parallelize compute-intensive operations. Multi-threaded application code, however, introduces a resource competition between the threads of applications and the threads of the database task scheduler. In this paper, we show that multi-threaded application code can render the database's workload scheduling ineffective and decrease the core throughput of the database by up to 50%. We present a general approach to address this issue by integrating shared memory programming solutions into the task schedulers of databases. In particular, we describe the integration of OpenMP into databases. We implement and evaluate our approach using SAP HANA. 
Our experiments show that our integration does not introduce overhead, and can improve the throughput of core database operations by up to 15%.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133774289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
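The integration described above is HANA- and OpenMP-specific, but the underlying idea — routing application parallelism through the database's own task scheduler instead of letting application code spawn competing threads — can be sketched in a few lines. This is a minimal conceptual sketch, not the paper's implementation; the names `DB_SCHEDULER` and `parallel_for` are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# A single shared pool stands in for the database's task scheduler.
# Application ("stored procedure") code submits its parallel work here
# instead of creating its own threads, so database tasks and application
# tasks never oversubscribe the available cores.
DB_SCHEDULER = ThreadPoolExecutor(max_workers=4)

def parallel_for(chunks, body):
    """OpenMP-style parallel-for, routed through the shared scheduler."""
    futures = [DB_SCHEDULER.submit(body, chunk) for chunk in chunks]
    return [f.result() for f in futures]

# e.g. a multi-threaded stored procedure summing partitions of a column
partials = parallel_for([range(0, 5), range(5, 10)], sum)
total = sum(partials)
```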
{"title":"Top-k representative queries with binary constraints","authors":"Arijit Khan, Vishwakarma Singh","doi":"10.1145/2791347.2791367","DOIUrl":"https://doi.org/10.1145/2791347.2791367","url":null,"abstract":"Given a collection of binary constraints that categorize whether a data object is relevant or not, we consider the problem of online retrieval of the top-k objects that best represent all other relevant objects in the underlying dataset. Such top-k representative queries naturally arise in a wide range of complex data analytic applications including advertisement, search, and recommendation. In this paper, we aim to identify the top-k representative objects that are high-scoring, satisfy diverse subsets of the given binary constraints, and are representative of various other relevant objects in the dataset. We formulate our problem with the well-established notion of the top-k representative skylines, and we show that the problem is NP-hard. Hence, we design efficient techniques to solve our problem with theoretical performance guarantees. As a by-product of our algorithm, we also improve the asymptotic time-complexity of skyline computation to log-linear time in the number of data points when all dimensions except one are binary in nature. 
Our empirical results attest that the proposed method efficiently finds high-quality top-k representative objects, while our technique is one order of magnitude faster than state-of-the-art methods for finding the top-k skylines with binary constraints.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126060069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
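The notion of a skyline — the set of objects not Pareto-dominated by any other object — underlies this paper and the probabilistic skyline join paper later in these proceedings. A minimal quadratic-time sketch (not the authors' log-linear algorithm), assuming larger values are preferred in every dimension:

```python
def dominates(a, b):
    """True if a is at least as good as b in every dimension
    and strictly better in at least one (larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    """Naive O(n^2) skyline: keep the points no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (1, 3), (3, 1), (2, 2) are mutually incomparable; (0, 0) is dominated.
result = skyline([(1, 3), (3, 1), (2, 2), (0, 0)])
```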
{"title":"Transparent inclusion, utilization, and validation of main memory domain indexes","authors":"T. Truong, T. Risch","doi":"10.1145/2791347.2791375","DOIUrl":"https://doi.org/10.1145/2791347.2791375","url":null,"abstract":"Main-memory database systems (MMDBs) are viable solutions for many scientific applications. Scientific and engineering data often require special indexing methods, and a large number of domain-specific main-memory indexing implementations have been developed. However, adding an index structure into a database system can be challenging. Mexima (Main-memory External Index Manager) provides an MMDB where new main-memory index structures can be plugged in without modifying the index implementations. This has allowed complex and highly optimized index structures implemented in C/C++ to be plugged into Mexima without code changes. To utilize new user-defined indexes in queries transparently, Mexima automatically transforms query fragments into index operations based on index property tables containing index meta-data. For scalable processing of complex numerical query expressions, Mexima includes an algebraic query transformation mechanism that reasons on numerical expressions to expose potential utilization of indexes. The index property tables furthermore enable validating the correctness of an index implementation by executing automatically generated test queries based on index meta-data. Experiments show that the performance penalty of using an index plugged into Mexima is low compared to using the corresponding stand-alone C/C++ implementation. 
Substantial performance gains are shown by the index-exposing rewrite mechanisms.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127180959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vertical partitioning for query processing over raw data","authors":"Weijie Zhao, Yu Cheng, Florin Rusu","doi":"10.1145/2791347.2791369","DOIUrl":"https://doi.org/10.1145/2791347.2791369","url":null,"abstract":"Traditional databases are not equipped with the adequate functionality to handle the volume and variety of \"Big Data\". Strict schema definition and data loading are prerequisites even for the most primitive query session. Raw data processing has been proposed as a schema-on-demand alternative that provides instant access to the data. When loading is an option, it is driven exclusively by the currently running query, resulting in sub-optimal performance across a query workload. In this paper, we investigate the problem of workload-driven raw data processing with partial loading. We model loading as fully-replicated binary vertical partitioning. We provide a linear mixed integer programming optimization formulation that we prove to be NP-hard. We design a two-stage heuristic that comes within close range of the optimal solution in a fraction of the time. We extend the optimization formulation and the heuristic to pipelined raw data processing, a scenario in which data access and extraction are executed concurrently. 
We provide three case-studies over real data formats that confirm the accuracy of the model when implemented in a state-of-the-art pipelined operator for raw data processing.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117253378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
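The workload-driven partial-loading idea can be illustrated with a much simpler greedy sketch than the paper's MIP formulation or two-stage heuristic: load the columns that save the most raw-extraction work per unit of storage, under a budget. The input shape (`col_stats`, `budget`) is hypothetical, chosen only for illustration:

```python
def choose_columns_to_load(col_stats, budget):
    """Greedy sketch of workload-driven partial loading.

    col_stats maps column name -> (query frequency, per-query raw-extraction
    cost saved if loaded, storage size). Columns are ranked by benefit per
    unit of storage and loaded while the storage budget allows.
    """
    ranked = sorted(col_stats.items(),
                    key=lambda kv: kv[1][0] * kv[1][1] / kv[1][2],
                    reverse=True)
    loaded, used = [], 0
    for name, (freq, cost, size) in ranked:
        if used + size <= budget:
            loaded.append(name)
            used += size
    return loaded

# "a" saves 10*5 units for 2 units of storage, "c" saves 8*2 for 1,
# "b" saves 1*1 for 2; with budget 3 the greedy loads "a" then "c".
chosen = choose_columns_to_load({"a": (10, 5, 2), "b": (1, 1, 2), "c": (8, 2, 1)}, budget=3)
```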
{"title":"DrillBeyond: processing multi-result open world SQL queries","authors":"Julian Eberius, Maik Thiele, Katrin Braunschweig, Wolfgang Lehner","doi":"10.1145/2791347.2791370","DOIUrl":"https://doi.org/10.1145/2791347.2791370","url":null,"abstract":"In a traditional relational database management system, queries can only be defined over attributes defined in the schema, but are guaranteed to give a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result will be a top-k list of possible answers, with no guarantees about the structure or content of the retrieved documents. In this paper, we present DrillBeyond, a novel IR/RDBMS hybrid system, in which the user seamlessly queries a relational database together with a large corpus of tables extracted from a web crawl. The system allows full SQL queries over the relational database, but additionally allows the user to use arbitrary additional attributes in the query that need not be defined in the schema. The system then processes this semi-specified query by computing a top-k list of possible query evaluations, each based on different candidate web data sources, thus mixing properties of RDBMS and IR systems. We design a novel plan operator that encapsulates a web data retrieval and matching system and allows direct integration of such systems into relational query processing. We then present methods for efficiently processing multiple variants of a query, by producing plans that are optimized for large invariant intermediate results that can be reused between multiple query evaluations. 
We demonstrate the viability of the operator and our optimization strategies by implementing them in PostgreSQL and evaluating on a standard benchmark by adding arbitrary attributes to its queries.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114778023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probabilistic aggregate skyline join queries: skylines with aggregate operations over existentially uncertain relations","authors":"Arnab Bhattacharya, Shrikant Awate","doi":"10.1145/2791347.2791350","DOIUrl":"https://doi.org/10.1145/2791347.2791350","url":null,"abstract":"The multi-criteria decision making, made possible by the advent of skyline queries, has been successfully applied in many areas. Though most of the earlier work is concerned with only a single relation, several real world applications require finding the skyline set over multiple relations. Consequently, the join operation over skylines, where the preferences are local to each relation and/or on aggregated values of attributes from different relations, has been proposed. Meanwhile, uncertain datasets are finding increasing application in many scientific and real-life situations. The problem of skyline computation for such datasets becomes even more challenging as every object can be classified as a skyline object with some probability. In this paper, we introduce probabilistic aggregate skyline join queries (PASJQ) that ask for objects whose probability of being in the skyline of a join of two uncertain relations is over a query probability threshold. The skyline preferences are on both local and aggregate attributes. Since the naïve algorithm can be impractical, we propose three algorithms to efficiently process such queries. The algorithms process the skylines as much as possible locally before computing the join to reduce the computation burden of finding skylines from the larger joined relation. 
Experiments with real and synthetic data exhibit the practicality and scalability of these algorithms with respect to query probability threshold, cardinality, dimensionality and other parameters of the uncertain relations.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129844307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online template matching over a stream of digitized documents","authors":"M. Stockerl, Christoph Ringlstetter, Matthias Schubert, Eirini Ntoutsi, H. Kriegel","doi":"10.1145/2791347.2791354","DOIUrl":"https://doi.org/10.1145/2791347.2791354","url":null,"abstract":"Although we have been living in the information age for decades, paperwork is still a tedious part of everybody's life. Assistance systems that implement techniques of digitization and document understanding may offer considerable reductions in time and effort for the users. A large portion of paper documents like invoices, delivery receipts or admonitions are based on a fixed company-specific template and therefore exhibit a high degree of similarity. In this work, we propose a template extraction method over a stream of incoming documents and a template allocation method for assigning new instances from the stream to the most suitable templates. Our method employs text augmented by layout information to represent the digital image of the paper document. Document similarity is assessed with respect to both textual and layout parts of the document; the matching terms contribute according to their distance from the query terms. To be more robust against distortions on the documents due to the digitization process, the templates are not static; rather, they are maintained in an online fashion based on their newly assigned documents. 
Experiments on real data show that the combination of textual and layout information with continuous template adaptation through online updates improves template identification quality over earlier proposed methods.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125808968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggregation and multidimensional analysis of big data for large-scale scientific applications: models, issues, analytics, and beyond","authors":"A. Cuzzocrea","doi":"10.1145/2791347.2791377","DOIUrl":"https://doi.org/10.1145/2791347.2791377","url":null,"abstract":"Aggregation and multidimensional analysis are well-known powerful tools for extracting useful knowledge, shaped in a summarized manner, which are being successfully applied to the annoying problem of managing and mining big data produced by large-scale scientific applications. Indeed, in the context of big data analytics, aggregation approaches allow us to provide meaningful descriptions of these data, otherwise impossible for alternative data-intensive analysis tools. On the other hand, multidimensional analysis methodologies introduce fortunate metaphors that significantly emphasize the knowledge discovery phase from such huge amounts of data. Following this main trend, several big data aggregation and multidimensional analysis approaches have been proposed recently. The goal of this paper is to (i) provide a comprehensive overview of state-of-the-art techniques and (ii) depict open research challenges and future directions adhering to the reference scientific field.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128511562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing continuous queries using update propagation with varying granularities","authors":"Andreas Behrend, Ulrike Griefahn, H. Voigt, Philip Schmiegelt","doi":"10.1145/2791347.2791368","DOIUrl":"https://doi.org/10.1145/2791347.2791368","url":null,"abstract":"We investigate the possibility of using update propagation methods for optimizing the evaluation of continuous queries. Update propagation allows for the efficient determination of induced changes to derived relations resulting from an explicitly performed base table update. In order to simplify the computation process, we propose the propagation of updates with different degrees of granularity, which corresponds to an incremental query evaluation with different levels of accuracy. We show how propagation rules for different update granularities can be systematically derived, combined and further optimized by using Magic Sets. This way, the costly evaluation of certain subqueries within a continuous query can be systematically circumvented, considerably cutting down the number of pipelined tuples.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"43 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114196409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Orthogonal mechanism for answering batch queries with differential privacy","authors":"Dong Huang, Shuguo Han, X. Li, Philip S. Yu","doi":"10.1145/2791347.2791378","DOIUrl":"https://doi.org/10.1145/2791347.2791378","url":null,"abstract":"Differential privacy has recently become very promising in achieving data privacy guarantees. Typically, one can achieve ε-differential privacy by adding noise based on the Laplace distribution to a query result. To reduce the noise magnitude for higher accuracy, various techniques have been proposed. They generally require high computational complexity, making them inapplicable to large-scale datasets. In this paper, we propose a novel orthogonal mechanism (OM) to represent a query set Q with a linear combination of a new query set Q', where Q' consists of orthogonal queries and is derived by exploiting the correlations between queries in Q. As a result of the orthogonality of the derived queries, the proposed technique not only greatly reduces computational complexity, but also achieves better accuracy than the existing mechanisms. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed technique.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114499047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
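The ε-differential-privacy baseline this abstract builds on — adding Laplace noise scaled to a query's sensitivity — can be sketched as follows. This is the textbook Laplace mechanism for a single counting query, not the paper's orthogonal mechanism; the function names are illustrative:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(values, predicate, epsilon):
    """epsilon-DP counting query: a count has sensitivity 1 (one person
    changes it by at most 1), so Laplace(1/epsilon) noise suffices."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)
```

Answering a batch of correlated queries this way wastes budget, which is exactly the inefficiency the orthogonal mechanism targets by recombining the queries before adding noise.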