{"title":"Intelligent Transaction Scheduling via Conflict Prediction in OLTP DBMS","authors":"Tieying Zhang, Anthony Tomasic, Andrew Pavlo","doi":"arxiv-2409.01675","DOIUrl":"https://doi.org/arxiv-2409.01675","url":null,"abstract":"Current architectures for main-memory online transaction processing (OLTP) database management systems (DBMS) typically use random scheduling to assign transactions to threads. This approach achieves uniform load across threads, but it ignores the likelihood of conflicts between transactions. If the DBMS could estimate the potential for transaction conflict and then intelligently schedule transactions to avoid conflicts, then the system could improve its performance. Such estimation of transaction conflict, however, is non-trivial for several reasons. First, conflicts occur under complex conditions that are far removed in time from the scheduling decision. Second, transactions must be represented in a compact and efficient manner to allow for fast conflict detection. Third, given some evidence of potential conflict, the DBMS must schedule transactions in such a way that minimizes this conflict. In this paper, we systematically explore the design decisions for solving these problems. We then empirically measure the performance impact of different representations on standard OLTP benchmarks. Our results show that intelligent scheduling using a history increases throughput by $\\sim$40% on a 20-core machine.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing Range Consistent Answers to Aggregation Queries via Rewriting","authors":"Aziz Amezian El Khalfioui, Jef Wijsen","doi":"arxiv-2409.01648","DOIUrl":"https://doi.org/arxiv-2409.01648","url":null,"abstract":"We consider the problem of answering conjunctive queries with aggregation on database instances that may violate primary key constraints. In SQL, these queries follow the SELECT-FROM-WHERE-GROUP BY format, where the WHERE-clause involves a conjunction of equalities, and the SELECT-clause can incorporate aggregate operators like MAX, MIN, SUM, AVG, or COUNT. Repairs of a database instance are defined as inclusion-maximal subsets that satisfy all primary keys. For a given query, our primary objective is to identify repairs that yield the lowest aggregated value among all possible repairs. We particularly investigate queries for which this lowest aggregated value can be determined through a rewriting in first-order logic with aggregate operators.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow","authors":"Dean Light, Ahmad Aiashy, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, Benny Kimelfeld","doi":"arxiv-2409.01736","DOIUrl":"https://doi.org/arxiv-2409.01736","url":null,"abstract":"Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from the industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying. This demonstration presents SpannerLib, a library for embedding document spanners in Python code. SpannerLib facilitates the development of IE programs by providing an implementation of Spannerlog (Datalog-based document spanners) that interacts with the Python code in two directions: rules can be embedded inside Python, and they can invoke custom Python code (e.g., calls to ML-based NLP models) via user-defined functions. The demonstration scenarios showcase IE programs, with increasing levels of complexity, within Jupyter Notebook.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilevel Verification on a Single Digital Decentralized Distributed (DDD) Ledger","authors":"Ayush Thada, Aanchal Kandpal, Dipanwita Sinha Mukharjee","doi":"arxiv-2409.11410","DOIUrl":"https://doi.org/arxiv-2409.11410","url":null,"abstract":"This paper presents an approach to using decentralized distributed digital (DDD) ledgers, such as blockchain, with multi-level verification. Regular DDD ledgers like blockchain offer only a single level of verification, which makes them unsuitable for systems that have a hierarchy and require verification at each level. In systems where a hierarchy emerges naturally, incorporating that hierarchy into the solution leads to a better solution. Introducing a hierarchy means that there can be several verifications within a level as well as more than one level of verification. This raises further challenges, induced by the interaction between levels of the hierarchy, that also need to be addressed, such as how a given level verifies the work of the previous level. The paper addresses these issues and provides a road map for tracing the state of the system at any given time and the probability of failure of the system.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BEAVER: An Enterprise Benchmark for Text-to-SQL","authors":"Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker","doi":"arxiv-2409.02038","DOIUrl":"https://doi.org/arxiv-2409.02038","url":null,"abstract":"Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the poor performance is largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the \"dark web\", (2) schemas of enterprise tables are more complex than the schemas in public data, which makes the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset BEAVER, sourced from real enterprise data warehouses together with natural language queries and their correct SQL statements which we collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will facilitate future researchers building more sophisticated text-to-SQL systems which can do better on this important class of data.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Split Learning-based Privacy-Preserving Record Linkage","authors":"Michail Zervas, Alexandros Karakasidis","doi":"arxiv-2409.01088","DOIUrl":"https://doi.org/arxiv-2409.01088","url":null,"abstract":"Split Learning has been recently introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity should be identified among databases from different dataholders, but without disclosing any additional information. In this paper, we investigate the potential of Split Learning for Privacy-Preserving Record Matching, by introducing a novel training method through the utilization of Reference Sets, which are publicly available data corpora, showcasing minimal matching impact against a traditional centralized SVM-based technique.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GQL and SQL/PGQ: Theoretical Models and Expressive Power","authors":"Amélie Gheerbrant, Leonid Libkin, Liat Peterfreund, Alexandra Rogova","doi":"arxiv-2409.01102","DOIUrl":"https://doi.org/arxiv-2409.01102","url":null,"abstract":"SQL/PGQ and GQL are very recent international standards for querying property graphs: SQL/PGQ specifies how to query relational representations of property graphs in SQL, while GQL is a standalone language for graph databases. The rapid industrial development of these standards left the academic community trailing in its wake. While digests of the languages have appeared, we do not yet have concise foundational models like relational algebra and calculus for relational databases that enable the formal study of languages, including their expressiveness and limitations. At the same time, work on the next versions of the standards has already begun, to address the perceived limitations of their first versions. Motivated by this, we initiate a formal study of SQL/PGQ and GQL, concentrating on their concise formal model and expressiveness. For the former, we define simple core languages -- Core GQL and Core PGQ -- that capture the essence of the new standards, are amenable to theoretical analysis, and fully clarify the difference between PGQ's bottom-up evaluation and GQL's linear, or pipelined, approach. Equipped with these models, we both confirm the necessity to extend the language to fill in the expressiveness gaps and identify the source of these deficiencies. We complement our theoretical analysis with an experimental study, demonstrating that existing workarounds in full GQL and PGQ are impractical, which further underscores the necessity to correct deficiencies in the language design.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Serverless Query Processing with Flexible Performance SLAs and Prices","authors":"Haoqiong Bian, Dongyang Geng, Yunpeng Chai, Anastasia Ailamaki","doi":"arxiv-2409.01388","DOIUrl":"https://doi.org/arxiv-2409.01388","url":null,"abstract":"Serverless query processing has become increasingly popular due to its auto-scaling, high elasticity, and pay-as-you-go pricing. It allows cloud data warehouse (or lakehouse) users to focus on data analysis without the burden of managing systems and resources. Accordingly, in serverless query services, users become more concerned about cost-efficiency under acceptable performance than performance under fixed resources. This poses new challenges for serverless query engine design in providing flexible performance service-level agreements (SLAs) and cost-efficiency (i.e., prices). In this paper, we first define the problem of flexible performance SLAs and prices in serverless query processing and discuss its significance. Then, we envision the challenges and solutions for solving this problem and the opportunities it raises for other database research. Finally, we present PixelsDB, an open-source prototype with three service levels supported by dedicated architectural designs. Evaluations show that PixelsDB reduces resource costs by 65.5% for near-real-world workloads generated by Cloud Analytics Benchmark (CAB) while not violating the pending time guarantees.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Traversal Queries of Sensor Data Using a Rule-Based Reachability Approach","authors":"Bryan-Elliott Tam, Ruben Taelman, Julián Rojas Meléndez, Pieter Colpaert","doi":"arxiv-2408.17157","DOIUrl":"https://doi.org/arxiv-2408.17157","url":null,"abstract":"Link Traversal queries face challenges in completeness and long execution time due to the size of the web. Reachability criteria define completeness by restricting the links followed by engines. However, the number of links to dereference remains the bottleneck of the approach. Web environments often have structures exploitable by query engines to prune irrelevant sources. Current criteria rely on using information from the query definition and predefined predicate. However, it is difficult to use them to traverse environments where logical expressions indicate the location of resources. We propose to use a rule-based reachability criterion that captures logical statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia control vocabulary, to prune irrelevant sources. Our preliminary findings show that by using this strategy, the query engine can significantly reduce the number of HTTP requests and the query execution time without sacrificing the completeness of results. Our work shows that the investigation of hypermedia controls in link pruning of traversal queries is a worthy effort for optimizing web queries of unindexed decentralized databases.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empowering Open Data Sharing for Social Good: A Privacy-Aware Approach","authors":"Tânia Carvalho, Luís Antunes, Cristina Costa, Nuno Moniz","doi":"arxiv-2408.17378","DOIUrl":"https://doi.org/arxiv-2408.17378","url":null,"abstract":"The Covid-19 pandemic has affected the world at multiple levels. Data sharing was pivotal for advancing research to understand the underlying causes and implement effective containment strategies. In response, many countries have promoted the availability of daily cases to support research initiatives, fostering collaboration between organisations and making such data available to the public through open data platforms. Despite the several advantages of data sharing, one of the major concerns before releasing health data is its impact on individuals' privacy. Such a sharing process should be based on state-of-the-art methods in Data Protection by Design and by Default. In this paper, we use a data set related to Covid-19 cases in the second largest hospital in Portugal to show how it is feasible to ensure data privacy while improving the quality and maintaining the utility of the data. Our goal is to demonstrate how knowledge exchange in multidisciplinary teams of healthcare practitioners, data privacy, and data science experts is crucial to co-developing strategies that ensure high utility of de-identified data.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}