Learned Indexes with Distribution Smoothing via Virtual Points
Kasun Amarasinghe, Farhana Choudhury, Jianzhong Qi, James Bailey
arXiv:2408.06134 (2024-08-12)

Recent research on learned indexes has created a new perspective for indexes as models that map keys to their respective storage locations. These learned indexes are built to approximate the cumulative distribution function of the key set, where a single model may have limited accuracy. To overcome this limitation, a typical method is to use multiple models arranged hierarchically, where query performance depends on two aspects: (i) traversal time to find the correct model and (ii) search time to find the key within the selected model. Such a method may cause key space regions that are difficult to model to be placed at deeper levels of the hierarchy. To address this issue, we propose an alternative method that modifies the key space rather than the index structure or models. This is achieved by making the key set more learnable (i.e., smoothing its distribution) through the insertion of virtual points. Further, we develop an algorithm named CSV that integrates our virtual point insertion method into existing learned indexes, reducing both their traversal and search time. We implement CSV on state-of-the-art learned indexes and evaluate them on real-world datasets. Extensive experimental results show significant query performance improvements for keys at deeper levels of the index structures, at a low storage cost.
Exploiting Formal Concept Analysis for Data Modeling in Data Lakes
Anes Bendimerad, Romain Mathonat, Youcef Remil, Mehdi Kaytoue
arXiv:2408.13265 (2024-08-11)

Data lakes are widely used to store extensive and heterogeneous datasets for advanced analytics. However, the unstructured nature of data in these repositories makes it difficult to exploit them and extract meaningful insights, motivating the need for efficient approaches to consolidating data lakes and deriving a common, unified schema. This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA) to systematically clean, organize, and design data structures within a data lake. We explore diverse data structures stored in our data lake at Infologic, including InfluxDB measurements and Elasticsearch indexes, aiming to derive conventions for a more accessible data model. Leveraging FCA, we represent data structures as objects, analyze the resulting concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema. Our methodology yields significant results, enabling the identification of common concepts in the data structures, such as resources along with their underlying shared fields (timestamp, type, usedRatio, etc.). Moreover, the number of distinct data structure field names is reduced by 54 percent (from 190 to 88) in the studied subset of our data lake. We achieve coverage of 80 percent of data structures with only 34 distinct field names, a significant improvement over the 121 field names initially needed to reach the same coverage. The paper provides insights into the Infologic ecosystem, the problem formulation, exploration strategies, and both qualitative and quantitative results.
{"title":"Memento Filter: A Fast, Dynamic, and Robust Range Filter","authors":"Navid Eslami, Niv Dayan","doi":"arxiv-2408.05625","DOIUrl":"https://doi.org/arxiv-2408.05625","url":null,"abstract":"Range filters are probabilistic data structures that answer approximate range\u0000emptiness queries. They aid in avoiding processing empty range queries and have\u0000use cases in many application domains such as key-value stores and social web\u0000analytics. However, current range filter designs do not support dynamically\u0000changing and growing datasets. Moreover, several of these designs also exhibit\u0000impractically high false positive rates under correlated workloads, which are\u0000common in practice. These impediments restrict the applicability of range\u0000filters across a wide range of use cases. We introduce Memento filter, the first range filter to offer dynamicity, fast\u0000operations, and a robust false positive rate guarantee for any workload.\u0000Memento filter partitions the key universe and clusters its keys according to\u0000this partitioning. For each cluster, it stores a fingerprint and a list of key\u0000suffixes contiguously. The encoding of these lists makes them amenable to\u0000existing dynamic filter structures. Due to the well-defined one-to-one mapping\u0000from keys to suffixes, Memento filter supports inserts and deletes and can even\u0000expand to accommodate a growing dataset. We implement Memento filter on top of a Rank-and-Select Quotient filter and\u0000InfiniFilter and demonstrate that it achieves competitive false positive rates\u0000and performance with the state-of-the-art while also providing dynamicity. Due\u0000to its dynamicity, Memento filter is the first range filter applicable to\u0000B-Trees. We showcase this by integrating Memento filter into WiredTiger, a\u0000B-Tree-based key-value store. Memento filter doubles WiredTiger's range query\u0000throughput when 50% of the queries are empty while keeping all other cost\u0000metrics unharmed.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-Driven Index Trimming: A Data Quality Perspective to Enhancing Precision of RALMs","authors":"Kexin Ma, Ruochun Jin, Xi Wang, Huan Chen, Jing Ren, Yuhua Tang","doi":"arxiv-2408.05524","DOIUrl":"https://doi.org/arxiv-2408.05524","url":null,"abstract":"Retrieval-Augmented Large Language Models (RALMs) have made significant\u0000strides in enhancing the accuracy of generated responses.However, existing\u0000research often overlooks the data quality issues within retrieval results,\u0000often caused by inaccurate existing vector-distance-based retrieval methods.We\u0000propose to boost the precision of RALMs' answers from a data quality\u0000perspective through the Context-Driven Index Trimming (CDIT) framework, where\u0000Context Matching Dependencies (CMDs) are employed as logical data quality rules\u0000to capture and regulate the consistency between retrieved contexts.Based on the\u0000semantic comprehension capabilities of Large Language Models (LLMs), CDIT can\u0000effectively identify and discard retrieval results that are inconsistent with\u0000the query context and further modify indexes in the database, thereby improving\u0000answer quality.Experiments demonstrate on challenging question-answering\u0000tasks.Also, the flexibility of CDIT is verified through its compatibility with\u0000various language models and indexing methods, which offers a promising approach\u0000to bolster RALMs' data quality and retrieval precision jointly.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simpler is More: Efficient Top-K Nearest Neighbors Search on Large Road Networks
Yiqi Wang, Long Yuan, Wenjie Zhang, Xuemin Lin, Zi Chen, Qing Liu
arXiv:2408.05432 (2024-08-10)

The top-k nearest neighbors (kNN) problem on road networks has numerous applications in location-based services. As direct search using Dijkstra's algorithm results in a large search space, a plethora of complex index-based approaches have been proposed to speed up query processing. However, even with the current state-of-the-art approach, long query processing delays persist, along with significant space overhead and prohibitively long indexing time. In this paper, we depart from the complex index designs prevalent in the existing literature and propose a simple index named KNN-Index. With KNN-Index, we can answer a kNN query optimally and progressively with a small, size-bounded index. To improve index construction performance, we propose a bidirectional construction algorithm that effectively shares common computation during construction. Theoretical analysis and experimental results on real road networks demonstrate the superiority of KNN-Index over the state-of-the-art approach in query processing performance, index size, and index construction efficiency.
{"title":"SEA-SQL: Semantic-Enhanced Text-to-SQL with Adaptive Refinement","authors":"Chaofan Li, Yingxia Shao, Zheng Liu","doi":"arxiv-2408.04919","DOIUrl":"https://doi.org/arxiv-2408.04919","url":null,"abstract":"Recent advancements in large language models (LLMs) have significantly\u0000contributed to the progress of the Text-to-SQL task. A common requirement in\u0000many of these works is the post-correction of SQL queries. However, the\u0000majority of this process entails analyzing error cases to develop prompts with\u0000rules that eliminate model bias. And there is an absence of execution\u0000verification for SQL queries. In addition, the prevalent techniques primarily\u0000depend on GPT-4 and few-shot prompts, resulting in expensive costs. To\u0000investigate the effective methods for SQL refinement in a cost-efficient\u0000manner, we introduce Semantic-Enhanced Text-to-SQL with Adaptive Refinement\u0000(SEA-SQL), which includes Adaptive Bias Elimination and Dynamic Execution\u0000Adjustment, aims to improve performance while minimizing resource expenditure\u0000with zero-shot prompts. Specifically, SEA-SQL employs a semantic-enhanced\u0000schema to augment database information and optimize SQL queries. During the SQL\u0000query generation, a fine-tuned adaptive bias eliminator is applied to mitigate\u0000inherent biases caused by the LLM. The dynamic execution adjustment is utilized\u0000to guarantee the executability of the bias eliminated SQL query. We conduct\u0000experiments on the Spider and BIRD datasets to demonstrate the effectiveness of\u0000this framework. The results demonstrate that SEA-SQL achieves state-of-the-art\u0000performance in the GPT3.5 scenario with 9%-58% of the generation cost.\u0000Furthermore, SEA-SQL is comparable to GPT-4 with only 0.9%-5.3% of the\u0000generation cost.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?
Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang
arXiv:2408.05109 (2024-08-09)

Translating users' natural language queries (NL) into SQL queries (i.e., NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly enhanced by the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs, covering the entire lifecycle from four aspects: (1) Model: NL2SQL translation techniques that tackle not only NL ambiguity and under-specification but also the proper mapping of NL to database schemas and instances; (2) Data: from the collection of training data and data synthesis to address training data scarcity, to NL2SQL benchmarks; (3) Evaluation: evaluating NL2SQL methods from multiple angles, using different metrics and granularities; and (4) Error Analysis: analyzing NL2SQL errors to find their root causes and guide NL2SQL models to evolve. Moreover, we provide a rule of thumb for developing NL2SQL solutions. Finally, we discuss the research challenges and open problems of NL2SQL in the LLM era.
HotStuff-1: Linear Consensus with One-Phase Speculation
Dakai Kang, Suyash Gupta, Dahlia Malkhi, Mohammad Sadoghi
arXiv:2408.04728 (2024-08-08)

This paper introduces HotStuff-1, a BFT consensus protocol that improves the latency of HotStuff-2 by two network hops while maintaining linear communication complexity against faults. Additionally, HotStuff-1 incorporates an incentive-compatible leader rotation regime that motivates leaders to commit consensus decisions promptly.

HotStuff-1 achieves the two-network-hop reduction by speculatively sending clients early finality confirmations after one phase of the protocol. Unlike previous speculation regimes, the early finality confirmation path of HotStuff-1 is fault-tolerant, and the latency improvement does not rely on optimism. An important consideration for speculation regimes in general, referred to as the prefix speculation dilemma, is exposed and resolved.

HotStuff-1 embodies an additional mechanism, slotting, that thwarts real-world delays caused by rationally incentivized leaders, who may also be inclined to sabotage each other's progress. The slotting mechanism allows leaders to drive multiple decisions, mitigating both threats, while dynamically adapting the number of allowed decisions per leader to network transmission delays.
{"title":"CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding","authors":"Sophia Ho, Jinsol Park, Patrick Wang","doi":"arxiv-2408.04678","DOIUrl":"https://doi.org/arxiv-2408.04678","url":null,"abstract":"We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign\u0000of REST that allows it to be effectively \"compacted\". REST is a drafting\u0000technique for speculative decoding based on retrieving exact n-gram matches of\u0000the most recent n tokens generated by the target LLM from a datastore. The key\u0000idea of CREST is to only store a subset of the smallest and most common n-grams\u0000in the datastore with the hope of achieving comparable performance with less\u0000storage space. We found that storing a subset of n-grams both reduces storage\u0000space and improves performance. CREST matches REST's accepted token length with\u000010.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance\u0000length than REST using the same storage space on the HumanEval and MT Bench\u0000benchmarks.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Programmable Dataflows: Abstraction and Programming Model for Data Sharing","authors":"Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez","doi":"arxiv-2408.04092","DOIUrl":"https://doi.org/arxiv-2408.04092","url":null,"abstract":"Data sharing is central to a wide variety of applications such as fraud\u0000detection, ad matching, and research. The lack of data sharing abstractions\u0000makes the solution to each data sharing problem bespoke and cost-intensive,\u0000hampering value generation. In this paper, we first introduce a data sharing\u0000model to represent every data sharing problem with a sequence of dataflows.\u0000From the model, we distill an abstraction, the contract, which agents use to\u0000communicate the intent of a dataflow and evaluate its consequences, before the\u0000dataflow takes place. This helps agents move towards a common sharing goal\u0000without violating any regulatory and privacy constraints. Then, we design and\u0000implement the contract programming model (CPM), which allows agents to program\u0000data sharing applications catered to each problem's needs. Contracts permit data sharing, but their interactive nature may introduce\u0000inefficiencies. To mitigate those inefficiencies, we extend the CPM so that it\u0000can save intermediate outputs of dataflows, and skip computation if a dataflow\u0000tries to access data that it does not have access to. In our evaluation, we\u0000show that 1) the contract abstraction is general enough to represent a wide\u0000range of sharing problems, 2) we can write programs for complex data sharing\u0000problems and exhibit qualitative improvements over other alternate\u0000technologies, and 3) quantitatively, our optimizations make sharing programs\u0000written with the CPM efficient.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141938141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}