Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang
{"title":"Updateable Data-Driven Cardinality Estimator with Bounded Q-error","authors":"Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang","doi":"arxiv-2408.17209","DOIUrl":"https://doi.org/arxiv-2408.17209","url":null,"abstract":"Modern Cardinality Estimators struggle with data updates. This research tackles this challenge within single-table. We introduce ICE, an Index-based Cardinality Estimator, the first data-driven estimator that enables instant, tuple-leveled updates. ICE has learned two key lessons from the multidimensional index and applied them to solve cardinality estimation in dynamic scenarios: (1) Index possesses the capability for swift training and seamless updating amidst vast multidimensional data. (2) Index offers precise data distribution, staying synchronized with the latest database version. These insights endow the index with the ability to be a highly accurate, data-driven model that rapidly adapts to data updates and is resilient to out-of-distribution challenges during query testing. To make a solitary index support cardinality estimation, we have crafted sophisticated algorithms for training, updating, and estimating, analyzing unbiasedness and variance. Extensive experiments demonstrate the superiority of ICE. ICE offers precise estimations and fast updates/construction across diverse workloads. Compared to state-of-the-art real-time query-driven models, ICE boasts superior accuracy (2-3 orders of magnitude more precise), faster updates (4.7-6.9 times faster), and significantly reduced training time (up to 1-3 orders of magnitude faster).","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Volodymyr A. Shekhovtsov, Bence Slajcho, Aron Sacherer, Johann Eder
{"title":"CollectionLocator Level 1: Metadata-Based Search for Collections in Federated Biobanks","authors":"Volodymyr A. Shekhovtsov, Bence Slajcho, Aron Sacherer, Johann Eder","doi":"arxiv-2408.16422","DOIUrl":"https://doi.org/arxiv-2408.16422","url":null,"abstract":"Biobanks are indispensable resources for medical research collecting biological material and associated data and making them available for research projects and medical studies. For that, the biobank data has to meet certain criteria which can be formulated as adherence to the FAIR (findable, accessible, interoperable and reusable) principles. We developed a tool, CollectionLocator, which aims at increasing the FAIR compliance of biobank data by supporting researchers in identifying which biobank and which collection are likely to contain cases (material and data) satisfying the requirements of a defined research project when the detailed sample data is not available due to privacy restrictions. The CollectionLocator is based on an ontology-based metadata model to address the enormous heterogeneities and ensure the privacy of the donors of the biological samples and the data. Furthermore, the CollectionLocator represents the data and metadata quality of the collections such that the quality requirements of the requester can be matched with the quality of the available data. The concept of CollectionLocator is evaluated with a proof-of-concept implementation.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao
{"title":"MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake","authors":"Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao","doi":"arxiv-2408.16237","DOIUrl":"https://doi.org/arxiv-2408.16237","url":null,"abstract":"Multimodal data has become a crucial element in the realm of big data analytics, driving advancements in data exploration, data mining, and empowering artificial intelligence applications. To support high-quality retrieval for these cutting-edge applications, a robust data retrieval platform should meet the requirements for transparent data storage, rich hybrid queries, effective feature representation, and high query efficiency. However, among the existing platforms, traditional schema-on-write systems, multi-model databases, vector databases, and data lakes, which are the primary options for multimodal data retrieval, are difficult to fulfill these requirements simultaneously. Therefore, there is an urgent need to develop a more versatile multimodal data retrieval platform to address these issues. In this paper, we introduce a Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index based on Data Lake (MQRLD). It leverages the transparent storage capabilities of data lakes, integrates the multimodal open API to provide a unified interface that supports rich hybrid queries, introduces a query-aware multimodal data feature representation strategy to obtain effective features, and offers high-dimensional learned indexes to optimize data query. We conduct a comparative analysis of the query performance of MQRLD against other methods for rich hybrid queries. Our results underscore the superior efficiency of MQRLD in handling multimodal data retrieval tasks, demonstrating its potential to significantly improve retrieval performance in complex environments. We also clarify some potential concerns in the discussion.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan
{"title":"CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases","authors":"Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan","doi":"arxiv-2408.16170","DOIUrl":"https://doi.org/arxiv-2408.16170","url":null,"abstract":"Cardinality estimation is crucial for enabling high query performance in relational databases. Recently learned cardinality estimation models have been proposed to improve accuracy but there is no systematic benchmark or datasets which allows researchers to evaluate the progress made by new learned approaches and even systematically develop new learned approaches. In this paper, we are releasing a benchmark, containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: 1-) instance-based, 2-) zero-shot, and 3-) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single table queries; as soon as we add joins, the accuracy drops. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance specific models. We are open sourcing our scripts to collect statistics, generate queries and training datasets to foster more extensive research, also from the ML community on the important problem of cardinality estimation and in particular improve on recent directions such as pre-trained cardinality estimation.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LLM-assisted Labeling Function Generation for Semantic Type Detection","authors":"Chenjie Li, Dan Zhang, Jin Wang","doi":"arxiv-2408.16173","DOIUrl":"https://doi.org/arxiv-2408.16173","url":null,"abstract":"Detecting semantic types of columns in data lake tables is an important application. A key bottleneck in semantic type detection is the availability of human annotation due to the inherent complexity of data lakes. In this paper, we propose using programmatic weak supervision to assist in annotating the training data for semantic type detection by leveraging labeling functions. One challenge in this process is the difficulty of manually writing labeling functions due to the large volume and low quality of the data lake table datasets. To address this issue, we explore employing Large Language Models (LLMs) for labeling function generation and introduce several prompt engineering strategies for this purpose. We conduct experiments on real-world web table datasets. Based on the initial results, we perform extensive analysis and provide empirical insights and future directions for researchers in this field.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enrique Barra, Sonsoles López-Pernas, Aldo Gordillo, Alejandro Pozo, Andres Muñoz-Arcentales, Javier Conde
{"title":"Empowering Database Learning Through Remote Educational Escape Rooms","authors":"Enrique Barra, Sonsoles López-Pernas, Aldo Gordillo, Alejandro Pozo, Andres Muñoz-Arcentales, Javier Conde","doi":"arxiv-2409.08284","DOIUrl":"https://doi.org/arxiv-2409.08284","url":null,"abstract":"Learning about databases is indispensable for individuals studying software engineering or computer science or those involved in the IT industry. We analyzed a remote educational escape room for teaching about databases in four different higher education courses in two consecutive academic years. We employed three instruments for evaluation: a pre- and post-test to assess the escape room's effectiveness for student learning, a questionnaire to gather students' perceptions, and a Web platform that unobtrusively records students' interactions and performance. We show novel evidence that educational escape rooms conducted remotely can be engaging as well as effective for teaching about databases.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enumeration of Minimal Hitting Sets Parameterized by Treewidth","authors":"Batya Kenig, Dan Shlomo Mizrahi","doi":"arxiv-2408.15776","DOIUrl":"https://doi.org/arxiv-2408.15776","url":null,"abstract":"Enumerating the minimal hitting sets of a hypergraph is a problem which arises in many data management applications that include constraint mining, discovering unique column combinations, and enumerating database repairs. Previously, Eiter et al. showed that the minimal hitting sets of an $n$-vertex hypergraph, with treewidth $w$, can be enumerated with delay $O^*(n^{w})$ (ignoring polynomial factors), with space requirements that scale with the output size. We improve this to fixed-parameter-linear delay, following an FPT preprocessing phase. The memory consumption of our algorithm is exponential with respect to the treewidth of the hypergraph.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Li, Chenyu Ma, Rong Gao, Youxi Wu, Jinyan Li, Wenjian Wang, Xindong Wu
{"title":"Order-preserving pattern mining with forgetting mechanism","authors":"Yan Li, Chenyu Ma, Rong Gao, Youxi Wu, Jinyan Li, Wenjian Wang, Xindong Wu","doi":"arxiv-2408.15563","DOIUrl":"https://doi.org/arxiv-2408.15563","url":null,"abstract":"Order-preserving pattern (OPP) mining is a type of sequential pattern mining method in which a group of ranks of time series is used to represent an OPP. This approach can discover frequent trends in time series. Existing OPP mining algorithms consider data points at different time to be equally important; however, newer data usually have a more significant impact, while older data have a weaker impact. We therefore introduce the forgetting mechanism into OPP mining to reduce the importance of older data. This paper explores the mining of OPPs with forgetting mechanism (OPF) and proposes an algorithm called OPF-Miner that can discover frequent OPFs. OPF-Miner performs two tasks, candidate pattern generation and support calculation. In candidate pattern generation, OPF-Miner employs a maximal support priority strategy and a group pattern fusion strategy to avoid redundant pattern fusions. For support calculation, we propose an algorithm called support calculation with forgetting mechanism, which uses prefix and suffix pattern pruning strategies to avoid redundant support calculations. The experiments are conducted on nine datasets and 12 alternative algorithms. The results verify that OPF-Miner is superior to other competitive algorithms. More importantly, OPF-Miner yields good clustering performance for time series, since the forgetting mechanism is employed. All algorithms can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/OPF-Miner.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia
{"title":"Text2SQL is Not Enough: Unifying AI and Databases with TAG","authors":"Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia","doi":"arxiv-2408.14717","DOIUrl":"https://doi.org/arxiv-2408.14717","url":null,"abstract":"AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shunit Agmon (Technion - Israel Institute of Technology), Amir Gilad (Hebrew University), Brit Youngmann (Technion - Israel Institute of Technology), Shahar Zoarets (Technion - Israel Institute of Technology), Benny Kimelfeld (Technion - Israel Institute of Technology)
{"title":"Finding Convincing Views to Endorse a Claim","authors":"Shunit Agmon (Technion - Israel Institute of Technology), Amir Gilad (Hebrew University), Brit Youngmann (Technion - Israel Institute of Technology), Shahar Zoarets (Technion - Israel Institute of Technology), Benny Kimelfeld (Technion - Israel Institute of Technology)","doi":"arxiv-2408.14974","DOIUrl":"https://doi.org/arxiv-2408.14974","url":null,"abstract":"Recent studies investigated the challenge of assessing the strength of a given claim extracted from a dataset, particularly the claim's potential of being misleading and cherry-picked. We focus on claims that compare answers to an aggregate query posed on a view that selects tuples. The strength of a claim amounts to the question of how likely it is that the view is carefully chosen to support the claim, whereas less careful choices would lead to contradictory claims. We embark on the study of the reverse task that offers a complementary angle in the critical assessment of data-based claims: given a claim, find useful supporting views. The goal of this task is twofold. On the one hand, we aim to assist users in finding significant evidence of phenomena of interest. On the other hand, we wish to provide them with machinery to criticize or counter given claims by extracting evidence of opposing statements. To be effective, the supporting sub-population should be significant and defined by a \"natural\" view. We discuss several measures of naturalness and propose ways of extracting the best views under each measure (and combinations thereof). The main challenge is the computational cost, as naïve search is infeasible. We devise anytime algorithms that deploy two main steps: (1) a preliminary construction of a ranked list of attribute combinations that are assessed using fast-to-compute features, and (2) an efficient search for the actual views based on each attribute combination. We present a thorough experimental study that shows the effectiveness of our algorithms in terms of quality and execution cost. We also present a user study to assess the usefulness of the naturalness measures.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}