arXiv - CS - Databases: Latest Papers, Page 4

Updateable Data-Driven Cardinality Estimator with Bounded Q-error

arXiv - CS - Databases Pub Date : 2024-08-30 DOI: arxiv-2408.17209

Yingze Li, Xianglong Liu, Hongzhi Wang, Kaixin Zhang, Zixuan Wang

Abstract: Modern cardinality estimators struggle with data updates. This research tackles that challenge in the single-table setting. We introduce ICE, an Index-based Cardinality Estimator, the first data-driven estimator that enables instant, tuple-level updates. ICE has learned two key lessons from the multidimensional index and applied them to solve cardinality estimation in dynamic scenarios: (1) an index can be trained swiftly and updated seamlessly amid vast multidimensional data; (2) an index offers a precise data distribution, staying synchronized with the latest database version. These insights endow the index with the ability to be a highly accurate, data-driven model that rapidly adapts to data updates and is resilient to out-of-distribution challenges during query testing. To make a solitary index support cardinality estimation, we have crafted sophisticated algorithms for training, updating, and estimating, and we analyze their unbiasedness and variance. Extensive experiments demonstrate the superiority of ICE. ICE offers precise estimations and fast updates/construction across diverse workloads. Compared to state-of-the-art real-time query-driven models, ICE boasts superior accuracy (2-3 orders of magnitude more precise), faster updates (4.7-6.9 times faster), and significantly reduced training time (up to 1-3 orders of magnitude faster).
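The title of this entry refers to Q-error, which the abstract does not define. As a reading aid, here is a minimal sketch of the metric as it is commonly defined in the cardinality estimation literature: the larger of estimate/actual and actual/estimate. The function name `q_error` and the `floor` parameter (which clamps both values to at least 1 to avoid division by zero) are assumptions made for illustration and are not taken from the ICE paper.

```python
def q_error(estimate: float, actual: float, floor: float = 1.0) -> float:
    """Return the multiplicative factor by which an estimate misses the truth.

    Both values are clamped to `floor` (an assumed convention) so that
    empty results do not cause a division by zero.
    """
    est = max(estimate, floor)
    act = max(actual, floor)
    # Symmetric: under- and over-estimation by the same factor score equally.
    return max(est / act, act / est)


if __name__ == "__main__":
    print(q_error(50, 100))    # underestimate by a factor of 2
    print(q_error(1000, 100))  # overestimate by a factor of 10
```

A perfect estimate scores 1, and a "bounded Q-error" guarantee means this ratio stays below some fixed constant.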

Citations: 0

CollectionLocator Level 1: Metadata-Based Search for Collections in Federated Biobanks

arXiv - CS - Databases Pub Date : 2024-08-29 DOI: arxiv-2408.16422

Volodymyr A. Shekhovtsov, Bence Slajcho, Aron Sacherer, Johann Eder

Abstract: Biobanks are indispensable resources for medical research, collecting biological material and associated data and making them available for research projects and medical studies. For that, the biobank data has to meet certain criteria, which can be formulated as adherence to the FAIR (findable, accessible, interoperable and reusable) principles. We developed a tool, CollectionLocator, which aims at increasing the FAIR compliance of biobank data by supporting researchers in identifying which biobank and which collection are likely to contain cases (material and data) satisfying the requirements of a defined research project when the detailed sample data is not available due to privacy restrictions. The CollectionLocator is based on an ontology-based metadata model to address the enormous heterogeneities and ensure the privacy of the donors of the biological samples and the data. Furthermore, the CollectionLocator represents the data and metadata quality of the collections such that the quality requirements of the requester can be matched with the quality of the available data. The concept of CollectionLocator is evaluated with a proof-of-concept implementation.

Citations: 0

MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake

arXiv - CS - Databases Pub Date : 2024-08-29 DOI: arxiv-2408.16237

Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao

Abstract: Multimodal data has become a crucial element in the realm of big data analytics, driving advancements in data exploration, data mining, and empowering artificial intelligence applications. To support high-quality retrieval for these cutting-edge applications, a robust data retrieval platform should meet the requirements for transparent data storage, rich hybrid queries, effective feature representation, and high query efficiency. However, the existing platforms that are the primary options for multimodal data retrieval (traditional schema-on-write systems, multi-model databases, vector databases, and data lakes) struggle to fulfill these requirements simultaneously. Therefore, there is an urgent need to develop a more versatile multimodal data retrieval platform to address these issues. In this paper, we introduce a Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index based on Data Lake (MQRLD). It leverages the transparent storage capabilities of data lakes, integrates the multimodal open API to provide a unified interface that supports rich hybrid queries, introduces a query-aware multimodal data feature representation strategy to obtain effective features, and offers high-dimensional learned indexes to optimize data query. We conduct a comparative analysis of the query performance of MQRLD against other methods for rich hybrid queries. Our results underscore the superior efficiency of MQRLD in handling multimodal data retrieval tasks, demonstrating its potential to significantly improve retrieval performance in complex environments. We also clarify some potential concerns in the discussion.

Citations: 0

CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

arXiv - CS - Databases Pub Date : 2024-08-28 DOI: arxiv-2408.16170

Yannis Chronis, Yawen Wang, Yu Gan, Sami Abu-El-Haija, Chelsea Lin, Carsten Binnig, Fatma Özcan

Abstract: Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches or to systematically develop new learned approaches. In this paper, we are releasing a benchmark containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single-table queries, the accuracy drops as soon as we add joins. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance-specific models. We are open sourcing our scripts to collect statistics and to generate queries and training datasets, to foster more extensive research, also from the ML community, on the important problem of cardinality estimation and in particular to improve on recent directions such as pre-trained cardinality estimation.

Citations: 0

LLM-assisted Labeling Function Generation for Semantic Type Detection

arXiv - CS - Databases Pub Date : 2024-08-28 DOI: arxiv-2408.16173

Chenjie Li, Dan Zhang, Jin Wang

Citations: 0

Empowering Database Learning Through Remote Educational Escape Rooms

arXiv - CS - Databases Pub Date : 2024-08-28 DOI: arxiv-2409.08284

Enrique Barra, Sonsoles López-Pernas, Aldo Gordillo, Alejandro Pozo, Andres Muñoz-Arcentales, Javier Conde

Citations: 0

Enumeration of Minimal Hitting Sets Parameterized by Treewidth

arXiv - CS - Databases Pub Date : 2024-08-28 DOI: arxiv-2408.15776

Batya Kenig, Dan Shlomo Mizrahi

Citations: 0

Order-preserving pattern mining with forgetting mechanism

arXiv - CS - Databases Pub Date : 2024-08-28 DOI: arxiv-2408.15563

Yan Li, Chenyu Ma, Rong Gao, Youxi Wu, Jinyan Li, Wenjian Wang, Xindong Wu

Abstract: Order-preserving pattern (OPP) mining is a type of sequential pattern mining method in which a group of ranks of time series is used to represent an OPP. This approach can discover frequent trends in time series. Existing OPP mining algorithms consider data points at different times to be equally important; however, newer data usually have a more significant impact, while older data have a weaker impact. We therefore introduce the forgetting mechanism into OPP mining to reduce the importance of older data. This paper explores the mining of OPPs with forgetting mechanism (OPF) and proposes an algorithm called OPF-Miner that can discover frequent OPFs. OPF-Miner performs two tasks, candidate pattern generation and support calculation. In candidate pattern generation, OPF-Miner employs a maximal support priority strategy and a group pattern fusion strategy to avoid redundant pattern fusions. For support calculation, we propose an algorithm called support calculation with forgetting mechanism, which uses prefix and suffix pattern pruning strategies to avoid redundant support calculations. The experiments are conducted on nine datasets with 12 alternative algorithms. The results verify that OPF-Miner is superior to other competitive algorithms. More importantly, OPF-Miner yields good clustering performance for time series, since the forgetting mechanism is employed. All algorithms can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/OPF-Miner.
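To make the idea in this abstract concrete, the following is a rough sketch of how a rank-based pattern can be matched against a time series and how older matches can be down-weighted. It is not the OPF-Miner algorithm: the abstract does not specify the forgetting function, so the exponential decay, the `decay` parameter, and the function names `relative_order` and `weighted_support` are all assumptions made purely for illustration, and no pruning or pattern fusion is attempted.

```python
import math


def relative_order(window):
    """Return the rank (1 = smallest) of each value in the window.

    Ties are broken by position, which is an assumed convention.
    """
    order = sorted(range(len(window)), key=lambda i: window[i])
    ranks = [0] * len(window)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return tuple(ranks)


def weighted_support(series, pattern, decay=0.1):
    """Sum a recency weight over every window whose ranks equal `pattern`.

    The weight exp(-decay * age) is a hypothetical forgetting function:
    the most recent window has age 0 and therefore weight 1.
    """
    m, n = len(pattern), len(series)
    total = 0.0
    for start in range(n - m + 1):
        if relative_order(series[start:start + m]) == tuple(pattern):
            age = (n - m) - start
            total += math.exp(-decay * age)
    return total


if __name__ == "__main__":
    series = [1, 3, 2, 4, 6, 5]
    # The "up then partly down" trend (1, 3, 2) occurs at positions 0 and 3.
    print(weighted_support(series, (1, 3, 2), decay=0.0))
    print(weighted_support(series, (1, 3, 2), decay=0.1))
```

With `decay=0.0` every occurrence counts equally, which corresponds to the classic unweighted support; any positive `decay` makes the older occurrence contribute less than the recent one.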

Citations: 0

Text2SQL is Not Enough: Unifying AI and Databases with TAG

arXiv - CS - Databases Pub Date : 2024-08-27 DOI: arxiv-2408.14717

Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia

Abstract: AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.

Citations: 0

Finding Convincing Views to Endorse a Claim

arXiv - CS - Databases Pub Date : 2024-08-27 DOI: arxiv-2408.14974

Shunit Agmon (Technion - Israel Institute of Technology), Amir Gilad (Hebrew University), Brit Youngmann (Technion - Israel Institute of Technology), Shahar Zoarets (Technion - Israel Institute of Technology), Benny Kimelfeld (Technion - Israel Institute of Technology)

Abstract: Recent studies investigated the challenge of assessing the strength of a given claim extracted from a dataset, particularly the claim's potential of being misleading and cherry-picked. We focus on claims that compare answers to an aggregate query posed on a view that selects tuples. The strength of a claim amounts to the question of how likely it is that the view is carefully chosen to support the claim, whereas less careful choices would lead to contradictory claims. We embark on the study of the reverse task that offers a complementary angle in the critical assessment of data-based claims: given a claim, find useful supporting views. The goal of this task is twofold. On the one hand, we aim to assist users in finding significant evidence of phenomena of interest. On the other hand, we wish to provide them with machinery to criticize or counter given claims by extracting evidence of opposing statements. To be effective, the supporting sub-population should be significant and defined by a "natural" view. We discuss several measures of naturalness and propose ways of extracting the best views under each measure (and combinations thereof). The main challenge is the computational cost, as naïve search is infeasible. We devise anytime algorithms that deploy two main steps: (1) a preliminary construction of a ranked list of attribute combinations that are assessed using fast-to-compute features, and (2) an efficient search for the actual views based on each attribute combination. We present a thorough experimental study that shows the effectiveness of our algorithms in terms of quality and execution cost. We also present a user study to assess the usefulness of the naturalness measures.

Citations: 0