High Throughput Shortest Distance Query Processing on Large Dynamic Road Networks
Xinjie Zhou, Mengxuan Zhang, Lei Li, Xiaofang Zhou
arXiv - CS - Databases · arXiv-2409.06148 · 2024-09-10

Abstract: Shortest path (SP) computation is the building block for many location-based services, and achieving high-throughput SP query processing is an essential goal for the real-time response of those services. However, the large number of queries submitted in large-scale dynamic road networks still poses challenges to this goal. In this work, we propose a novel framework for processing SP queries with high throughput in large and dynamic road networks by leveraging the Partitioned Shortest Path (PSP) index. Specifically, we first put forward a cross-boundary strategy to accelerate the query processing of the PSP index and analyze its efficiency upper bound by identifying the curse of PSP index query efficiency. We then propose a non-trivial Partitioned Multi-stage Hub Labeling (PMHL) that utilizes multiple PSP strategies and thread parallelization to achieve consecutive query efficiency improvements and fast index maintenance. Finally, to further increase query throughput, we design tree-decomposition-based graph partitioning and propose Post-partitioned Multi-stage Hub Labeling (PostMHL), with faster query processing and index updates than PMHL. Experiments on real-world road networks show that our methods outperform state-of-the-art baselines in query throughput, yielding improvements of up to 1-4 orders of magnitude.
A System and Benchmark for LLM-based Q&A on Heterogeneous Data
Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath III, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman
arXiv - CS - Databases · arXiv-2409.05735 · 2024-09-09

Abstract: In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables with data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.
Efficient Rare Temporal Pattern Mining in Time Series
Van Ho Long, Nguyen Ho, Trinh Le Cong, Anh-Vu Dinh-Duc, Tu Nguyen Ngoc
arXiv - CS - Databases · arXiv-2409.05042 · 2024-09-08

Abstract: Time series data from various domains are growing continuously. Extracting and analyzing the temporal patterns in these series can reveal significant insights. Temporal pattern mining (TPM) extends traditional pattern mining by incorporating event time intervals into extracted patterns, enhancing their expressiveness but increasing time and space complexity. One valuable type of temporal pattern is the rare temporal pattern (RTP), which occurs rarely but with high confidence. Mining rare temporal patterns poses several challenges: the support measure must be set very low, which leads to a combinatorial explosion and potentially produces too many uninteresting patterns. Thus, an efficient approach to rare temporal pattern mining is needed. This paper introduces our Rare Temporal Pattern Mining from Time Series (RTPMfTS) method for discovering rare temporal patterns, featuring the following key contributions: (1) an end-to-end RTPMfTS process that takes time series data as input and yields rare temporal patterns as output; (2) an efficient Rare Temporal Pattern Mining (RTPM) algorithm that uses optimized data structures for quick event and pattern retrieval and effective pruning techniques for much faster mining; and (3) a thorough experimental evaluation of RTPM, showing that it outperforms the baseline in terms of runtime and memory usage.
Graph versioning for evolving urban data
Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere
arXiv - CS - Databases · arXiv-2409.04498 · 2024-09-06

Abstract: The continuous evolution of cities poses significant challenges in terms of managing and understanding their complex dynamics. With the increasing demand for transparency and the growing availability of open urban data, it has become important to ensure the reproducibility of scientific research and computations in urban planning. To understand past decisions and other possible scenarios, we require solutions that go beyond the management of urban knowledge graphs. In this work, we explore existing solutions and their limits, and explain the need and possible approaches for querying across multiple graph versions.
ConVer-G: Concurrent versioning of knowledge graphs
Jey Puget Gil, Emmanuel Coquery, John Samuel, Gilles Gesquiere
arXiv - CS - Databases · arXiv-2409.04499 · 2024-09-06

Abstract: The proliferation of platforms offering open data has facilitated access to information that can be used for research, innovation, and decision-making. Providing transparency and availability, open data are regularly updated, allowing us to observe their evolution over time. We are particularly interested in the evolution of urban data, which allows stakeholders to better understand its dynamics and propose solutions to improve the quality of life of citizens. In this context, we are interested in the management of evolving data, especially urban data, and in the ability to query these data across the available versions. To understand our urban heritage and propose new scenarios, we must be able to search for knowledge through concurrent versions of urban knowledge graphs. In this work, we present the ConVer-G (Concurrent Versioning of knowledge Graphs) system for storing and querying multiple concurrent versions of graphs.
AnyMatch -- Efficient Zero-Shot Entity Matching with a Small Language Model
Zeyu Zhang, Paul Groth, Iacer Calixto, Sebastian Schelter
arXiv - CS - Databases · arXiv-2409.04073 · 2024-09-06

Abstract: Entity matching (EM) is the problem of determining whether two records refer to the same real-world entity, which is crucial in data integration, e.g., for product catalogs or address databases. A major drawback of many EM approaches is their dependence on labelled examples. We thus focus on the challenging setting of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Recently, large language models (LLMs) have shown promising results for zero-shot EM, but their low throughput and high deployment cost limit their applicability and scalability. We revisit the zero-shot EM problem with AnyMatch, a small language model fine-tuned in a transfer learning setup. We propose several novel data selection techniques to generate fine-tuning data for our model, e.g., by selecting difficult pairs to match via an AutoML filter, by generating additional attribute-level examples, and by controlling label imbalance in the data. We conduct an extensive evaluation of the prediction quality and deployment cost of our model, comparing it to thirteen baselines on nine benchmark datasets. We find that AnyMatch provides competitive prediction quality despite its small parameter size: it achieves the second-highest F1 score overall and outperforms several other approaches that employ models with hundreds of billions of parameters. Furthermore, our approach exhibits major cost benefits: the average prediction quality of AnyMatch is within 4.4% of the state-of-the-art method MatchGPT with the proprietary trillion-parameter model GPT-4, yet AnyMatch requires four orders of magnitude fewer parameters and incurs a 3,899 times lower inference cost (in dollars per 1,000 tokens).
{"title":"Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models","authors":"Malte Luttermann, Ralf Möller, Mattis Hartwig","doi":"arxiv-2409.04194","DOIUrl":"https://doi.org/arxiv-2409.04194","url":null,"abstract":"Probabilistic relational models provide a well-established formalism to\u0000combine first-order logic and probabilistic models, thereby allowing to\u0000represent relationships between objects in a relational domain. At the same\u0000time, the field of artificial intelligence requires increasingly large amounts\u0000of relational training data for various machine learning tasks. Collecting\u0000real-world data, however, is often challenging due to privacy concerns, data\u0000protection regulations, high costs, and so on. To mitigate these challenges,\u0000the generation of synthetic data is a promising approach. In this paper, we\u0000solve the problem of generating synthetic relational data via probabilistic\u0000relational models. In particular, we propose a fully-fledged pipeline to go\u0000from relational database to probabilistic relational model, which can then be\u0000used to sample new synthetic relational data points from its underlying\u0000probability distribution. As part of our proposed pipeline, we introduce a\u0000learning algorithm to construct a probabilistic relational model from a given\u0000relational database.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
arXiv - CS - Databases · arXiv-2409.04475 · 2024-09-05

Abstract: The development of Large Language Models (LLMs) has revolutionized Q&A across various industries, including the database domain. However, there is still a lack of a comprehensive benchmark for evaluating the capabilities of different LLMs and their modular components in database Q&A. To this end, we introduce DQA, the first comprehensive database Q&A benchmark. DQA features an innovative LLM-based method for automating the generation, cleaning, and rewriting of database Q&A, resulting in over 240,000 Q&A pairs in English and Chinese. These Q&A pairs cover nearly all aspects of database knowledge, including database manuals, database blogs, and database tools. This inclusion allows for additional assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database Q&A task. Furthermore, we propose a comprehensive LLM-based database Q&A testbed on DQA. This testbed is highly modular and scalable, with both basic and advanced components such as Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). In addition, DQA provides a complete evaluation pipeline, featuring diverse metrics and a standardized evaluation process to ensure comprehensiveness, accuracy, and fairness. We use DQA to comprehensively evaluate database Q&A capabilities under the proposed testbed. The evaluation reveals findings such as (i) the strengths and limitations of nine different LLM-based Q&A bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). We hope our benchmark and findings will better guide the future development of LLM-based database Q&A research.
A Comprehensive Survey of Blockchain Scalability: Shaping Inner-Chain and Inter-Chain Perspectives
Baochao Chen, Liyuan Ma, Hao Xu, Juncheng Ma, Dengcheng Hu, Xiulong Liu, Jie Wu, Jianrong Wang, Keqiu Li
arXiv - CS - Databases · arXiv-2409.02968 · 2024-09-04

Abstract: Blockchain is widely applied in logistics, finance, and agriculture. As the number of users on a single blockchain grows, scalability becomes crucial. However, existing works lack a comprehensive summary of blockchain scalability; they focus on single chains or cross-chain technologies. This survey summarizes scalability across the physical and logical layers, as well as along inner-chain, inter-chain, and technology dimensions. The physical layer covers data and protocols, while the logical layer represents blockchain architecture. Each component is analyzed from inner-chain and inter-chain perspectives, considering technological factors. The aim is to enhance researchers' understanding of blockchain's architecture, data, and protocols in order to advance scalability research.
{"title":"Auditable and reusable crosswalks for fast, scaled integration of scattered tabular data","authors":"Gavin Chait","doi":"arxiv-2409.01517","DOIUrl":"https://doi.org/arxiv-2409.01517","url":null,"abstract":"This paper presents an open-source curatorial toolkit intended to produce\u0000well-structured and interoperable data. Curation is divided into discrete\u0000components, with a schema-centric focus for auditable restructuring of complex\u0000and scattered tabular data to conform to a destination schema. Task separation\u0000allows development of software and analysis without source data being present.\u0000Transformations are captured as high-level sequential scripts describing\u0000schema-to-schema mappings, reducing complexity and resource requirements.\u0000Ultimately, data are transformed, but the objective is that any data meeting a\u0000schema definition can be restructured using a crosswalk. The toolkit is\u0000available both as a Python package, and as a 'no-code' visual web application.\u0000A visual example is presented, derived from a longitudinal study where\u0000scattered source data from hundreds of local councils are integrated into a\u0000single database.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}