{"title":"EHL*: Memory-Budgeted Indexing for Ultrafast Optimal Euclidean Pathfinding","authors":"Jinchun Du, Bojie Shen, Muhammad Aamir Cheema","doi":"arxiv-2408.11341","DOIUrl":"https://doi.org/arxiv-2408.11341","url":null,"abstract":"The Euclidean Shortest Path Problem (ESPP), which involves finding the shortest path in a Euclidean plane with polygonal obstacles, is a classic problem with numerous real-world applications. The current state-of-the-art solution, Euclidean Hub Labeling (EHL), offers ultra-fast query performance, outperforming existing techniques by 1-2 orders of magnitude in runtime efficiency. However, this performance comes at the cost of significant memory overhead, requiring up to tens of gigabytes of storage on large maps, which can limit its applicability in memory-constrained environments like mobile phones or smaller devices. Additionally, EHL's memory usage can only be determined after index construction, and while it provides a memory-runtime tradeoff, it does not fully optimize memory utilization. In this work, we introduce an improved version of EHL, called EHL*, which overcomes these limitations. A key contribution of EHL* is its ability to create an index that adheres to a specified memory budget while optimizing query runtime performance. Moreover, EHL* can leverage preknown query distributions, a common scenario in many real-world applications, to further enhance runtime efficiency. Our results show that EHL* can reduce memory usage by up to 10-20 times without much impact on query runtime performance compared to EHL, making it a highly effective solution for optimal pathfinding in memory-constrained environments.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Privacy-Preserving Data Management using Blockchains","authors":"Michael Mireku Kwakye","doi":"arxiv-2408.11263","DOIUrl":"https://doi.org/arxiv-2408.11263","url":null,"abstract":"Privacy-preservation policies are guidelines formulated to protect data providers' private data. Previous privacy-preservation methodologies have addressed privacy in which data are permanently stored in repositories and disconnected from changing data provider privacy preferences. This occurrence becomes evident as data moves to another data repository. Hence, the need for data providers to control and flexibly update their existing privacy preferences due to changing data usage continues to remain a problem. This paper proposes a blockchain-based methodology for preserving data providers' private and sensitive data. The research proposes to tightly couple data providers' private attribute data element to privacy preferences and data accessor data element into a privacy tuple. The implementation presents a framework of tightly-coupled relational database and blockchains. This delivers a secure, tamper-resistant, and query-efficient platform for data management and query processing. The evaluation analysis from the implementation validates efficient query processing of privacy-aware queries on the privacy infrastructure.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Story Behind the Lines: Line Charts as a Gateway to Dataset Discovery","authors":"Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper","doi":"arxiv-2408.09506","DOIUrl":"https://doi.org/arxiv-2408.09506","url":null,"abstract":"Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, access to the underlying dataset behind a line chart is rarely readily available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts, focusing on the use of line charts as queries to discover datasets within a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach called Fine-grained Cross-modal Relevance Learning Model (FCM), which aims to estimate the relevance between a line chart and a candidate dataset. To achieve this goal, FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Then, two novel segment-level encoders are adopted to learn representations for a line chart and a dataset, preserving fine-grained information, followed by a cross-modal matcher to match the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated based on data aggregation. Last, we propose a benchmark tailored for this problem since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our proposed approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The temporal conceptual data modelling language TREND","authors":"Sonia Berman, C. Maria Keet, Tamindran Shunmugam","doi":"arxiv-2408.09427","DOIUrl":"https://doi.org/arxiv-2408.09427","url":null,"abstract":"Temporal conceptual data modelling, as an extension to regular conceptual data modelling languages such as EER and UML class diagrams, has received intermittent attention across the decades. It is receiving renewed interest in the context of, among others, business process modelling that needs robust expressive data models to complement them. None of the proposed temporal conceptual data modelling languages have been tested on understandability and usability by modellers, however, nor is it clear which temporal constraints would be used by modellers or whether the ones included are the relevant temporal constraints. We therefore sought to investigate temporal representations in temporal conceptual data modelling languages, design a, to date, most expressive language, TREND, through small-scale qualitative experiments, and finalise the graphical notation and modelling and understanding in large scale experiments. This involved a series of 11 experiments with over a thousand participants in total, having created 246 temporal conceptual data models. Key outcomes are that choice of label for transition constraints had limited impact, as did extending explanations of the modelling language, but expressing what needs to be modelled in controlled natural language did improve model quality. The experiments also indicate that more training may be needed, in particular guidance for domain experts, to achieve adoption of temporal conceptual data modelling by the community.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NFDI4DSO: Towards a BFO Compliant Ontology for Data Science","authors":"Genet Asefa Gesese, Jörg Waitelonis, Zongxiong Chen, Sonja Schimmler, Harald Sack","doi":"arxiv-2408.08698","DOIUrl":"https://doi.org/arxiv-2408.08698","url":null,"abstract":"The NFDI4DataScience (NFDI4DS) project aims to enhance the accessibility and interoperability of research data within Data Science (DS) and Artificial Intelligence (AI) by connecting digital artifacts and ensuring they adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) principles. To this end, this poster introduces the NFDI4DS Ontology, which describes resources in DS and AI and models the structure of the NFDI4DS consortium. Built upon the NFDICore ontology and mapped to the Basic Formal Ontology (BFO), this ontology serves as the foundation for the NFDI4DS knowledge graph currently under development.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The (Elementary) Mathematical Data Model Revisited","authors":"Christian Mancas","doi":"arxiv-2408.08367","DOIUrl":"https://doi.org/arxiv-2408.08367","url":null,"abstract":"This paper presents the current version of our (Elementary) Mathematical Data Model ((E)MDM), which is based on the naïve theory of sets, relations, and functions, as well as on the first-order predicate calculus with equality. Many real-life examples illustrate its 4 types of sets, 4 types of functions, and 76 types of constraints. This rich panoply of constraints is the main strength of this model, guaranteeing that any data value stored in a database is plausible, which is the highest possible level of syntactical data quality. A (E)MDM example scheme is presented and contrasted with some popular family tree software products. The paper also presents the main (E)MDM related approaches in data modeling and processing.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DataVisT5: A Pre-trained Language Model for Jointly Understanding Text and Data Visualization","authors":"Zhuoyue Wan, Yuanfeng Song, Shuaimin Li, Chen Jason Zhang, Raymond Chi-Wing Wong","doi":"arxiv-2408.07401","DOIUrl":"https://doi.org/arxiv-2408.07401","url":null,"abstract":"Data visualization (DV) is the fundamental and premise tool to improve the efficiency in conveying the insights behind the big data, which has been widely accepted in the existing data-driven world. Task automation in DV, such as converting natural language queries to visualizations (i.e., text-to-vis), generating explanations from visualizations (i.e., vis-to-text), answering DV-related questions in free form (i.e., FeVisQA), and explicating tabular data (i.e., table-to-text), is vital for advancing the field. Despite their potential, the application of pre-trained language models (PLMs) like T5 and BERT in DV has been limited by high costs and challenges in handling cross-modal information, leading to few studies on PLMs for DV. We introduce DataVisT5, a novel PLM tailored for DV that enhances the T5 architecture through a hybrid objective pre-training and multi-task fine-tuning strategy, integrating text and DV datasets to effectively interpret cross-modal semantics. Extensive evaluations on public datasets show that DataVisT5 consistently outperforms current state-of-the-art models on various DV-related tasks. We anticipate that DataVisT5 will not only inspire further research on vertical PLMs but also expand the range of applications for PLMs.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"440 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QirK: Question Answering via Intermediate Representation on Knowledge Graphs","authors":"Jan Luca Scheerer, Anton Lykov, Moe Kayali, Ilias Fountalis, Dan Olteanu, Nikolaos Vasiloglou, Dan Suciu","doi":"arxiv-2408.07494","DOIUrl":"https://doi.org/arxiv-2408.07494","url":null,"abstract":"We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR). The input question is mapped to IR using LLMs, which is then repaired into a valid relational database query with the aid of a semantic search on vector embeddings. This allows a practical synthesis of LLM capabilities and KG reliability. A short video demonstrating QirK is available at https://youtu.be/6c81BLmOZ0U.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Re-Thinking Process Mining in the AI-Based Agents Era","authors":"Alessandro Berti, Mayssa Maatallah, Urszula Jessen, Michal Sroka, Sonia Ayachi Ghannouchi","doi":"arxiv-2408.07720","DOIUrl":"https://doi.org/arxiv-2408.07720","url":null,"abstract":"Large Language Models (LLMs) have emerged as powerful conversational interfaces, and their application in process mining (PM) tasks has shown promising results. However, state-of-the-art LLMs struggle with complex scenarios that demand advanced reasoning capabilities. In the literature, two primary approaches have been proposed for implementing PM using LLMs: providing textual insights based on a textual abstraction of the process mining artifact, and generating code executable on the original artifact. This paper proposes utilizing the AI-Based Agents Workflow (AgWf) paradigm to enhance the effectiveness of PM on LLMs. This approach allows for: i) the decomposition of complex tasks into simpler workflows, and ii) the integration of deterministic tools with the domain knowledge of LLMs. We examine various implementations of AgWf and the types of AI-based tasks involved. Additionally, we discuss the CrewAI implementation framework and present examples related to process mining.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ASPEN: ASP-Based System for Collective Entity Resolution","authors":"Zhiliang Xiang, Meghyn Bienvenu, Gianluca Cima, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García","doi":"arxiv-2408.06961","DOIUrl":"https://doi.org/arxiv-2408.06961","url":null,"abstract":"In this paper, we present ASPEN, an answer set programming (ASP) implementation of a recently proposed declarative framework for collective entity resolution (ER). While an ASP encoding had been previously suggested, several practical issues had been neglected, most notably, the question of how to efficiently compute the (externally defined) similarity facts that are used in rule bodies. This leads us to propose new variants of the encodings (including Datalog approximations) and show how to employ different functionalities of ASP solvers to compute (maximal) solutions, and (approximations of) the sets of possible and certain merges. A comprehensive experimental evaluation of ASPEN on real-world datasets shows that the approach is promising, achieving high accuracy in real-life ER scenarios. Our experiments also yield useful insights into the relative merits of different types of (approximate) ER solutions, the impact of recursion, and factors influencing performance.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}