{"title":"Process Trace Querying using Knowledge Graphs and Notation3","authors":"William Van Woensel","doi":"arxiv-2409.04452","DOIUrl":"https://doi.org/arxiv-2409.04452","url":null,"abstract":"In process mining, a log exploration step allows making sense of the event traces; e.g., identifying event patterns and illogical traces, and gaining insight into their variability. To support expressive log exploration, the event log can be converted into a Knowledge Graph (KG), which can then be queried using general-purpose languages. We explore the creation of a semantic KG using the Resource Description Framework (RDF) as a data model, combined with the general-purpose Notation3 (N3) rule language for querying. We show how typical trace querying constraints, inspired by the state of the art, can be implemented in N3. We convert case- and object-centric event logs into a trace-based semantic KG; OCEL2 logs are hereby \"flattened\" into traces based on object paths through the KG. This solution offers (a) expressivity, as queries can instantiate constraints in multiple ways and arbitrarily constrain attributes and relations (e.g., actors, resources); (b) flexibility, as OCEL2 event logs can be serialized as traces in arbitrary ways based on the KG; and (c) extensibility, as others can extend our library by leveraging the same implementation patterns.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-variable Quantification of BDDs in External Memory using Nested Sweeping (Extended Paper)","authors":"Steffan Christ Sølvsten, Jaco van de Pol","doi":"arxiv-2408.14216","DOIUrl":"https://doi.org/arxiv-2408.14216","url":null,"abstract":"Previous research on the Adiar BDD package has been successful at designing algorithms capable of handling large Binary Decision Diagrams (BDDs) stored in external memory. To do so, it uses consecutive sweeps through the BDDs to resolve computations. Yet, this approach has kept algorithms for multi-variable quantification, the relational product, and variable reordering out of its scope. In this work, we address this by introducing the nested sweeping framework. Here, multiple concurrent sweeps pass information between each other to compute the result. We have implemented the framework in Adiar and used it to create a new external memory multi-variable quantification algorithm. Compared to conventional depth-first implementations, Adiar with nested sweeping is able to solve more instances of our benchmarks and/or solve them faster.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"$\\boldsymbol{Steiner}$-Hardness: A Query Hardness Measure for Graph-Based ANN Indexes","authors":"Zeyu Wang, Qitong Wang, Xiaoxing Cheng, Peng Wang, Themis Palpanas, Wei Wang","doi":"arxiv-2408.13899","DOIUrl":"https://doi.org/arxiv-2408.13899","url":null,"abstract":"Graph-based indexes have been widely employed to accelerate approximate similarity search of high-dimensional vectors. However, the performance of graph indexes in answering different queries varies vastly, leading to an unstable quality of service for downstream applications. This necessitates an effective measure to test query hardness on graph indexes. Nonetheless, popular distance-based hardness measures like LID lose their effectiveness because they ignore the graph structure. In this paper, we propose $Steiner$-hardness, a novel connection-based graph-native query hardness measure. Specifically, we first propose a theoretical framework to analyze the minimum query effort on graph indexes and then define $Steiner$-hardness as the minimum effort on a representative graph. Moreover, we prove that our $Steiner$-hardness is highly relevant to the classical Directed $Steiner$ Tree (DST) problems. In this case, we design a novel algorithm to reduce our problem to DST problems and then leverage their solvers to help calculate $Steiner$-hardness efficiently. Compared with LID and other similar measures, $Steiner$-hardness shows a significantly better correlation with the actual query effort on various datasets. Additionally, an unbiased evaluation designed based on $Steiner$-hardness reveals new ranking results, indicating a meaningful direction for enhancing the robustness of graph indexes. This paper is accepted by PVLDB 2025.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Converged Relational-Graph Optimization Framework","authors":"Yunkai Lou, Longbin Lai, Bingqing Lyu, Yufan Yang, Xiaoli Zhou, Wenyuan Yu, Ying Zhang, Jingren Zhou","doi":"arxiv-2408.13480","DOIUrl":"https://doi.org/arxiv-2408.13480","url":null,"abstract":"The recent ISO SQL:2023 standard adopts SQL/PGQ (Property Graph Queries), facilitating graph-like querying within relational databases. This advancement, however, underscores a significant gap in how to effectively optimize SQL/PGQ queries within relational database systems. To address this gap, we extend the foundational SPJ (Select-Project-Join) queries to SPJM queries, which include an additional matching operator for representing graph pattern matching in SQL/PGQ. Although SPJM queries can be converted to SPJ queries and optimized using existing relational query optimizers, our analysis shows that such a graph-agnostic method fails to benefit from graph-specific optimization techniques found in the literature. To address this issue, we develop a converged relational-graph optimization framework called RelGo for optimizing SPJM queries, leveraging joint efforts from both relational and graph query optimizations. Using DuckDB as the underlying relational execution engine, our experiments show that RelGo can generate efficient execution plans for SPJM queries. On well-established benchmarks, these plans exhibit an average speedup of 21.90$\\times$ compared to those produced by the graph-agnostic optimizer.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Targeted Least Cardinality Candidate Key for Relational Databases","authors":"Vasileios Nakos, Hung Q. Ngo, Charalampos E. Tsourakakis","doi":"arxiv-2408.13540","DOIUrl":"https://doi.org/arxiv-2408.13540","url":null,"abstract":"Functional dependencies (FDs) are a central theme in databases, playing a major role in the design of database schemas and the optimization of queries. In this work, we introduce the {\\it targeted least cardinality candidate key problem} (TCAND). This problem is defined over a set of functional dependencies $F$ and a target variable set $T \\subseteq V$, and it aims to find the smallest set $X \\subseteq V$ such that the FD $X \\to T$ can be derived from $F$. The TCAND problem generalizes the well-known NP-hard problem of finding the least cardinality candidate key~\\cite{lucchesi1978candidate}, which has been previously demonstrated to be at least as difficult as the set cover problem. We present an integer programming (IP) formulation for the TCAND problem, analogous to a layered set cover problem. We analyze its linear programming (LP) relaxation from two perspectives: we propose two approximation algorithms and investigate the integrality gap. Our findings indicate that the approximation upper bounds for our algorithms are not significantly improvable through LP rounding, a notable distinction from the standard set cover problem. Additionally, we discover that a generalization of the TCAND problem is equivalent to a variant of the set cover problem, named red-blue set cover~\\cite{carr1999red}, which cannot be approximated within a sub-polynomial factor in polynomial time under plausible conjectures~\\cite{chlamtavc2023approximating}. Despite the extensive history surrounding the issue of identifying the least cardinality candidate key, our research contributes new theoretical insights, novel algorithms, and demonstrates that the general TCAND problem poses complexities beyond those encountered in the set cover problem.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GNN: Graph Neural Network and Large Language Model Based for Data Discovery","authors":"Thomas Hoang","doi":"arxiv-2408.13609","DOIUrl":"https://doi.org/arxiv-2408.13609","url":null,"abstract":"Our algorithm GNN: Graph Neural Network and Large Language Model Based for Data Discovery inherits the benefits of \\cite{hoang2024plod} (PLOD: Predictive Learning Optimal Data Discovery) and \\cite{Hoang2024BODBO} (BOD: Blindly Optimal Data Discovery) in overcoming the challenges of having to predefine a utility function and of requiring human input for attribute ranking, which helps avoid a time-consuming loop process. Beyond these previous works, our algorithm GNN leverages the advantages of graph neural networks and large language models to understand text-type values that cannot be understood by PLOD and BOD, thus making the task of predicting outcomes more reliable. GNN can be seen as an extension of PLOD in that it understands text-type values and the user's preferences based on not only numerical values but also text values, furthering the promise of data science and analytics.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Aware Uncertainty Reduction in Schema Matching with GPT-4: The Prompt-Matcher Framework","authors":"Longyu Feng, Huahang Li, Chen Jason Zhang","doi":"arxiv-2408.14507","DOIUrl":"https://doi.org/arxiv-2408.14507","url":null,"abstract":"Schema matching is the process of identifying correspondences between the elements of two given schemata, essential for database management systems, data integration, and data warehousing. The inherent uncertainty of current schema matching algorithms leads to the generation of a set of candidate matches. Storing these results necessitates the use of databases and systems capable of handling probabilistic queries. This complicates the querying process and increases the associated storage costs. Motivated by GPT-4's outstanding performance, we explore its potential to reduce uncertainty. Our proposal is to supplant the role of crowdworkers with GPT-4 for querying the set of candidate matches. To get more precise correspondence verification responses from GPT-4, we have crafted Semantic-match and Abbreviation-match prompts for GPT-4, achieving state-of-the-art recall on two benchmark datasets: DeepMDatasets, 100% (+0.0), and Fabricated-Datasets, 91.8% (+2.2). To optimise budget utilisation, we have devised a cost-aware solution. Within the constraints of the budget, our solution delivers favourable outcomes with minimal time expenditure. We introduce a novel framework, Prompt-Matcher, to reduce the uncertainty in integrating multiple automatic schema matching algorithms and selecting complex parameterizations. It assists users in diminishing the uncertainty associated with candidate schema match results and in optimally ranking the most promising matches. We formally define the Correspondence Selection Problem (CSP), aiming to optimise the revenue within the confines of the GPT-4 budget. We demonstrate that CSP is NP-Hard and propose an approximation algorithm with minimal time expenditure. Ultimately, we demonstrate the efficacy of Prompt-Matcher through rigorous experiments.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BIPeC: A Combined Change-Point Analyzer to Identify Performance Regressions in Large-scale Database Systems","authors":"Zhan Lyu, Thomas Bach, Yong Li, Nguyen Minh Le, Lars Hoemke","doi":"arxiv-2408.12414","DOIUrl":"https://doi.org/arxiv-2408.12414","url":null,"abstract":"Performance testing in large-scale database systems like SAP HANA is a crucial yet labor-intensive task, involving extensive manual analysis of thousands of measurements, such as CPU time and elapsed time. Manual maintenance of these metrics is time-consuming and susceptible to human error, making early detection of performance regressions challenging. We address these issues by proposing an automated approach to detect performance regressions in such measurements. Our approach integrates Bayesian inference with the Pruned Exact Linear Time (PELT) algorithm, enhancing the detection of change points and performance regressions with high precision and efficiency compared to previous approaches. Our method minimizes false negatives and ensures SAP HANA's reliability and performance quality. The proposed solution can accelerate testing and contribute to more sustainable performance management practices in large-scale data management environments.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging","authors":"Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik","doi":"arxiv-2408.12733","DOIUrl":"https://doi.org/arxiv-2408.12733","url":null,"abstract":"Text-to-SQL systems, which convert natural language queries into SQL commands, have seen significant progress primarily for the SQLite dialect. However, adapting these systems to other SQL dialects like BigQuery and PostgreSQL remains a challenge due to the diversity in SQL syntax and functions. We introduce SQL-GEN, a framework for generating high-quality dialect-specific synthetic data guided by dialect-specific tutorials, and demonstrate its effectiveness in creating training datasets for multiple dialects. Our approach significantly improves performance, by up to 20%, over previous methods and reduces the gap with large-scale human-annotated datasets. Moreover, combining our synthetic data with human-annotated data provides additional performance boosts of 3.3% to 5.6%. We also introduce a novel Mixture of Experts (MoE) initialization method that integrates dialect-specific models into a unified system by merging self-attention layers and initializing the gates with dialect-specific keywords, further enhancing performance across different SQL dialects.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unlocking Sustainability Compliance: Characterizing the EU Taxonomy for Business Process Management","authors":"Finn Klessascheck, Stephan A. Fahrenkrog-Petersen, Jan Mendling, Luise Pufahl","doi":"arxiv-2408.11386","DOIUrl":"https://doi.org/arxiv-2408.11386","url":null,"abstract":"To promote sustainable business practices, and to achieve climate neutrality by 2050, the EU has developed the taxonomy of sustainable activities, which describes when exactly business practices can be considered sustainable. While the taxonomy has only been recently established, progressively more companies will have to report how much of their revenue was created via sustainably executed business processes. To help companies prepare to assess whether their business processes comply with the constraints outlined in the taxonomy, we investigate to what extent these criteria can be used for conformance checking, that is, assessing in a data-driven manner whether business process executions adhere to regulatory constraints. For this, we develop a few-shot learning pipeline to characterize the constraints of the taxonomy, with the help of an LLM, as to the process dimensions they relate to. We find that many constraints of the taxonomy are usable for conformance checking, particularly in the sectors of energy, manufacturing, and transport. This will aid companies in preparing to monitor regulatory compliance with the taxonomy automatically, by characterizing what kind of information they need to extract, and by providing a better understanding of sectors where such an assessment is feasible and where it is not.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}