{"title":"Semantic-aware query answering with Large Language Models","authors":"Paolo Atzeni , Teodoro Baldazzi , Luigi Bellomarini , Eleonora Laurenza , Emanuel Sallinger","doi":"10.1016/j.datak.2025.102494","DOIUrl":"10.1016/j.datak.2025.102494","url":null,"abstract":"<div><div>In the modern data-driven world, answering queries over heterogeneous and semantically inconsistent data remains a significant challenge. Modern datasets originate from diverse sources, such as relational databases, semi-structured repositories, and unstructured documents, leading to substantial variability in schemas, terminologies, and data formats. Traditional systems, constrained by rigid syntactic matching and strict data binding, struggle to capture critical semantic connections and schema ambiguities, failing to meet the growing demand among data scientists for advanced forms of flexibility and context-awareness in query answering. In parallel, the advent of Large Language Models (LLMs) has introduced new capabilities in natural language interpretation, making them highly promising for addressing such challenges. However, LLMs alone lack the systematic rigor and explainability required for robust query processing and decision-making in high-stakes domains. In this paper, we propose Soft Query Answering (Soft QA), a novel hybrid approach that integrates LLMs as an intermediate semantic layer within the query processing pipeline. Soft QA enhances query answering adaptability and flexibility by injecting semantic understanding through context-aware, schema-informed prompts, and leverages LLMs to semantically link entities, resolve ambiguities, and deliver accurate query results in complex settings. We demonstrate its practical effectiveness through real-world examples, highlighting its ability to resolve semantic mismatches and improve query outcomes without requiring extensive data cleaning or restructuring.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102494"},"PeriodicalIF":2.7,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144830326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Granularity History Graph Network for temporal knowledge graph reasoning","authors":"Jun Zhu , Yan Fu , Junlin Zhou , Duanbing Chen","doi":"10.1016/j.datak.2025.102496","DOIUrl":"10.1016/j.datak.2025.102496","url":null,"abstract":"<div><div>Reasoning on knowledge graphs (KGs) can be categorized into two main categories: predicting missing facts and predicting unknown facts in the future. However, when it comes to future prediction, it becomes crucial to incorporate temporal information and add timestamps to KGs, thereby forming temporal knowledge graphs (TKGs). The key aspect of reasoning lies in treating a TKG as a sequence of static KGs in order to effectively grasp temporal information. Additionally, it is equally important to consider the evolution of facts from various perspectives. Existing models tend to replicate the original time granularity of data while modeling TKGs, often disregarding the impact of the minimum time period in the evolution process. Furthermore, historical information is typically perceived as a single sequence of facts, with a lack of diversity in strategies (e.g., modeling sequences with varying granularities or lengths) to capture complex temporal dynamics. This unified approach may lead to the loss of critical information during the modeling process. However, the process of historical evolution often exhibits complex periodic transformation characteristics, and associated events do not necessarily follow a fixed time period. Therefore, a single granularity is insufficient to model periodic events with dynamic changes in history. Consequently, we propose the Multi-Granularity History Graph Network (MGHGN), an innovative model for TKG reasoning. MGHGN dynamically models various event evolution periods by constructing representations with multiple time granularities, and integrates various modeling methods to reason the potential facts in the future. Our model adeptly captures valuable insights from the history of multi-granularity and employs diverse approaches to model historical information. The experimental results on six benchmark datasets demonstrate that the MGHGN outperforms state-of-the-art methods.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102496"},"PeriodicalIF":2.7,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144771633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancing credit risk assessment in the retail banking industry: A hybrid approach using time series and supervised learning models","authors":"Sebastian H. Goldmann, Marcos R. Machado, Joerg R. Osterrieder","doi":"10.1016/j.datak.2025.102490","DOIUrl":"10.1016/j.datak.2025.102490","url":null,"abstract":"<div><div>Credit risk assessment remains a central challenge in retail banking, with conventional models often falling short in predictive accuracy and adaptability to granular customer behavior. This study explores the potential of Time Series Classification (TSC) algorithms to enhance credit risk modeling by analyzing customers’ historical end-of-day balance data. We compare traditional Machine Learning (ML) models – including Logistic Regression and XGBoost – with advanced TSC methods such as Shapelets, Long Short-Term Memory (LSTM) networks, and Canonical Interval Forests (CIF). Our results show that TSC algorithms, particularly CIF and Shapelet-based methods, significantly outperform traditional approaches. When using CIF-derived Probability of Default (PD) estimates as additional features in an XGBoost model, predictive performance improved notably: the combined model achieved an Area under the Curve (AUC) of 0.81, compared to 0.79 for CIF alone and 0.77 for XGBoost without the CIF input. These findings underscore the value of integrating temporal features into credit risk assessment frameworks. Moreover, the complementary strengths of the TSC and XGBoost models across different Receiver Operating Characteristic (ROC) curve regions demonstrate the practical benefits of model stacking. However, performance dropped when using aggregated monthly data, highlighting the importance of preserving high-frequency behavioral signals. This research contributes to more accurate, interpretable, and robust credit risk models and offers a pathway for banks to leverage time series data for improved risk forecasting.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102490"},"PeriodicalIF":2.7,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144711634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TEDA-driven adaptive stream clustering for concept drift detection","authors":"Zahra Rezaei , Hedieh Sajedi","doi":"10.1016/j.datak.2025.102484","DOIUrl":"10.1016/j.datak.2025.102484","url":null,"abstract":"<div><div>The rapid growth of data-driven applications has underlined the need for strong methods to analyze and cluster streaming data. Data stream clustering is envisioned to uncover interesting knowledge concealed within data streams, typically fast, structure- and pattern-evolving. However, most current methods suffer significant challenges like the inability to detect clusters with arbitrarily shaped, handling outliers, adaptation to concept drift, and reducing dependency on predefined parameters. To tackle these challenges, we propose a novel Typicality and Eccentricity Data Analysis (TEDA)-based concept drift detection stream clustering algorithm, which can divide the clustering problem into two subproblems, micro-clusters and macro-clusters. Our methodology utilizes a TEDA-based concept drift detection approach to enhance data stream clustering. Our method employs two models in monitoring the data stream to keep the information of a previous concept while tracking the emergence of a new concept. The models represent two distinct concepts when the intersection of data samples is significantly low, as described by the Jaccard Index. TEDA-CDD is compared to known methods from the literature in experiments using synthetic and real-world datasets simulating real-world applications. By dynamically updating clusters through model reuse or creation, our algorithm ensures adaptability to real-time changes in data distributions. The proposed algorithm was comprehensively evaluated using the KDDCup-99 dataset, an intrusion detection system benchmark under diverse scenarios, including concept drifts, evolving data distributions, varying cluster sizes, and outlier conditions. Empirical results demonstrated the algorithm’s superiority over baseline approaches such as DenStream, DStream, ClusTree, and DGStream, achieving perfect performance metrics. These findings emphasize the effectiveness of our algorithm in addressing real-world streaming data challenges, combining high sensitivity to concept drift with computational efficiency, adaptability, and robust clustering capabilities.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102484"},"PeriodicalIF":2.7,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144712898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting Sound Multi-Level Modeling—Specification and Implementation of a Multi-Dimensional Modeling Approach","authors":"Thomas Kühne , Manfred A. Jeusfeld","doi":"10.1016/j.datak.2025.102481","DOIUrl":"10.1016/j.datak.2025.102481","url":null,"abstract":"<div><div>Multiple levels of classification naturally occur in many domains. Several multi-level modeling approaches account for this, and a subset of them attempt to provide their users with sanity-checking mechanisms in order to guard them against conceptually ill-formed models. Historically, the respective multi-level well-formedness schemes have either been overly restrictive or too lax. Orthogonal Ontological Classification has been proposed as a foundation for sound multi-level modeling that combines the selectivity of strict schemes with the flexibility afforded by laxer schemes. In this article, we present the second iteration of a formalization of Orthogonal Ontological Classification, which we empirically validated to demonstrate some of its hitherto only postulated claims using an implementation in <span>ConceptBase</span>. We discuss the expressiveness of the formal language used, <span>ConceptBase</span>’s evaluation efficiency, and the usability of our realization based on a digital twin example model.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102481"},"PeriodicalIF":2.7,"publicationDate":"2025-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144780865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inference-based schema discovery for RDF data","authors":"Redouane Bouhamoum , Zoubida Kedad , Stéphane Lopes","doi":"10.1016/j.datak.2025.102491","DOIUrl":"10.1016/j.datak.2025.102491","url":null,"abstract":"<div><div>The Semantic Web represents a huge information space where an increasing number of datasets, described in RDF, are made available to users and applications. In this context, the data is not constrained by a predefined schema. In RDF datasets, the schema may be incomplete or even missing. While this offers high flexibility in creating data sources, it also makes their use difficult. Several works have addressed the problem of automatic schema discovery for RDF datasets, but existing approaches rely only on the explicit information provided by the data source, which may limit the quality of the results. Indeed, in an RDF data source, an entity is described by explicitly declared properties, but also by implicit properties that can be derived using reasoning rules. These implicit properties are not considered by existing schema discovery approaches.</div><div>In this work, we propose a first contribution towards a hybrid schema discovery approach capable of exploiting all the semantics of a data source, which is represented not only by the explicitly declared triples, but also by the ones that can be inferred through reasoning. By considering both explicit and implicit properties, the quality of the generated schema is improved. We provide a scalable design of our approach to enable the processing of large RDF data sources while improving the quality of the results. We present some experiments which demonstrate the efficiency of our proposal and the quality of the discovered schema.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102491"},"PeriodicalIF":2.7,"publicationDate":"2025-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144670538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data and knowledge engineering: Insights from forty years of publication","authors":"Jacky Akoka , Isabelle Comyn-Wattiau , Nicolas Prat , Veda C. Storey","doi":"10.1016/j.datak.2025.102492","DOIUrl":"10.1016/j.datak.2025.102492","url":null,"abstract":"<div><div>The journal, <em>Data and Knowledge Engineering (DKE),</em> first published by Elsevier in 1985, has now been in existence for forty years. This journal has evolved and matured to play an important role in establishing and progressing research on conceptual modeling and related areas. To accurately characterize the history and current state of the research contributions and their impact, we analyze its publications in three phases, by employing bibliometric techniques of co-citation, bibliographic coupling, main path analysis, and topic modeling. Using descriptive bibliometrics, the results from the first phase provide an overview of the articles that have been published in the journal. It analyzes the dynamics and trend patterns of publications, specifically, their main topics and contributions. Using bibliometric mapping, the second phase identifies the journal's intellectual structure, its primary research themes, and the pathways through which knowledge is disseminated between the most influential articles. The third phase entails a comparison of DKE with other scientific journals that share at least some of its scope. In addition to delineating the strengths of DKE, we provide insights into how DKE might continue to evolve and progress the contributions to the field.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102492"},"PeriodicalIF":2.7,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144780864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting and repairing anomaly patterns in business process event logs","authors":"Jonghyeon Ko , Marco Comuzzi , Fabrizio Maria Maggi","doi":"10.1016/j.datak.2025.102488","DOIUrl":"10.1016/j.datak.2025.102488","url":null,"abstract":"<div><div>Event log anomaly detection and log repairing concern the identification of anomalous traces in an event log and the reconstruction of a correct trace for the anomalous ones, respectively. Trace-level anomalies in event logs often appear according to specific patterns, such events inserted, repeated, or skipped. This paper proposes P-BEAR (Pattern-Based Event Log Anomaly Reconstruction), a semi-supervised pattern-based anomaly detection and log repairing approach that exploits the pattern-based nature of trace-level anomalies in event logs. P-BEAR captures, in a set of ad-hoc graphs, the behaviour of clean traces in a log and uses these to identify anomalous traces, determine the specific anomaly pattern that applies to them, and then reconstruct the correct trace. The proposed approach is evaluated using artificial and real event logs against traditional trace alignment in conformance checking, the edit distance-based alignment method, and an unsupervised method based on deep learning. Overall, the proposed method outperforms the alignment method in anomalous trace reconstruction while providing interpretability with respect to anomaly pattern classification. P-BEAR is also quicker to execute, and its performance is more balanced across different types of anomaly patterns.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102488"},"PeriodicalIF":2.7,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144656577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}