{"title":"RankT: Ranking-Triplets-based adversarial learning for knowledge graph link prediction","authors":"Jinlei Zhu, Xin Zhang, Xin Ding","doi":"10.1016/j.datak.2025.102463","DOIUrl":"10.1016/j.datak.2025.102463","url":null,"abstract":"<div><div>Aiming at completing the missing edges between entities in the knowledge graph, many state-of-the-art models are proposed to predict the links. Those models mainly focus on predicting the link score between source and target entities with certain relations, but ignore the similarities or differences of the whole meanings of triplets in different subgraphs. However, the triplets interact with each other in different ways and the link prediction model may lack interaction. In other word, the link prediction is superimposed with potential triplet uncertainties. To address this issue, we propose a Ranking-Triplet-based uncertainty adversarial learning (RankT) framework to improve the embedding representation of triplet for link prediction. Firstly, the proposed model calculates the node and edge embeddings by the node-level and edge-level neighborhood aggregation respectively, and then fuses the embeddings by a self-attention transformer to gain the interactive embedding of the triplet. Secondly, to reduce the uncertainty of the probability distribution of predicted links, a ranking-triplet-based adversarial loss function based on the confrontation of highest certainty and highest uncertainty links is designed. Lastly, to strengthen the stability of the adversarial learning, a ranking-triplet-based consistency loss is designed to make the probability of the highest positive links converge in the same direction. The ablation studies show the effectiveness of each part of the proposed model. The comparison of experimental results shows that our model significantly outperforms the state-of-the-art models. In conclusion, the proposed model improves the link prediction performance while discovering the similar or different meanings of triplets.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102463"},"PeriodicalIF":2.7,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144138084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CredBERT: Credibility-aware BERT model for fake news detection","authors":"Anju R., Nargis Pervin","doi":"10.1016/j.datak.2025.102461","DOIUrl":"10.1016/j.datak.2025.102461","url":null,"abstract":"<div><div>The spread of fake news on social media poses significant challenges, especially in distinguishing credible sources from unreliable ones. Existing methods primarily rely on text analysis, often neglecting user credibility, a key factor in enhancing detection accuracy. To address this, we propose CredBERT, a framework that combines credibility scores derived from user interactions and domain expertise with BERT-based text embeddings. CredBERT employs a multi-classifier ensemble, integrating Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN), BiLSTM, Logistic Regression, and k-Nearest Neighbors, with predictions aggregated using majority voting, ensuring robust performance across both balanced and imbalanced class datasets. This approach effectively merges user credibility with content-based features, improving prediction accuracy and reducing biases. Compared to state-of-the art baselines FakeBERT and BiLSTM, CredBERT achieves 6.45% and 4.21% higher accuracy, respectively. By evaluating user credibility and content features, our model not only enhances fake news detection but also contributes to mitigating misinformation by identifying unreliable sources.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102461"},"PeriodicalIF":2.7,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Philosophical reflections on conceptual modeling as communication","authors":"Mattia Fumagalli , Giancarlo Guizzardi","doi":"10.1016/j.datak.2025.102453","DOIUrl":"10.1016/j.datak.2025.102453","url":null,"abstract":"<div><div>Conceptual modeling is a complex and demanding task. It is a task centered around the challenge of representing a portion of the world in a way that is objective, understandable, shareable, and reusable by a community of practitioners, who rely on models to design and implement software or to clarify the concepts within a given domain. The difficulty of conceptual modeling stems from the inherent limitations of human representation abilities, which cannot fully capture the infinite richness and diversity of the world, nor the endless possibilities for description enabled by language. Significant effort has been invested in addressing these challenges, particularly in the creation of effective and reusable conceptual models, which have presented numerous difficulties. This paper explores conceptual modeling from a philosophical standpoint, proposing that conceptual models should not be viewed merely as the static representational output of an a priori activity, subject to modification only during a preliminary design phase. Instead, they should be seen as dynamic artifacts that require continuous design, adaptation, and evolution from their inception to their application, which may account for multiple purposes. The paper seeks to highlight the importance of understanding conceptual modeling primarily as an act of communication, rather than just a process of information transmission. It also aims to clarify the distinction between these two aspects and to examine the potential implications of adopting a <em>communicative approach to modeling</em>. These implications extend not only to the tools and methodologies used in modeling but also to the ethical considerations that arise from such an approach.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102453"},"PeriodicalIF":2.7,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stefano Rizzi, Matteo Francia, Enrico Gallinucci, Matteo Golfarelli
{"title":"Conceptual design of multidimensional cubes with LLMs: An investigation","authors":"Stefano Rizzi, Matteo Francia, Enrico Gallinucci, Matteo Golfarelli","doi":"10.1016/j.datak.2025.102452","DOIUrl":"10.1016/j.datak.2025.102452","url":null,"abstract":"<div><div>Large Language Models (LLMs) can simulate human linguistic capabilities, thus producing a disruptive impact across several domains, including software engineering. In this paper we focus on a specific scenario of software engineering, that of conceptual design of multidimensional data cubes. The goal is to evaluate the performance of LLMs (precisely, of ChatGPT-4o) in multidimensional conceptual design using the Dimensional Fact Model as a reference. To this end, we formulate nine research questions to (i) understand the competences of ChatGPT in multidimensional conceptual design, following either a supply- or a demand-driven approach, and (ii) investigate to what extent they can be improved via prompt engineering. After describing the research process in terms of base criteria, technological setting, input/output format, prompt templates, test cases, and metrics for evaluating the results, we discuss the output of the experiment. Our main conclusions are that (i) when prompts are enhanced with detailed procedural instructions and examples, the results produced significantly improve in all cases; and (ii) overall, ChatGPT is better at demand-driven design than at supply-driven design.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102452"},"PeriodicalIF":2.7,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143947575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tatsawan Timakum , Soobin Lee , Dongha Kim , Min Song , Il-Yeol Song
{"title":"Four decades of data & knowledge engineering: A bibliometric analysis and topic evolution study (1985–2024)","authors":"Tatsawan Timakum , Soobin Lee , Dongha Kim , Min Song , Il-Yeol Song","doi":"10.1016/j.datak.2025.102462","DOIUrl":"10.1016/j.datak.2025.102462","url":null,"abstract":"<div><div>The Data and Knowledge Engineering (DKE) journal has established a significant global research presence over four decades, substantially contributing to the advancement of data and knowledge engineering disciplines. This comprehensive bibliometric study analyzes the journal’s publications over the past 40 years (1985–2024), employing bibliographic records and citation data from Scopus, Web of Science (WoS), and ScienceDirect. By utilizing CiteSpace for citation and co-citation mapping and Dirichlet Multinomial Regression (DMR) topic modeling for trend analysis, the research provides a multifaceted examination of the journal’s scholarly landscape. Over its 40-year history, DKE has published 1951 articles, accumulating 53,594 citations. The study comprehensively explores key bibliometric dimensions, including influential authors, author networks, citation patterns, topic clusters, institutional contributions, and research funding sponsors, as well as evolution of topics, showing increasing, decreasing, or constant trends. Comprehensive analysis offers a meta-analytical perspective on DKE’s scholarly contributions, positioning the journal as a pioneering publication platform that advances critical knowledge and methodological innovations in data and knowledge engineering research domains. Through an in-depth examination of the journal’s publication trajectory, the study provides insights into the field’s scholarly evolution, highlighting DKE’s pivotal role in shaping academic discourse and technological understanding.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102462"},"PeriodicalIF":2.7,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144106307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"G2MBCF: Enhanced Named Entity Recognition for sensitive entities identification","authors":"Weibin Tian , Kaiming Gu , Shihui Xiao , Junbo Zhang , Wei Cui","doi":"10.1016/j.datak.2025.102444","DOIUrl":"10.1016/j.datak.2025.102444","url":null,"abstract":"<div><div>With the increasing growth of data, work on data security is becoming increasingly important. As the core of important data detection, the sensitive entities identification (SEI) problem has become a hot topic in natural language processing (NLP) science. Named Entity Recognition (NER) is the foundation of SEI, however, current studies treat SEI only as a special case of the NER problem. It lacks more detailed considerations of implicit links between entities and relations. In this paper, we propose a novel enhanced method called G2MBCF based on latent factor model (LFM). We use knowledge graph to represent the NER primary result with semantic structure. Then we use G2MBCF to inscribe entities and relations through a <span><math><mrow><mi>E</mi><mo>−</mo><mi>R</mi></mrow></math></span> matrix to mine implicit connections. Experiments show that compared to existing NER methods, our method enhances <span><math><mrow><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow></math></span> and <span><math><mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi></mrow></math></span> of SEI. We also studied the influence of parameters in the experiments.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102444"},"PeriodicalIF":2.7,"publicationDate":"2025-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143900094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harivardhagini S (Professor) , Pranavanand S (Associate Professor) , Raghuram A (Professor)
{"title":"Ensemble model with combined feature set for Big data classification in IoT scenario","authors":"Harivardhagini S (Professor) , Pranavanand S (Associate Professor) , Raghuram A (Professor)","doi":"10.1016/j.datak.2025.102447","DOIUrl":"10.1016/j.datak.2025.102447","url":null,"abstract":"<div><div>Sensor nodes that are wirelessly connected to the internet and several systems make up the Internet of Things system. Large volumes of data are often stored in big data, which complicates the classification process. There are many Big data classification strategies in use, but the main issues are the management of secure information as well as computational time. This paper's goal is to suggest a novel classification system for big data in Internet of Things networks that operates in four main phases. Particularly, the healthcare data is considered as the Big data perspective to solve the classification problem. Since the healthcare Big data is the revolutionary tool in this industry, it is becoming the most vital point of patient-centric care. Different data sources are aggregated in this Big data healthcare ecosystem. The first stage is data acquisition which takes place via Internet of Things through sensors. The second stage is improved DSig normalization for input data preprocessing. The third stage is MapReduce framework-based feature extraction for handling the Big data. This extract features like raw data, mutual information, information gain, and improved Renyi entropy. Finally, the fourth stage is an ensemble disease classification model by the combination of Recurrent Neural Network, Neural Network, and Improved Support Vector Machine for predicting normal and abnormal diseases. The suggested work is implemented by the Python tool, and the effectiveness, specificity, sensitivity, precision, and other factors of the results are assessed. The proposed ensemble model achieves superior precision of 0.9573 for the training rate of 90 % when compared to the traditional models.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102447"},"PeriodicalIF":2.7,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144084758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frederik Wangelik, Majid Rafiei, Mahsa Pourbafrani, Wil M.P. van der Aalst
{"title":"Releasing differentially private event logs using generative models","authors":"Frederik Wangelik, Majid Rafiei, Mahsa Pourbafrani, Wil M.P. van der Aalst","doi":"10.1016/j.datak.2025.102450","DOIUrl":"10.1016/j.datak.2025.102450","url":null,"abstract":"<div><div>In recent years, the industry has been witnessing an extended usage of process mining and automated event data analysis. Consequently, there is a rising significance in addressing privacy apprehensions related to the inclusion of sensitive and private information within event data utilized by process mining algorithms. State-of-the-art research mainly focuses on providing quantifiable privacy guarantees, e.g., via differential privacy, for trace variants that are used by the main process mining techniques, e.g., process discovery. However, privacy preservation techniques designed for the release of trace variants are still insufficient to meet all the demands of industry-scale utilization. Moreover, ensuring privacy guarantees in situations characterized by a high occurrence of infrequent trace variants remains a challenging endeavor. In this paper, we introduce two novel approaches for releasing differentially private trace variants based on trained generative models. With TraVaG, we leverage <em>Generative Adversarial Networks</em> (GANs) to sample from a privatized implicit variant distribution. Our second method employs <em>Denoising Diffusion Probabilistic Models</em> that reconstruct artificial trace variants from noise via trained Markov chains. Both methods offer industry-scale benefits and elevate the degree of privacy assurances, particularly in scenarios featuring a substantial prevalence of infrequent variants. Also, they overcome the shortcomings of conventional privacy preservation techniques, such as bounding the length of variants and introducing fake variants. Experimental results on real-life event data demonstrate that our approaches surpass state-of-the-art techniques in terms of privacy guarantees and utility preservation.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102450"},"PeriodicalIF":2.7,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Florian Plötzky , Katarina Britz , Wolf-Tilo Balke
{"title":"A conceptual model for attributions in event-centric knowledge graphs","authors":"Florian Plötzky , Katarina Britz , Wolf-Tilo Balke","doi":"10.1016/j.datak.2025.102449","DOIUrl":"10.1016/j.datak.2025.102449","url":null,"abstract":"<div><div>The use of narratives as a means of fusing information from knowledge graphs (KGs) into a coherent line of argumentation has been the subject of recent investigation. Narratives are especially useful in event-centric knowledge graphs in that they provide a means to connect different real-world events and categorize them by well-known narrations. However, specifically for controversial events, a problem in information fusion arises, namely, multiple <em>viewpoints</em> regarding the validity of certain event aspects, e.g., regarding the role a participant takes in an event, may exist. Expressing those viewpoints in KGs is challenging because disputed information provided by different viewpoints may introduce <em>inconsistencies</em>. Hence, most KGs only feature a single view on the contained information, hampering the effectiveness of narrative information access. This paper is an extension of our original work and introduces <em>attributions</em>, i.e., parameterized predicates that allow for the representation of facts that are only valid in a specific viewpoint. For this, we develop a conceptual model that allows for the representation of viewpoint-dependent information. As an extension, we enhance the model by a conception of viewpoint-compatibility. Based on this, we deepen our original deliberations on the model’s effects on information fusion and provide additional grounding in the literature.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102449"},"PeriodicalIF":2.7,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143855958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Collaboration with GenAI in engineering research design","authors":"Fazel Naghdy","doi":"10.1016/j.datak.2025.102445","DOIUrl":"10.1016/j.datak.2025.102445","url":null,"abstract":"<div><div>Over the past five years, the fast development and use of generative artificial intelligence (GenAI) and large language models (LLMs) has ushered in a new era of study, teaching, and learning in many domains. The role that GenAIs can play in engineering research is addressed. The related previous works report on the potential of GenAIs in the literature review process. However, such potential is not demonstrated by case studies and practical examples. The previous works also do not address how GenAIs can assist with all the steps traditionally taken to design research. This study examines the effectiveness of collaboration with GenAIs at various stages of research design. It explores whether collaboration with GenAIs can result in more focused and comprehensive outcomes. A generalised approach for collaboration with AI tools in research design is proposed. A case study to develop a research design on the concept of “shared machine-human driving” is deployed to show the validity of the articulated concepts. The case study demonstrates both the pros and cons of collaboration with GenAIs. The results generated at each stage are rigorously validated and thoroughly examined to ensure they remain free from inaccuracies or hallucinations and align with the original research objectives. When necessary, the results are manually adjusted and refined to uphold their integrity and accuracy. The findings produced by the various GenAI models utilized in this study highlight the key attributes of generative artificial intelligence, namely speed, efficiency, and scope. However, they also underscore the critical importance of researcher oversight, as unexamined inferences and interpretations can render the results irrelevant or meaningless.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102445"},"PeriodicalIF":2.7,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}