Conceptual design of multidimensional cubes with LLMs: An investigation
Stefano Rizzi, Matteo Francia, Enrico Gallinucci, Matteo Golfarelli
Data & Knowledge Engineering, Vol. 159, Article 102452 (published 2025-05-07). DOI: 10.1016/j.datak.2025.102452

Abstract: Large Language Models (LLMs) can simulate human linguistic capabilities, thus producing a disruptive impact across several domains, including software engineering. In this paper we focus on a specific scenario of software engineering, that of conceptual design of multidimensional data cubes. The goal is to evaluate the performance of LLMs (specifically, ChatGPT-4o) in multidimensional conceptual design using the Dimensional Fact Model as a reference. To this end, we formulate nine research questions to (i) understand the competences of ChatGPT in multidimensional conceptual design, following either a supply- or a demand-driven approach, and (ii) investigate to what extent they can be improved via prompt engineering. After describing the research process in terms of base criteria, technological setting, input/output format, prompt templates, test cases, and metrics for evaluating the results, we discuss the output of the experiment. Our main conclusions are that (i) when prompts are enhanced with detailed procedural instructions and examples, the results produced significantly improve in all cases; and (ii) overall, ChatGPT is better at demand-driven design than at supply-driven design.

Four decades of data & knowledge engineering: A bibliometric analysis and topic evolution study (1985–2024)
Tatsawan Timakum, Soobin Lee, Dongha Kim, Min Song, Il-Yeol Song
Data & Knowledge Engineering, Vol. 159, Article 102462 (published 2025-05-05). DOI: 10.1016/j.datak.2025.102462

Abstract: The Data and Knowledge Engineering (DKE) journal has established a significant global research presence over four decades, substantially contributing to the advancement of data and knowledge engineering disciplines. This comprehensive bibliometric study analyzes the journal's publications over the past 40 years (1985–2024), employing bibliographic records and citation data from Scopus, Web of Science (WoS), and ScienceDirect. By utilizing CiteSpace for citation and co-citation mapping and Dirichlet Multinomial Regression (DMR) topic modeling for trend analysis, the research provides a multifaceted examination of the journal's scholarly landscape. Over its 40-year history, DKE has published 1,951 articles, accumulating 53,594 citations. The study comprehensively explores key bibliometric dimensions, including influential authors, author networks, citation patterns, topic clusters, institutional contributions, and research funding sponsors, as well as the evolution of topics, labeling each with an increasing, decreasing, or constant trend. This comprehensive analysis offers a meta-analytical perspective on DKE's scholarly contributions, positioning the journal as a pioneering publication platform that advances critical knowledge and methodological innovations in data and knowledge engineering research domains. Through an in-depth examination of the journal's publication trajectory, the study provides insights into the field's scholarly evolution, highlighting DKE's pivotal role in shaping academic discourse and technological understanding.
{"title":"G2MBCF: Enhanced Named Entity Recognition for sensitive entities identification","authors":"Weibin Tian , Kaiming Gu , Shihui Xiao , Junbo Zhang , Wei Cui","doi":"10.1016/j.datak.2025.102444","DOIUrl":"10.1016/j.datak.2025.102444","url":null,"abstract":"<div><div>With the increasing growth of data, work on data security is becoming increasingly important. As the core of important data detection, the sensitive entities identification (SEI) problem has become a hot topic in natural language processing (NLP) science. Named Entity Recognition (NER) is the foundation of SEI, however, current studies treat SEI only as a special case of the NER problem. It lacks more detailed considerations of implicit links between entities and relations. In this paper, we propose a novel enhanced method called G2MBCF based on latent factor model (LFM). We use knowledge graph to represent the NER primary result with semantic structure. Then we use G2MBCF to inscribe entities and relations through a <span><math><mrow><mi>E</mi><mo>−</mo><mi>R</mi></mrow></math></span> matrix to mine implicit connections. Experiments show that compared to existing NER methods, our method enhances <span><math><mrow><mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow></math></span> and <span><math><mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi></mrow></math></span> of SEI. We also studied the influence of parameters in the experiments.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102444"},"PeriodicalIF":2.7,"publicationDate":"2025-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143900094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Ensemble model with combined feature set for Big data classification in IoT scenario
Harivardhagini S, Pranavanand S, Raghuram A
Data & Knowledge Engineering, Vol. 159, Article 102447 (published 2025-04-17). DOI: 10.1016/j.datak.2025.102447

Abstract: An Internet of Things (IoT) system is made up of sensor nodes wirelessly connected to the internet and to several other systems. Big data often stores large volumes of data, which complicates the classification process. Many big data classification strategies are in use, but the main issues are the management of secure information and computational time. The goal of this paper is to propose a novel four-phase classification system for big data in IoT networks. In particular, healthcare data is considered from the big data perspective to frame the classification problem: healthcare big data is a revolutionary tool in this industry and is becoming the focal point of patient-centric care, with different data sources aggregated into one big data healthcare ecosystem. The first phase is data acquisition, which takes place via IoT sensors. The second phase is improved DSig normalization for preprocessing the input data. The third phase is MapReduce-based feature extraction for handling the big data, extracting features such as raw data, mutual information, information gain, and improved Rényi entropy. Finally, the fourth phase is an ensemble disease classification model that combines a Recurrent Neural Network, a Neural Network, and an Improved Support Vector Machine to predict normal and abnormal cases. The proposed work is implemented in Python, and the results are assessed for effectiveness, specificity, sensitivity, precision, and other factors. The proposed ensemble model achieves a superior precision of 0.9573 at a training rate of 90% when compared to traditional models.

Releasing differentially private event logs using generative models
Frederik Wangelik, Majid Rafiei, Mahsa Pourbafrani, Wil M.P. van der Aalst
Data & Knowledge Engineering, Vol. 159, Article 102450 (published 2025-04-15). DOI: 10.1016/j.datak.2025.102450

Abstract: In recent years, the industry has been witnessing an extended usage of process mining and automated event data analysis. Consequently, there is a rising significance in addressing privacy apprehensions related to the inclusion of sensitive and private information within event data utilized by process mining algorithms. State-of-the-art research mainly focuses on providing quantifiable privacy guarantees, e.g., via differential privacy, for trace variants that are used by the main process mining techniques, e.g., process discovery. However, privacy preservation techniques designed for the release of trace variants are still insufficient to meet all the demands of industry-scale utilization. Moreover, ensuring privacy guarantees in situations characterized by a high occurrence of infrequent trace variants remains a challenging endeavor. In this paper, we introduce two novel approaches for releasing differentially private trace variants based on trained generative models. With TraVaG, we leverage Generative Adversarial Networks (GANs) to sample from a privatized implicit variant distribution. Our second method employs Denoising Diffusion Probabilistic Models that reconstruct artificial trace variants from noise via trained Markov chains. Both methods offer industry-scale benefits and elevate the degree of privacy assurances, particularly in scenarios featuring a substantial prevalence of infrequent variants. Also, they overcome the shortcomings of conventional privacy preservation techniques, such as bounding the length of variants and introducing fake variants. Experimental results on real-life event data demonstrate that our approaches surpass state-of-the-art techniques in terms of privacy guarantees and utility preservation.

A conceptual model for attributions in event-centric knowledge graphs
Florian Plötzky, Katarina Britz, Wolf-Tilo Balke
Data & Knowledge Engineering, Vol. 159, Article 102449 (published 2025-04-15). DOI: 10.1016/j.datak.2025.102449

Abstract: The use of narratives as a means of fusing information from knowledge graphs (KGs) into a coherent line of argumentation has been the subject of recent investigation. Narratives are especially useful in event-centric knowledge graphs in that they provide a means to connect different real-world events and categorize them by well-known narrations. However, specifically for controversial events, a problem in information fusion arises, namely, multiple viewpoints regarding the validity of certain event aspects, e.g., regarding the role a participant takes in an event, may exist. Expressing those viewpoints in KGs is challenging because disputed information provided by different viewpoints may introduce inconsistencies. Hence, most KGs only feature a single view on the contained information, hampering the effectiveness of narrative information access. This paper is an extension of our original work and introduces attributions, i.e., parameterized predicates that allow for the representation of facts that are only valid in a specific viewpoint. For this, we develop a conceptual model that allows for the representation of viewpoint-dependent information. As an extension, we enhance the model by a conception of viewpoint-compatibility. Based on this, we deepen our original deliberations on the model's effects on information fusion and provide additional grounding in the literature.
{"title":"Collaboration with GenAI in engineering research design","authors":"Fazel Naghdy","doi":"10.1016/j.datak.2025.102445","DOIUrl":"10.1016/j.datak.2025.102445","url":null,"abstract":"<div><div>Over the past five years, the fast development and use of generative artificial intelligence (GenAI) and large language models (LLMs) has ushered in a new era of study, teaching, and learning in many domains. The role that GenAIs can play in engineering research is addressed. The related previous works report on the potential of GenAIs in the literature review process. However, such potential is not demonstrated by case studies and practical examples. The previous works also do not address how GenAIs can assist with all the steps traditionally taken to design research. This study examines the effectiveness of collaboration with GenAIs at various stages of research design. It explores whether collaboration with GenAIs can result in more focused and comprehensive outcomes. A generalised approach for collaboration with AI tools in research design is proposed. A case study to develop a research design on the concept of “shared machine-human driving” is deployed to show the validity of the articulated concepts. The case study demonstrates both the pros and cons of collaboration with GenAIs. The results generated at each stage are rigorously validated and thoroughly examined to ensure they remain free from inaccuracies or hallucinations and align with the original research objectives. When necessary, the results are manually adjusted and refined to uphold their integrity and accuracy. The findings produced by the various GenAI models utilized in this study highlight the key attributes of generative artificial intelligence, namely speed, efficiency, and scope. However, they also underscore the critical importance of researcher oversight, as unexamined inferences and interpretations can render the results irrelevant or meaningless.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102445"},"PeriodicalIF":2.7,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Derived multi-objective function for latency sensitive-based cloud object storage system using hybrid heuristic algorithm","authors":"N Nataraj , RV Nataraj","doi":"10.1016/j.datak.2025.102448","DOIUrl":"10.1016/j.datak.2025.102448","url":null,"abstract":"<div><div>Cloud Object Storage System (COSS) is capable of storing and retrieving a ton of unstructured data items called objects which act as a core cloud service for contemporary web-based applications. While sharing the data among different parties, privacy preservation becomes challenging. <em>Research Problem:</em> From day-to-day activities, a high volume of requests are served daily thus, it leads to cause the latency issues. In a cloud storage system, the adaption of a holistic approach helps the user to identify sensitive information and analyze the unwanted files/data. With evolving of Internet of Things (IoT) applications are latency-sensitive, which does not function well with these new ideas and platforms that are available today. <em>Overall Purpose of the Study:</em> Therefore, a novel latency-aware COSS is implemented with the aid of multi-objective functionalities to allocate and reallocate data efficiently in order to sustain the storage process in the cloud environment. <em>Design of the Study:</em> This goal is accomplished by implementing a hybrid meta-heuristic approach with the integration of the Mother Optimization Algorithm (MOA) with Dolphin Swarm Optimization (DSO) algorithm. The implemented hybrid optimization algorithm is called the Hybrid Dolphin Swarm-based Mother Optimization Algorithm (HDS-MOA). The HDS-MOA considers the objective function by considering constraints like throughput, latency, resource usage, and active servers during the data allocation process. While considering data reallocation process, the developed HDS-MOA algorithm is also performed by considering the multi-objective constraints like cost, makespan, and energy. The diverse experimental test is conducted to prove its effectiveness by comparing it with other existing methods for storing data efficiently across cloud networks. <em>Major findings of results:</em> In the configuration 3, the proposed HDS-MOA attains 31.11 %, 55.71 %, 55.71 %, and 68.21 % enhanced than the OSSperf, queuing theory, scheduling technique, and Monte Carlo-PSO based on the latency analysis. <em>Overview of Interpretations and Conclusions:</em> The developed HDS-MOA assured the better performance on the data is preserved in the optimal locations having appropriate access time and less latency that is highly essential for the cloud object storage. This supports to enhance the overall user experience by boosting the data retrieval. 
<em>Limitations of this Study with Solutions:</em> The ability of the proposed algorithm needs to enhance on balancing the multiple objectives such as performance, cost, and fault tolerance for optimally performing the operations in real-time that makes the system to be more efficient as well as responsive in the dynamic variations in the demand.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102448"},"PeriodicalIF":2.7,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143859469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
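The allocation objective combines throughput, latency, resource usage, and active servers. A toy sketch of a weighted-sum fitness function over those four metrics; the weights, normalization to [0, 1], and the weighted-sum scalarization itself are assumptions, since the abstract does not give the paper's exact formulation:

```python
def allocation_fitness(metrics, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted-sum fitness for one candidate allocation plan.

    `metrics` holds values normalized to [0, 1]. Lower fitness is better:
    throughput is inverted since higher throughput is desirable, while
    latency, resource usage, and active servers should be minimized.
    """
    w_t, w_l, w_r, w_a = weights
    return (w_t * (1 - metrics["throughput"])
            + w_l * metrics["latency"]
            + w_r * metrics["resource_usage"]
            + w_a * metrics["active_servers"])

# Example: scoring one candidate plan produced by a search agent
score = allocation_fitness({"throughput": 0.8, "latency": 0.2,
                            "resource_usage": 0.5, "active_servers": 0.4})
```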
{"title":"ECS-KG: An event-centric semantic knowledge graph for event-related news articles","authors":"MVPT Lakshika, HA Caldera, TNK De Zoysa","doi":"10.1016/j.datak.2025.102451","DOIUrl":"10.1016/j.datak.2025.102451","url":null,"abstract":"<div><div>Recent advances in deep learning techniques and contextual understanding render Knowledge Graphs (KGs) valuable tools for enhancing accessibility and news comprehension. Conventional and news-specific KGs frequently lack the specificity for efficient news-related tasks, leading to limited relevance and static data representation. To fill the gap, this study proposes an Event-Centric Semantic Knowledge Graph (ECS-KG) model that combines deep learning approaches with contextual embeddings to improve the procedural and dynamic knowledge representation observed in news articles. The ECS-KG incorporates several information extraction techniques, a temporal Graph Neural Network (GNN), and a Graph Attention Network (GAT), yielding significant improvements in news representation. Several gold-standard datasets, comprising CNN/Daily Mail, TB-Dense, and ACE 2005, revealed that the proposed model outperformed the most advanced models. By integrating temporal reasoning and semantic insights, ECS-KG not only enhances user understanding of news significance but also meets the evolving demands of news consumers. This model advances the field of event-centric semantic KGs and provides valuable resources for applications in news information processing.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102451"},"PeriodicalIF":2.7,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143828580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Overcoming the hurdle of legal expertise: A reusable model for smartwatch privacy policies
Constantin Buschhaus, Arvid Butting, Judith Michael, Verena Nitsch, Sebastian Pütz, Bernhard Rumpe, Carolin Stellmacher, Sabine Theis
Data & Knowledge Engineering, Vol. 159, Article 102443 (published 2025-04-01). DOI: 10.1016/j.datak.2025.102443

Abstract: Regulations for privacy protection aim to protect individuals from the unauthorized storage, processing, and transfer of their personal data, but often fail to provide helpful support for understanding these regulations. To better communicate privacy policies for smartwatches, we need an in-depth understanding of their concepts and better ways to enable developers to integrate them when engineering systems. Up to now, no conceptual model exists that covers privacy statements from different smartwatch manufacturers and is reusable for developers. This paper introduces such a conceptual model for privacy policies of smartwatches and shows its use in a model-driven software engineering approach to create a platform for data visualization of wearable privacy policies from different smartwatch manufacturers. We analyzed the privacy policies of various manufacturers and extracted the relevant concepts. Moreover, we checked the model with lawyers for its correctness, instantiated it with concrete data, and used it in a model-driven software engineering approach to create a platform for data visualization. This reusable privacy policy model can enable developers to easily represent privacy policies in their systems. It provides a foundation for more structured and understandable privacy policies which, in the long run, can increase the data sovereignty of application users.