{"title":"Secure data storage in multi-cloud environments using lattice-based saber with Diffie-Hellman cryptography and authenticate based on PUF-ECC","authors":"R. Iyswarya , R. Anitha","doi":"10.1016/j.datak.2025.102512","DOIUrl":"10.1016/j.datak.2025.102512","url":null,"abstract":"<div><div>Human life has become highly dependent on data in recent decades almost every facet of daily activities, leading to its storage in multi-cloud environments. To ensure data integrity, confidentiality, and privacy, it is essential to protect data from unauthorized access. This paper proposes a novel approach for securing data in multi-cloud environments for user authentication and data storage using Lattice-Based Saber Cryptography combined with PUF-ECC and the Enhanced Goose Optimization Algorithm (EGOA). The initial user authentication is achieved through the PUF-ECC digital signature algorithm, which verifies both the user's and the device's identity. Once authenticated, user data is securely transmitted to the cloud server based on Lattice-Based Saber post-quantum cryptography combined with the Diffie-Hellman key exchange protocol. The encrypted data is then stored across multiple cloud storage through a cloud controller using RAM-based chunking. For efficient data retrieval, the Enhanced Goose Optimization Algorithm (EGOA) is employed to extract encrypted data from clouds. Finally, the data is decrypted using the Lattice-Based Saber decryption algorithm and securely retrieved by the authenticated user. This method enhances both the security and efficiency of cloud data management and retrieval. The experiment is carried out with the proposed methodologies and also compared with the existing technologies. The proposed approach achieves encryption times of 9.68 ms, key generation times of 4.84 ms, and block creation times of 1.59 ms, while maintaining a 93.7 % confidentiality rate, a 98 % packet delivery ratio, a transmission delay of 0.026 ms, throughput of 407.33 MB/s, jitter of 3.26 ms, and an RTT of 0.17 ms, demonstrating its effectiveness in secure data storage and retrieval in multi-cloud environments.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102512"},"PeriodicalIF":2.7,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145158079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A graph-based model for semantic textual similarity measurement","authors":"Van-Tan Bui , Quang-Minh Nguyen , Van-Vinh Nguyen , Duc-Toan Nguyen","doi":"10.1016/j.datak.2025.102509","DOIUrl":"10.1016/j.datak.2025.102509","url":null,"abstract":"<div><div>Measuring semantic similarity between sentence pairs is a fundamental problem in Natural Language Processing with applications in various domains, including machine translation, speech recognition, automatic question answering, and text summarization. Despite its significance, accurately assessing semantic similarity remains a challenging task, particularly for underrepresented languages such as Vietnamese. Existing methods have yet to fully leverage the unique linguistic characteristics of Vietnamese for semantic similarity measurement. To address this limitation, we propose GBNet-STS (Graph-Based Network for Semantic Textual Similarity), a novel framework for measuring the semantic similarity of Vietnamese sentence pairs. GBNet-STS integrates lexical-grammatical similarity scores and distributional semantic similarity scores within a multi-layered graph-based model. By capturing different semantic perspectives through multiple interconnected layers, our approach provides a more comprehensive and robust similarity estimation. Experimental results demonstrate that GBNet-STS outperforms traditional methods, achieving state-of-the-art performance in Vietnamese semantic similarity tasks.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102509"},"PeriodicalIF":2.7,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145060462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ASF: A novel associative scoring function for embedded knowledge graph reasoning","authors":"MVPT Lakshika, HA Caldera","doi":"10.1016/j.datak.2025.102511","DOIUrl":"10.1016/j.datak.2025.102511","url":null,"abstract":"<div><div>One of the most important tools for knowledge management is the Knowledge Graph (KG), a multi-relational graph that depicts rich factual information across entities. A KG represents entities as nodes and relations as edges, with each edge represented by a triplet: (head entity, relation, tail entity). The Scoring Function (SF) in a KG quantifies the plausibility of these triplets and is often derived from KG embeddings. However, due to the distinct relational patterns across KGs, an SF that performs well on one KG might fail on another, making the design of optimal SFs a challenging task. This study introduces the concept of an Associative Scoring Function (ASF), which leverages Association Rule Mining (ARM) to discover and incorporate patterns and characteristics of symmetric, asymmetric, inverse, and other relational types within embedded KGs. The ARM technique in ASF uses the FP-Growth algorithm to extract meaningful associations, which is enhanced further through hyperparameter tuning. Extensive experiments on benchmark datasets demonstrate that ASF is KG-independent and performs better than state-of-the-art SFs. These results highlight ASF's potential to generalize across diverse KGs, offering a significant advancement in the KG link prediction task.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102511"},"PeriodicalIF":2.7,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145094813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An integrated requirements framework for analytical and AI projects","authors":"Juan Trujillo , Ana Lavalle , Alejandro Reina-Reina , Jorge García-Carrasco , Alejandro Maté , Wolfgang Maaß","doi":"10.1016/j.datak.2025.102493","DOIUrl":"10.1016/j.datak.2025.102493","url":null,"abstract":"<div><div>To this day, the requirements of data warehouses, user visualizations and ML projects have been tackled in an independent manner, ignoring the possible cross-requirements, collective constraints and dependencies between the outputs of the different systems that should be taken into account to ensure a successful analytical project. In this work, we take a holistic approach and propose a methodology that supports modeling and subsequent analysis while taking into account these three aspects. This methodology has several advantages, mainly that (i) it enables us to identify possible conflicts between actors on different tasks that are overlooked if the systems are treated in an isolated manner and (ii) this holistic view enables modeling multi-company systems, where the information or even the analytical results can be provided by third-parties, identifying key participants in federated environments. After presenting the required formalism to carry out this kind of analysis, we showcase it on a real-world running example of the tourism sector.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102493"},"PeriodicalIF":2.7,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145094812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rule-guided process discovery","authors":"Ali Norouzifar , Marcus Dees , Wil van der Aalst","doi":"10.1016/j.datak.2025.102508","DOIUrl":"10.1016/j.datak.2025.102508","url":null,"abstract":"<div><div>Event data extracted from information systems serves as the foundation for process mining, enabling the extraction of insights and identification of improvements. Process discovery focuses on deriving descriptive process models from event logs, which form the basis for conformance checking, performance analysis, and other applications. Traditional process discovery techniques predominantly rely on event logs, often overlooking supplementary information such as domain knowledge and process rules. These rules, which define relationships between activities, can be obtained through automated techniques like declarative process discovery or provided by domain experts based on process specifications. When used as an additional input alongside event logs, such rules have significant potential to guide process discovery. However, leveraging rules to discover high-quality imperative process models, such as BPMN models and Petri nets, remains an underexplored area in the literature. To address this gap, we propose an enhanced framework, IMr, which integrates discovered or user-defined rules into the process discovery workflow via a novel recursive approach. The IMr framework employs a divide-and-conquer strategy, using rules to guide the selection of process structures at each recursion step in combination with the input event log. We evaluate our approach on several real-world event logs and demonstrate that the discovered models better align with the provided rules without compromising their conformance to the event log. Additionally, we show that high-quality rules can improve model quality across well-known conformance metrics. This work highlights the importance of integrating domain knowledge into process discovery, enhancing the quality, interpretability, and applicability of the resulting process models.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102508"},"PeriodicalIF":2.7,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145048987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LQ-FJS: A logical query-digging fake-news judgment system with structured video-summarization engine using LLM","authors":"Jhing-Fa Wang , Din-Yuen Chan , Hsin-Chun Tsai , Bo-Xuan Fang","doi":"10.1016/j.datak.2025.102507","DOIUrl":"10.1016/j.datak.2025.102507","url":null,"abstract":"<div><div>The proliferation of online social platforms can greatly benefit people by fostering remote relationships, but it also inevitably amplifies the impact of multimodal fake news on societal trust and ethics. Existing fake-news detection AI systems are still vulnerable to the inconspicuous and indiscernible multimodal misinformation, and often lacking interpretability and accuracy in cross-platform settings. Hence, we propose a new innovative logical query-digging fake-news judgment system (LQ-FJS) to tackle the above problem based on multimodal approach. The LQ-FJS verifies the truthfulness of claims made within multimedia news by converting video content into structured textual summaries. It then acts as an interpretable agent, explaining the reasons for identified fake news by the structured video-summarization engine (SVSE) to act as an interpretable detection intermediary agent. The SVSE generates condensed captions for raw video content, converting it into structured textual narratives. Then, LQ-FJS exploits these condensed captions to retrieve reliable information related to the video content from LLM. Thus, LQ-FJS cross-verifies external knowledge sources and internal LLM responses to determine whether contradictions exist with factual information through a multimodal inconsistency verification procedure. Our experiments demonstrate that the subtle summarization produced by SVSE can facilitate the generation of explanatory reports that mitigate large-scale trust deficits caused by opaque “black-box” models. Our experiments show that LQ-FJS improves F1 scores by 4.5% and 7.2% compared to state-of-the-art models (FactLLaMA 2023 and HiSS 2023), and increases 14% user trusts through interpretable conclusions.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102507"},"PeriodicalIF":2.7,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145019198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ABBA: Index structure for sequential pattern-based aggregate queries","authors":"Witold Andrzejewski, Tadeusz Morzy, Maciej Zakrzewicz","doi":"10.1016/j.datak.2025.102506","DOIUrl":"10.1016/j.datak.2025.102506","url":null,"abstract":"<div><div>Pattern-based aggregate (PBA) queries constitute an important and widely used type of analytical queries in sequence OLAP (S-OLAP) systems. Unfortunately, finding accurate answers to PBA queries in the S-OLAP system is often very expensive both in terms of time and memory consumption. In this paper we propose an efficient and easily maintainable index structure called the ABBA Index, which addresses the problem of PBA query processing.</div><div>Experiments conducted using the KDD Cup data and public transport passengers’ travel behavior data show that our index outperforms state-of-the art solutions while requiring much less memory. The ABBA Index can be easily extended to support pattern-based aggregate queries over hierarchy (PBA-H), a novel class of analytical queries which we introduce as the second main contribution of the paper. Sensitivity, scalability and complexity analysis of the ABBA Index is also provided.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102506"},"PeriodicalIF":2.7,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144911761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Source-Free Domain Adaptation with complex distribution considerations for time series data","authors":"Jing Shang, Zunming Chen, Zhiwen Xiao, Zhihui Wu, Yifei Zhang, Jibing Wang","doi":"10.1016/j.datak.2025.102501","DOIUrl":"10.1016/j.datak.2025.102501","url":null,"abstract":"<div><div>Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained model from a labeled source domain to an unlabeled target domain without accessing source domain data, thereby protecting source domain privacy. Although SFDA has recently been applied to time series data, the inherent complex distribution characteristics including temporal variability and distributional diversity of such data remain underexplored. Time series data exhibit significant dynamic variability influenced by collection environments, leading to discrepancies between sequences. Additionally, multidimensional time series data face distributional diversity across dimensions. These complex characteristics increase the learning difficulty for source models and widen the adaptation gap between the source and target domains. To address these challenges, this paper proposes a novel SFDA method for time series data, named Adaptive Latent Subdomain feature extraction and joint Prediction (ALSP). The method divides the source domain, which has a complex distribution, into multiple latent subdomains with relatively simple distributions, thereby effectively capturing the features of different subdistributions. It extracts latent domain-specific and domain-invariant features to identify subdomain-specific characteristics. Furthermore, it combines domain-specific classifiers and a domain-invariant classifier to enhance model performance through multi-classifier joint prediction. During target domain adaptation, ALSP reduces domain dependence by extracting invariant features, thereby narrowing the distributional gap between the source and target domains. Simultaneously, it leverages prior knowledge from the source domain distribution to support the hypothesis space and dynamically adapt to the target domain. Experiments on three real-world datasets demonstrate that ALSP achieves superior performance in cross-domain time series classification tasks, significantly outperforming existing methods.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102501"},"PeriodicalIF":2.7,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144918105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Elevating human-machine collaboration in NLP for enhanced content creation and decision support","authors":"Priyanka V. Deshmukh, Aniket K. Shahade","doi":"10.1016/j.datak.2025.102505","DOIUrl":"10.1016/j.datak.2025.102505","url":null,"abstract":"<div><div>Human-machine collaboration in Natural Language Processing (NLP) is revolutionizing content creation and decision support by seamlessly combining the strengths of both entities for enhanced efficiency and quality. The lack of seamless integration between human creativity and machine efficiency in NLP hinders optimal content creation and decision support. The objective of this study is to explore and promote the integration of human-machine collaboration in NLP to enhance both content creation and decision support processes. Data Acquisition for NLP requests involves defining the task and target audience, identifying relevant data sources like text documents and web data, and incorporating human expertise for data curation through validation and annotation. Machine processing techniques like tokenization, stemming/lemmatization, and removal of stop words, as well as human input for tasks like data annotation and error correction, to improve data quality and relevance for NLP applications. The combination of automated processing and human feedback leads to more precise and dependable effects. Techniques such as sentiment analysis, topic modelling, and entity recognition are utilized to excerpt valued perceptions from the data and enhance collaboration between humans and machines. These techniques help to streamline the NLP process and ensure that the system is providing accurate and relevant information to users. The analysis of NLP models in machine processing involves training the models to perform specific tasks, such as summarization, sentiment analysis, information extraction, trend identification, and creative content generation. The results show that social media leads with 90% usage, pivotal for audience engagement, while blogs at 78% highlight their depth in content creation implementation using Python software. These trained models are then used to improve decision-making processes, generate creative content, and enhance the accuracy of search results. The future scope involves leveraging advanced NLP techniques to deepen the collaboration between humans and machines for more effective content creation and decision support.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102505"},"PeriodicalIF":2.7,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144908780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ELEVATE-ID: Extending Large Language Models for End-to-End Entity Linking Evaluation in Indonesian","authors":"Ria Hari Gusmita , Asep Fajar Firmansyah , Hamada M. Zahera , Axel-Cyrille Ngonga Ngomo","doi":"10.1016/j.datak.2025.102504","DOIUrl":"10.1016/j.datak.2025.102504","url":null,"abstract":"<div><div>Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their effectiveness in low-resource languages remains underexplored, particularly in complex tasks such as end-to-end Entity Linking (EL), which requires both mention detection and disambiguation against a knowledge base (KB). In earlier work, we introduced IndEL — the first end-to-end EL benchmark dataset for the Indonesian language — covering both a general domain (news) and a specific domain (religious text from the Indonesian translation of the Quran), and evaluated four traditional end-to-end EL systems on this dataset. In this study, we propose ELEVATE-ID, a comprehensive evaluation framework for assessing LLM performance on end-to-end EL in Indonesian. The framework evaluates LLMs under both zero-shot and fine-tuned conditions, using multilingual and Indonesian monolingual models, with Wikidata as the target KB. Our experiments include performance benchmarking, generalization analysis across domains, and systematic error analysis. Results show that GPT-4 and GPT-3.5 achieve the highest accuracy in zero-shot and fine-tuned settings, respectively. However, even fine-tuned GPT-3.5 underperforms compared to DBpedia Spotlight — the weakest of the traditional model baselines — in the general domain. Interestingly, GPT-3.5 outperforms Babelfy in the specific domain. Generalization analysis indicates that fine-tuned GPT-3.5 adapts more effectively to cross-domain and mixed-domain scenarios. Error analysis uncovers persistent challenges that hinder LLM performance: difficulties with non-complete mentions, acronym disambiguation, and full-name recognition in formal contexts. These issues point to limitations in mention boundary detection and contextual grounding. Indonesian-pretrained LLMs, Komodo and Merak, reveal core weaknesses: template leakage and entity hallucination, respectively—underscoring architectural and training limitations in low-resource end-to-end EL.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"161 ","pages":"Article 102504"},"PeriodicalIF":2.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144889217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}