{"title":"Large language models for conceptual modeling: Assessment and application potential","authors":"Veda C. Storey , Oscar Pastor , Giancarlo Guizzardi , Stephen W. Liddle , Wolfgang Maaß , Jeffrey Parsons , Jolita Ralyté , Maribel Yasmina Santos","doi":"10.1016/j.datak.2025.102480","DOIUrl":"10.1016/j.datak.2025.102480","url":null,"abstract":"<div><div>Large Language Models (LLMs) are being rapidly adopted for many activities in organizations, business, and education. Their applications include capabilities to generate text, code, and models, which raises questions about their potential role in the conceptual modeling part of information systems development. This paper reports on a panel presented at the <em>43rd International Conference on Conceptual Modeling</em>, where researchers discussed the current and potential role of LLMs in conceptual modeling. The panelists discussed applications and interest levels and expressed both optimism and caution about the adoption of LLMs. They suggest that the conceptual modeling community needs much continued research on LLM development and on the role of LLMs in research and teaching.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102480"},"PeriodicalIF":2.7,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144517377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explainable artificial intelligence for natural language processing: A survey","authors":"Md. Mehedi Hassan , Anindya Nag , Riya Biswas , Md Shahin Ali , Sadika Zaman , Anupam Kumar Bairagi , Chetna Kaushal","doi":"10.1016/j.datak.2025.102470","DOIUrl":"10.1016/j.datak.2025.102470","url":null,"abstract":"<div><div>Recently, artificial intelligence (AI) has gained considerable momentum and is predicted to surpass expectations across a range of industries. However, explainability is a major challenge due to sub-symbolic techniques such as Deep Neural Networks and ensembles, which were absent during earlier booms of AI. This lack of explainability greatly undermines the practical application of AI in numerous areas. To counter the opacity of AI-based systems, Explainable AI (XAI) aims to increase the transparency and human comprehension of black-box AI models. A variety of XAI strategies have been proposed to address the explainability problem; however, given the complexity of the search space, it can be difficult for ML developers and data scientists to construct XAI applications and choose the optimal XAI algorithms. To aid developers, this paper surveys the frameworks, operations, and explainability methodologies currently available for producing reasoning for the predictions of Natural Language Processing (NLP) models. Additionally, a thorough analysis of current work in explainable NLP and AI is undertaken, providing researchers worldwide with opportunities for exploration, insight, and idea development. Finally, the authors highlight gaps in the literature and offer ideas for future research in this area.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102470"},"PeriodicalIF":2.7,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144297314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling cancellation dynamics: A two-stage model for predictive analytics","authors":"Soumyadeep Kundu , Soumya Roy , Archit Shukla , Arqum Mateen","doi":"10.1016/j.datak.2025.102467","DOIUrl":"10.1016/j.datak.2025.102467","url":null,"abstract":"<div><div>Booking cancellations have an adverse impact on the performance of firms in the hospitality industry. Most studies in this domain have considered whether a booking will be cancelled (if). While useful, given the nature of the industry, it is also important to understand the timing of cancellation (when). Addressing the inter-temporal nature of the question would help hotels devise appropriate strategies to accommodate cancellations. In our study, we propose a novel two-stage model that predicts both the likelihood (if) and the timing (when) of cancellation, using various statistical and machine learning techniques. We find that significant predictors include the average daily rate (an indicator of the average rental revenue earned per occupied room per day), month of arrival, day of arrival, and lead time. Our insights can help hotels design bespoke cancellation policies and offer personalised services and interventions for guests.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102467"},"PeriodicalIF":2.7,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144279321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-feature classification for fake news detection using multiscale and atrous convolution-based adaptive temporal convolution network","authors":"Rashmi Rane , R. Subhashini","doi":"10.1016/j.datak.2025.102469","DOIUrl":"10.1016/j.datak.2025.102469","url":null,"abstract":"<div><div>With the exponential growth of social media, platforms such as Facebook, Twitter, YouTube, and Instagram have become primary sources of news and information about anything, anywhere. However, fake information uploaded by particular users can spread quickly, affecting how people consume media. In this research work, a novel deep learning-based framework is proposed to detect fake news effectively and enhance the trust of social media users. First, the required text data is gathered from benchmark resources and passed to the preprocessing stage. The preprocessed data is then fed into the feature extraction phase, where Bidirectional Encoder Representations from Transformers (BERT), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN) mechanisms are utilized to extract meaningful information from the data and improve accuracy. This phase generates three sets of features (BERT, temporal, and spatial), which are then given to the detection phase. Here, the Multiscale and Atrous Convolution-based Adaptive Temporal Convolution Network (MAC-ATCN) is used to identify and categorize false information, ensuring more reliable outcomes and decision-making. Additionally, the Modified Osprey Optimization Algorithm (MOOA) is employed to fine-tune the parameters and prevent overfitting when dealing with larger data; it also helps address imbalanced dataset issues by varying the hyperparameters during training. Finally, the overall detection performance is validated with various performance measures and compared with existing works. The developed method achieved accuracy values of 93.74 % on dataset 1 and 92.82 % on dataset 2. Effectively identifying fake news on social media can help users make timely, informed decisions, preventing the spread of misinformation and protecting individuals from harmful consequences.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102469"},"PeriodicalIF":2.7,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144271234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic query expansion for enhancing document retrieval system in healthcare application using GAN based embedding and hyper-tuned DAEBERT algorithm","authors":"Deepak Vishwakarma , Suresh Kumar","doi":"10.1016/j.datak.2025.102468","DOIUrl":"10.1016/j.datak.2025.102468","url":null,"abstract":"<div><div>Query expansion is a useful technique for improving the dependability and performance of document retrieval systems. Search engines frequently employ query expansion strategies to improve Information Retrieval (IR) performance and elucidate users' information requirements. Although there are several methods for automatically expanding queries, the list of returned documents can be lengthy and contain much useless information, particularly when searching the Web. As the volume of medical documents grows, automatic query expansion can also struggle with efficiency and real-time application. Thus, a Hyper-Tuned Dual Attention Enhanced Bi-directional Encoder Representation from Transformers (HT-DAEBERT) model with an automatic ranking-based query expansion system is created to enhance medical document retrieval. Initially, the user's query over the medical corpus is collected and augmented using a Generative Adversarial Network (GAN) approach. The augmented text is then pre-processed to improve its quality through tokenization, acronym expansion, stemming, stop word removal, hyperlink removal, and spell correction. After that, keywords are extracted from the pre-processed text using the Proximity-based Keyword Extraction (PKE) technique. Afterwards, the words are converted into vector form using the HT-DAEBERT model, in which key parameters such as dropout rate and weight decay are optimally selected using the Election Optimization Algorithm (EOA). Finally, a ranking-based query expansion approach is employed to enhance the document retrieval system. The proposed method achieves an accuracy of 97.60 %, a Hit Rate of 98.30 %, a PPV of 93.40 %, an F1-Score of 95.79 %, and an NPV of 97.50 %. This approach improves the accuracy and relevance of document retrieval in healthcare, potentially leading to better patient care and enhanced clinical outcomes.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102468"},"PeriodicalIF":2.7,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144306983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROSI: A hybrid solution for omni-channel feature integration in E-commerce","authors":"Luyi Ma , Shengwei Tang , Anjana Ganesh, Jiao Chen, Aashika Padmanabhan, Malay Patel, Jianpeng Xu, Jason Cho, Evren Korpeoglu, Sushant Kumar, Kannan Achan","doi":"10.1016/j.datak.2025.102465","DOIUrl":"10.1016/j.datak.2025.102465","url":null,"abstract":"<div><div>Efficient integration of customer behavior data across multiple channels, including online and in-store interactions, is essential for developing recommendation systems that enhance customer experiences and maintain a competitive edge in e-commerce. However, the integration process faces several challenges, including data synchronization and discrepancies in data schemas. In this study, we introduce a hybrid data pipeline, <span>ROSI</span> (Retail Online-Store Integration), designed to integrate real-time streaming data from online platforms with batch data from in-store interactions. <span>ROSI</span> employs scalable, fault-tolerant streaming systems for online data and periodic batch processing for offline data, ensuring effective synchronization despite variations in data volume, update frequency, and schema. Our approach incorporates in-memory storage, sliding time windows, and feature registries to support applications such as machine learning model training and real-time inference in recommendation systems. Experimental results on real-world retail data demonstrate that <span>ROSI</span> is highly robust, with a reduced growth rate of overall latency as data size increases linearly. Additionally, sequential recommendation systems built on the integrated dataset show a 6.25% improvement in ranking metrics. Overall, the proposed hybrid pipeline facilitates more personalized, omnichannel customer experiences while enhancing operational efficiency.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102465"},"PeriodicalIF":2.7,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144365621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CIAGELP: Clustering Inspired Augmented Graph Embedding based Link Prediction in dynamic networks","authors":"Nisha Singh , Mukesh Kumar , Siddharth Kumar , Bhaskar Biswas","doi":"10.1016/j.datak.2025.102464","DOIUrl":"10.1016/j.datak.2025.102464","url":null,"abstract":"<div><div>Numerous methods have long been explored for the crucial and intricate task of link prediction. Among the most effective are approaches that generate embeddings from various graph components such as nodes, edges, and groups. These representations project the vertex space into a lower-dimensional space, ensuring that vertices and edges with similar contexts are represented closely. While random walk-based embedding (RWE) methods have shown significant improvements, their performance tends to be limited on dynamic networks. To address this, we introduce CIAGELP (Clustering Inspired Augmented Graph Embedding-based Link Prediction), a distinctive approach that utilizes an augmented graph to generate more promising paths and, consequently, more efficient embeddings. The graph is augmented using a customized pairwise clustering coefficient, which not only captures the local structural context but also strongly influences the strength of connections between pairs of nodes. Additionally, to address the drawbacks of previous RWE approaches on dynamic networks, such as inferior accuracy and high computational cost, our approach employs an enhanced RWE mechanism that considers only the differential graph between subsequent snapshots and generates embeddings efficiently at low cost. Through comprehensive comparisons with different machine learning methods, various augmentation ratios, and state-of-the-art methods based on random walk embeddings, we demonstrate the superiority of our CIAGELP approach. By leveraging augmented graphs with cluster similarity and considering differential network dynamics for embedding generation in dynamic networks, our approach substantially outperforms previous random walk methods.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102464"},"PeriodicalIF":2.7,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144212426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RankT: Ranking-Triplets-based adversarial learning for knowledge graph link prediction","authors":"Jinlei Zhu, Xin Zhang, Xin Ding","doi":"10.1016/j.datak.2025.102463","DOIUrl":"10.1016/j.datak.2025.102463","url":null,"abstract":"<div><div>Many state-of-the-art models have been proposed to predict links in knowledge graphs, aiming to complete the missing edges between entities. These models mainly focus on predicting the link score between source and target entities under certain relations, but ignore the similarities or differences in the overall meanings of triplets in different subgraphs. However, triplets interact with each other in different ways, and link prediction models may fail to capture this interaction. In other words, link prediction is superimposed with potential triplet uncertainties. To address this issue, we propose a Ranking-Triplet-based uncertainty adversarial learning (RankT) framework to improve the embedding representation of triplets for link prediction. First, the proposed model calculates node and edge embeddings through node-level and edge-level neighborhood aggregation, respectively, and then fuses the embeddings with a self-attention transformer to obtain an interactive embedding of the triplet. Second, to reduce the uncertainty of the probability distribution of predicted links, a ranking-triplet-based adversarial loss function is designed, based on the confrontation between the highest-certainty and highest-uncertainty links. Lastly, to strengthen the stability of the adversarial learning, a ranking-triplet-based consistency loss is designed to make the probabilities of the highest positive links converge in the same direction. Ablation studies show the effectiveness of each part of the proposed model, and experimental comparisons show that our model significantly outperforms state-of-the-art models. In conclusion, the proposed model improves link prediction performance while discovering the similar or different meanings of triplets.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102463"},"PeriodicalIF":2.7,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144138084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CredBERT: Credibility-aware BERT model for fake news detection","authors":"Anju R., Nargis Pervin","doi":"10.1016/j.datak.2025.102461","DOIUrl":"10.1016/j.datak.2025.102461","url":null,"abstract":"<div><div>The spread of fake news on social media poses significant challenges, especially in distinguishing credible sources from unreliable ones. Existing methods primarily rely on text analysis, often neglecting user credibility, a key factor in enhancing detection accuracy. To address this, we propose CredBERT, a framework that combines credibility scores derived from user interactions and domain expertise with BERT-based text embeddings. CredBERT employs a multi-classifier ensemble, integrating Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN), BiLSTM, Logistic Regression, and k-Nearest Neighbors, with predictions aggregated using majority voting, ensuring robust performance across both balanced and imbalanced class datasets. This approach effectively merges user credibility with content-based features, improving prediction accuracy and reducing biases. Compared to the state-of-the-art baselines FakeBERT and BiLSTM, CredBERT achieves 6.45% and 4.21% higher accuracy, respectively. By evaluating user credibility and content features, our model not only enhances fake news detection but also helps mitigate misinformation by identifying unreliable sources.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102461"},"PeriodicalIF":2.7,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Philosophical reflections on conceptual modeling as communication","authors":"Mattia Fumagalli , Giancarlo Guizzardi","doi":"10.1016/j.datak.2025.102453","DOIUrl":"10.1016/j.datak.2025.102453","url":null,"abstract":"<div><div>Conceptual modeling is a complex and demanding task. It is centered around the challenge of representing a portion of the world in a way that is objective, understandable, shareable, and reusable by a community of practitioners, who rely on models to design and implement software or to clarify the concepts within a given domain. The difficulty of conceptual modeling stems from the inherent limitations of human representation abilities, which cannot fully capture the infinite richness and diversity of the world, nor the endless possibilities for description enabled by language. Significant effort has been invested in addressing these challenges, particularly in the creation of effective and reusable conceptual models, which has presented numerous difficulties. This paper explores conceptual modeling from a philosophical standpoint, proposing that conceptual models should not be viewed merely as the static representational output of an a priori activity, subject to modification only during a preliminary design phase. Instead, they should be seen as dynamic artifacts that require continuous design, adaptation, and evolution from their inception to their application, and that may serve multiple purposes. The paper seeks to highlight the importance of understanding conceptual modeling primarily as an act of communication, rather than just a process of information transmission. It also aims to clarify the distinction between these two aspects and to examine the potential implications of adopting a <em>communicative approach to modeling</em>. These implications extend not only to the tools and methodologies used in modeling but also to the ethical considerations that arise from such an approach.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102453"},"PeriodicalIF":2.7,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}