Veda C. Storey , Jeffrey Parsons , Arturo Castellanos Bueso , Monica Chiarini Tremblay , Roman Lukyanenko , Alfred Castillo , Wolfgang Maaß
{"title":"Domain knowledge in artificial intelligence: Using conceptual modeling to increase machine learning accuracy and explainability","authors":"Veda C. Storey , Jeffrey Parsons , Arturo Castellanos Bueso , Monica Chiarini Tremblay , Roman Lukyanenko , Alfred Castillo , Wolfgang Maaß","doi":"10.1016/j.datak.2025.102482","DOIUrl":"10.1016/j.datak.2025.102482","url":null,"abstract":"<div><div>Machine learning enables the extraction of useful information from large, diverse datasets. However, despite many successful applications, machine learning continues to suffer from performance and transparency issues. These challenges can be partially attributed to the limited use of domain knowledge by machine learning models. This research proposes using the domain knowledge represented in conceptual models to improve the preparation of the data used to train machine learning models. We develop and demonstrate a method, called <em>Conceptual Modeling for Machine Learning (CMML)</em>, which comprises guidelines for data preparation in machine learning based on conceptual modeling constructs and principles. To assess CMML, we first applied it to two real-world problems to evaluate its impact on model performance. We then solicited an assessment by data scientists on the applicability of the method. 
These results demonstrate the value of CMML for improving machine learning outcomes.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102482"},"PeriodicalIF":2.7,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144534882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Veda C. Storey , Oscar Pastor , Giancarlo Guizzardi , Stephen W. Liddle , Wolfgang Maaß , Jeffrey Parsons , Jolita Ralyté , Maribel Yasmina Santos
{"title":"Large language models for conceptual modeling: Assessment and application potential","authors":"Veda C. Storey , Oscar Pastor , Giancarlo Guizzardi , Stephen W. Liddle , Wolfgang Maaß , Jeffrey Parsons , Jolita Ralyté , Maribel Yasmina Santos","doi":"10.1016/j.datak.2025.102480","DOIUrl":"10.1016/j.datak.2025.102480","url":null,"abstract":"<div><div>Large Language Models (LLMs) are being rapidly adopted for many activities in organizations, business, and education. Included in their applications are capabilities to generate text, code, and models. This leads to questions about their potential role in the conceptual modeling part of information systems development. This paper reports on a panel presented at the <em>43rd International Conference on Conceptual Modeling</em> where researchers discussed the current and potential role of LLMs in conceptual modeling. The panelists discussed applications and interest levels and expressed both optimism and caution in the adoption of LLMs. Suggested is a need for much continued research by the conceptual modeling community on LLM development and their role in research and teaching.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102480"},"PeriodicalIF":2.7,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144517377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Md. Mehedi Hassan , Anindya Nag , Riya Biswas , Md Shahin Ali , Sadika Zaman , Anupam Kumar Bairagi , Chetna Kaushal
{"title":"Explainable artificial intelligence for natural language processing: A survey","authors":"Md. Mehedi Hassan , Anindya Nag , Riya Biswas , Md Shahin Ali , Sadika Zaman , Anupam Kumar Bairagi , Chetna Kaushal","doi":"10.1016/j.datak.2025.102470","DOIUrl":"10.1016/j.datak.2025.102470","url":null,"abstract":"<div><div>Recently, artificial intelligence has gained a lot of momentum and is predicted to surpass expectations across a range of industries. However, explainability is a major challenge due to sub-symbolic techniques like Deep Neural Networks and Ensembles, which were absent during the boom of AI. The practical application of AI in numerous application areas is greatly undermined by this lack of explainability. To counter the lack of perception of AI-based systems, Explainable AI (XAI) aims to increase transparency and human comprehension of black-box AI models. The explainability problem has been approached using a variety of XAI strategies; however, given the complexity of the search space, it may be tricky for ML developers and data scientists to construct XAI applications and choose the optimal XAI algorithms. To aid developers, this paper reviews the frameworks, surveys, operations, and explainability methodologies currently available for producing reasoning for predictions from Natural Language Processing models. Additionally, a thorough analysis of current work in explainable NLP and AI is undertaken, providing researchers worldwide with exploration, insight, and idea development opportunities. 
Finally, the authors highlight gaps in the literature and offer ideas for future research in this area.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102470"},"PeriodicalIF":2.7,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144297314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling cancellation dynamics: A two-stage model for predictive analytics","authors":"Soumyadeep Kundu , Soumya Roy , Archit Shukla , Arqum Mateen","doi":"10.1016/j.datak.2025.102467","DOIUrl":"10.1016/j.datak.2025.102467","url":null,"abstract":"<div><div>Booking cancellations have an adverse impact on the performance of firms in the hospitality industry. Most of the studies in this domain have considered the questions of whether a booking would be cancelled or not (if). While useful, given the nature of the industry, it would be important to understand the timing of cancellation as well (when). Answering the inter-temporal nature of the question would help hotels to devise appropriate strategies to accommodate this change. In our study, we have proposed a novel two-stage model, which predicts both the likelihood (if) as well as the timing (when) of cancellation, using various statistical and machine learning techniques. We find that significant predictors include the average daily rate (which is an indicator of average rental revenue earned for an occupied room per day), month of arrival, day of arrival, and the lead time. Our insights can help hotels design bespoke cancellation policies and exercise personalised services and interventions for guests.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102467"},"PeriodicalIF":2.7,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144279321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-feature classification for fake news detection using multiscale and atrous convolution-based adaptive temporal convolution network","authors":"Rashmi Rane , R. Subhashini","doi":"10.1016/j.datak.2025.102469","DOIUrl":"10.1016/j.datak.2025.102469","url":null,"abstract":"<div><div>Amid the exponential growth of social media, platforms such as Facebook, Twitter, YouTube and Instagram have become main sources of news and information. Fake information uploaded by particular users can spread quickly and affect how people use these media. In this research work, a novel deep learning-based framework is proposed to effectively detect fake news and enhance the trust of social media users. At first, the required text data is gathered from benchmark resources and passed to the preprocessing stage. Then, the preprocessed data is fed into the feature extraction phase, where Bidirectional Encoder Representations from Transformers (BERT), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN) are used to extract meaningful information from the data and improve accuracy. This phase generates three sets of features (BERT, temporal, and spatial), which are then passed to the detection phase. Here, the Multiscale and Atrous Convolution-based Adaptive Temporal Convolution Network (MAC-ATCN) is used to identify and categorize false information, supporting more reliable outcomes and decision-making. Additionally, the Modified Osprey Optimization Algorithm (MOOA) is employed to fine-tune the parameters and prevent overfitting on larger data. It also helps address imbalanced-dataset issues by varying the hyperparameters during training. Finally, the overall detection performance is validated with various performance measures and compared with existing works. 
The developed method achieved accuracies of 93.74 % on dataset 1 and 92.82 % on dataset 2. Effectively identifying fake news on social media can help users make timely, informed decisions. This helps to prevent the spread of misinformation and protects individuals from harmful consequences.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102469"},"PeriodicalIF":2.7,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144271234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic query expansion for enhancing document retrieval system in healthcare application using GAN based embedding and hyper-tuned DAEBERT algorithm","authors":"Deepak Vishwakarma , Suresh Kumar","doi":"10.1016/j.datak.2025.102468","DOIUrl":"10.1016/j.datak.2025.102468","url":null,"abstract":"<div><div>Query expansion is a useful technique for improving document retrieval systems' dependability and performance. Search engines frequently employ query expansion strategies to improve Information Retrieval (IR) performance and elucidate users' information requirements. Although there are several methods for automatically expanding queries, the list of documents that are returned can occasionally be lengthy and contain a lot of useless information, particularly when searching the Web. As the volume of medical documents grows, automatic query expansion may struggle with efficiency and real-time application. Thus, a Hyper-Tuned Dual Attention Enhanced Bi-directional Encoder Representation from Transformers (HT-DAEBERT) model with an automatic ranking-based query expansion system is developed to enhance medical document retrieval. Initially, the user's query from the medical document corpus is collected and augmented using a Generative Adversarial Network (GAN) approach. The augmented text is then pre-processed to improve text quality through tokenization, acronym expansion, stemming, stop word removal, hyperlink removal, and spell correction. Keywords are then extracted from the pre-processed text using the Proximity-based Keyword Extraction (PKE) technique. Afterwards, the words are converted into vector form using the HT-DAEBERT model. In DAEBERT, key parameters such as dropout rate and weight decay are selected using the Election Optimization Algorithm (EOA). 
After that, a ranking-based query expansion approach was employed to enhance the document retrieval system. The proposed method achieves an accuracy of 97.60 %, a Hit Rate of 98.30 %, a PPV of 93.40 %, an F1-Score of 95.79 %, and an NPV of 97.50 %. This approach improves the accuracy and relevance of document retrieval in healthcare, potentially leading to better patient care and enhanced clinical outcomes.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102468"},"PeriodicalIF":2.7,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144306983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROSI: A hybrid solution for omni-channel feature integration in E-commerce","authors":"Luyi Ma , Shengwei Tang , Anjana Ganesh, Jiao Chen, Aashika Padmanabhan, Malay Patel, Jianpeng Xu, Jason Cho, Evren Korpeoglu, Sushant Kumar, Kannan Achan","doi":"10.1016/j.datak.2025.102465","DOIUrl":"10.1016/j.datak.2025.102465","url":null,"abstract":"<div><div>Efficient integration of customer behavior data across multiple channels, including online and in-store interactions, is essential for developing recommendation systems that enhance customer experiences and maintain a competitive edge in e-commerce. However, the integration process faces several challenges, including data synchronization and discrepancies in data schemas. In this study, we introduce a hybrid data pipeline, <span>ROSI</span> (Retail Online-Store Integration), designed to integrate real-time streaming data from online platforms with batch data from in-store interactions. <span>ROSI</span> employs scalable, fault-tolerant streaming systems for online data and periodic batch processing for offline data, ensuring effective synchronization despite variations in data volume, update frequency, and schema. Our approach incorporates in-memory storage, sliding time windows, and feature registries to support applications such as machine learning model training and real-time inference in recommendation systems. Experimental results on real-world retail data demonstrate that <span>ROSI</span> is highly robust, with a reduced growth rate of overall latency when data size increases linearly. Additionally, sequential recommendation systems built on the integrated dataset show a 6.25% improvement in ranking metrics. 
Overall, the proposed hybrid pipeline facilitates more personalized, omnichannel customer experiences while enhancing operational efficiency.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102465"},"PeriodicalIF":2.7,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144365621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CIAGELP: Clustering Inspired Augmented Graph Embedding based Link Prediction in dynamic networks","authors":"Nisha Singh , Mukesh Kumar , Siddharth Kumar , Bhaskar Biswas","doi":"10.1016/j.datak.2025.102464","DOIUrl":"10.1016/j.datak.2025.102464","url":null,"abstract":"<div><div>For a long time, numerous methods have been explored for the crucial and intricate task of link prediction. Among the most effective approaches are those that involve generating embeddings from various graph components such as nodes, edges, and groups. These representations aim to project the vertex space into a lower-dimensional space, ensuring that vertices and edges with similar contexts are represented closely. While random walk-based embedding (RWE) methods have shown significant improvements, their performance tends to be limited for dynamic networks. To address this, we have introduced CIAGELP (Clustering Inspired Augmented Graph Embedding-based Link Prediction), a distinctive approach that utilizes an augmented graph to generate more promising paths and consequently efficient embeddings. The augmentation of the graph is achieved through a customized pairwise clustering coefficient, which not only captures the local structural context but also strongly influences the strength of connections between pairs of nodes. Additionally, to address the drawbacks of previous RWE approaches on dynamic networks, such as inferior accuracy and high computational cost, our approach employs an enhanced RWE mechanism that considers only the differential graph among subsequent snapshots and generates embeddings efficiently at a low cost. Through comprehensive comparisons with different machine learning methods, various augmentation ratios, and state-of-the-art methods based on random walk embeddings, we demonstrate the superiority of our CIAGELP approach. 
By leveraging augmented graphs with cluster similarity and considering differential network dynamics for embedding generation in dynamic networks, our approach substantially outperforms previous random walk methods.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102464"},"PeriodicalIF":2.7,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144212426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RankT: Ranking-Triplets-based adversarial learning for knowledge graph link prediction","authors":"Jinlei Zhu, Xin Zhang, Xin Ding","doi":"10.1016/j.datak.2025.102463","DOIUrl":"10.1016/j.datak.2025.102463","url":null,"abstract":"<div><div>To complete the missing edges between entities in a knowledge graph, many state-of-the-art models have been proposed to predict links. These models mainly focus on predicting the link score between source and target entities with certain relations, but ignore the similarities or differences in the overall meanings of triplets in different subgraphs. However, the triplets interact with each other in different ways, and the link prediction model may fail to capture this interaction. In other words, link prediction is superimposed with potential triplet uncertainties. To address this issue, we propose a Ranking-Triplet-based uncertainty adversarial learning (RankT) framework to improve the embedding representation of triplets for link prediction. Firstly, the proposed model calculates the node and edge embeddings through node-level and edge-level neighborhood aggregation, respectively, and then fuses the embeddings with a self-attention transformer to obtain the interactive embedding of the triplet. Secondly, to reduce the uncertainty of the probability distribution of predicted links, a ranking-triplet-based adversarial loss function is designed, based on the confrontation between the highest-certainty and highest-uncertainty links. Lastly, to strengthen the stability of the adversarial learning, a ranking-triplet-based consistency loss is designed to make the probabilities of the highest positive links converge in the same direction. The ablation studies show the effectiveness of each part of the proposed model. The comparison of experimental results shows that our model significantly outperforms the state-of-the-art models. 
In conclusion, the proposed model improves the link prediction performance while discovering the similar or different meanings of triplets.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102463"},"PeriodicalIF":2.7,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144138084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient opinion mining for imbalanced customer reviews in last-mile services","authors":"Sangbaek Kim , Hongchul Lee , Jiho Kim","doi":"10.1016/j.datak.2025.102466","DOIUrl":"10.1016/j.datak.2025.102466","url":null,"abstract":"<div><div>Last-mile (LM) service manages the final stage of delivering products to customers in supply chains and logistics. Consumer opinion mining has recently become essential for providing high-level LM service quality. However, existing methods face challenges with domain-specific terminology and class imbalance. Therefore, we propose LM-BERT, a BERT-based text classification model specialized in LM service sentiment analysis. In addition, we introduce a teacher–student LM-BERT framework that alleviates data imbalance in online e-commerce reviews through high-confidence pseudo-labeling. After evaluating six Transformer models, KLUE-BERT was identified as the most suitable for our baseline. Experimental results demonstrate that domain-specific knowledge transfer improves performance by 1.78 % on seen data and 1.31 % on unseen data. Statistical verification and explainable artificial intelligence techniques were employed to confirm the reliability of our approach to enhance qualitative performance and expand domain knowledge. We also conducted an ablation study confirming that high-confidence pseudo-labeling (<em>t</em> = 0.99) outperforms the traditional resampling method. 
The proposed LM-BERT model can effectively support LM service quality evaluation and management based on the voice of the customer in e-commerce.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"160 ","pages":"Article 102466"},"PeriodicalIF":2.7,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144780863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}