{"title":"The “Curious Case of Contexts” in Retrieval-Augmented Generation With a Combination of Labeled and Unlabeled Data","authors":"Payel Santra, Madhusudan Ghosh, Debasis Ganguly, Partha Basuchowdhuri, Sudip Kumar Naskar","doi":"10.1002/widm.70021","DOIUrl":"https://doi.org/10.1002/widm.70021","url":null,"abstract":"With the growing reliance on LLMs for a wide range of NLP tasks, optimizing the use of labeled and unlabeled data for effective context generation has become critical. This work explores the interplay between two prominent methodologies in few-shot learning: in-context learning (ICL), which utilizes labeled task-specific data, and retrieval-augmented generation (RAG), which leverages unlabeled external knowledge to augment generative models. Since each has its individual limitations, we propose a novel hybrid approach to obtain “the best of both worlds” by dynamically integrating both labeled and unlabeled data towards improving the downstream performance of LLMs. Our methodology, which we call LU-RAG (labeled and unlabeled RAG), recomputes the scores of top-k labeled instances and top-m unlabeled passages to refine context selection. Our experimental results demonstrate that LU-RAG consistently outperforms both standalone ICL and RAG across multiple benchmarks, showing significant gains in downstream performance. 
Furthermore, we show that LU-RAG performs better with a semantic neighborhood as compared to a lexical one, highlighting its ability to generalize effectively.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"134 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144165784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey on Causal Inference-Driven Data Bias Optimization in Recommendation Systems: Principles, Opportunities and Challenges","authors":"Yongkang Li, Xingyu Zhu, Yuheng Wu, Wenxu Zhao, Xiaona Xia","doi":"10.1002/widm.70020","DOIUrl":"https://doi.org/10.1002/widm.70020","url":null,"abstract":"Recommendation systems predict user interests and recommend items for online platforms including e-commerce, social networks, and decision systems. However, data bias has become a significant obstacle, severely impacting the accuracy, fairness, and reliability of recommendation results. This survey examines causal inference for optimizing recommendation systems and mitigating data bias, addressing three questions: (1) Bias types and performance impacts; (2) Causal inference mitigation methods; (3) Approach advantages, limitations, and research opportunities. The motivation for this survey stems from the limitations of traditional debiasing methods, which often fail to account for causal relationships and struggle in dynamic, real-world scenarios. Causal inference provides a robust framework for identifying and addressing the underlying causes of bias, enabling more transparent and accurate recommendation systems. Therefore, we define three critical stages of bias: bias in the data stage, model selection stage, and model evaluation stage. For each stage, causal inference-based optimization methods are introduced and critically analyzed. Unlike traditional debiasing methods, this study analyzes data augmentation and regularization techniques as potential strategies for future research. 
Overall, this survey highlights the ability of causal inference to uncover and control confounding factors, offering deeper insights into the mechanisms driving biases.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Artificial Intelligence-Based Waste Management: A Review of Classification, Techniques, Issues, and Challenges","authors":"Dhanashree Vipul Yevle, Palvinder Singh Mann","doi":"10.1002/widm.70025","DOIUrl":"https://doi.org/10.1002/widm.70025","url":null,"abstract":"Artificial intelligence (AI) is emerging as a transforming force in waste management practices, enabling new ways of bringing efficiency and effectiveness. This survey presents methods related to waste management, which are categorized systematically for understanding the effectiveness of various AI-based techniques. The study undertakes a critical review of relevant research works that epitomize major advances and methodologies of AI-driven waste management. The manuscript provides an exhaustive taxonomy, dividing AI methods into Supervised Learning, Unsupervised Learning, and Reinforcement Learning, and then subdividing Supervised Learning into four broad categories: Machine Learning-based Classification, CNNs, Transfer Learning, and Hybrid or Ensemble Learning. We further evaluate different datasets applied in performance benchmarking and the efficacy of the various AI models. We also discuss some critical issues, such as the problem of available data quality, poor generalization of models, and integration of systems. Future research directions, which would go a long way toward helping to surmount these challenges, are also discussed. 
This survey aims to present a structured framework for understanding current AI applications in waste management, therefore guiding ongoing and future research in the field.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144087993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Causal Learning Through Graph Neural Networks: An In-Depth Review","authors":"Simi Job, Xiaohui Tao, Taotao Cai, Haoran Xie, Lin Li, Qing Li, Jianming Yong","doi":"10.1002/widm.70024","DOIUrl":"https://doi.org/10.1002/widm.70024","url":null,"abstract":"In machine learning, exploring data correlations to predict outcomes is a fundamental task. Recognizing causal relationships embedded within data is pivotal for a comprehensive understanding of system dynamics, the significance of which is paramount in data-driven decision-making processes. Beyond traditional methods, there has been a shift toward using graph neural networks (GNNs) for causal learning, given their capabilities as universal data approximators. Thus, a thorough review of the advancements in causal learning using GNNs is both relevant and timely. To structure this review, we introduce a novel taxonomy that encompasses various state-of-the-art GNN methods used in studying causality. GNNs are further categorized based on their applications in the causality domain. We further provide an exhaustive compilation of datasets integral to causal learning with GNNs to serve as a resource for practical study. This review also touches upon the application of causal learning across diverse sectors. 
We conclude the review with insights into potential challenges and promising avenues for future exploration in this rapidly evolving field of machine learning.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"97 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144087994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geospatial Data Clustering in Network Space: A Survey","authors":"Loan T. T. Nguyen, Trang T. D. Nguyen, Quang-Thinh Bui, Bay Vo","doi":"10.1002/widm.70023","DOIUrl":"https://doi.org/10.1002/widm.70023","url":null,"abstract":"Geospatial data enhances traditional datasets by integrating spatial and temporal dimensions, facilitating advanced visualizations and comprehensive analytical insights. As a fundamental aspect of geospatial analytics, geospatial data clustering (GDC) has become a prominent area of academic research, playing a critical role in theoretical exploration and applied domains. GDC seeks to group geospatial objects based on inherent similarities, a necessity driven by modern datasets' increasing scale and complexity, particularly those within geographic information systems (GIS). This paper highlights key challenges and advancements in GDC, including spatial data clustering (SDC), clustering techniques within GIS, and algorithms designed for geospatial data clustering in network spaces (GDC in NS). Practical implementations of these methodologies encompass diverse applications such as hotspot analysis, infectious disease monitoring, transportation optimization, urban traffic management, and emergency response planning. These contributions are foundational for advancing scholarly research and addressing domain-specific challenges in this field.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144066652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling Explainable AI in Healthcare: Current Trends, Challenges, and Future Directions","authors":"Abdul Aziz Noor, Awais Manzoor, Muhammad Deedahwar Mazhar Qureshi, M. Atif Qureshi, Wael Rashwan","doi":"10.1002/widm.70018","DOIUrl":"https://doi.org/10.1002/widm.70018","url":null,"abstract":"This overview investigates the evolution and current landscape of eXplainable Artificial Intelligence (XAI) in healthcare, highlighting its implications for researchers, technology developers, and policymakers. Following the PRISMA protocol, we analyzed 89 publications from January 2000 to June 2024, spanning 19 medical domains, with a focus on Neurology and Cancer as the most studied areas. Various data types are reviewed, including tabular data, medical imaging, and clinical text, offering a comprehensive perspective on XAI applications. Key findings identify significant gaps, such as the limited availability of public datasets, suboptimal data preprocessing techniques, insufficient feature selection and engineering, and the limited utilization of multiple XAI methods. Additionally, the lack of standardized XAI evaluation metrics and practical obstacles in integrating XAI systems into clinical workflows are emphasized. We provide actionable recommendations, including the design of explainability‐centric models, the application of diverse and multiple XAI methods, and the fostering of interdisciplinary collaboration. These strategies aim to guide researchers in building robust AI models, assist technology developers in creating intuitive and user‐friendly AI tools, and inform policymakers in establishing effective regulations. 
Addressing these gaps will promote the development of transparent, reliable, and user‐centred AI systems in healthcare, ultimately improving decision‐making and patient outcomes.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143933226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weak Supervision: A Survey on Predictive Maintenance","authors":"Antonio M. Martínez‐Heredia, Sebastián Ventura","doi":"10.1002/widm.70022","DOIUrl":"https://doi.org/10.1002/widm.70022","url":null,"abstract":"The maintenance advancements achieved in Industry 4.0 generate large amounts of data, necessitating complete, accurate, and precise labels for training datasets to align with corresponding ground truth. These labels serve as annotations for early anomaly detection. Delivering high‐quality annotations derived from weak labels and striking a balance between annotation efforts and accuracy are critical tasks. Consequently, researchers have focused their attention on Weakly Supervised Learning methods, which have shown effectiveness in handling datasets characterized by incomplete, imprecise, and erroneous labels across various maintenance applications. In this survey, the authors aim to address a gap in the existing literature by conducting a comprehensive examination of Weakly Supervised Learning for Predictive Maintenance, categorizing related works. Furthermore, the survey discusses challenges and identifies open research lines.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143933235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long Document Classification in the Transformer Era: A Survey on Challenges, Advances, and Open Issues","authors":"Renzo Alva Principe, Nicola Chiarini, Marco Viviani","doi":"10.1002/widm.70019","DOIUrl":"https://doi.org/10.1002/widm.70019","url":null,"abstract":"Automatic Document Classification (ADC) refers to the process of automatically categorizing or labeling documents into predefined classes or categories. Its effectiveness may depend on various factors, including the models used for the formal representation of documents, the classification techniques applied, or a combination of both. Recently, Transformer models have gained popularity due to their pre‐training on large corpora, allowing for flexible knowledge transfer to downstream tasks, such as ADC. However, such models can face challenges when handling “long” documents, particularly due to input sequence length constraints, which can have knock‐on effects on the task we refer to as Automatic Long Document Classification (ALDC). Distinct models for tackling this limitation of Transformers have been proposed over the past few years, and employed to perform ALDC; however, their application to this task has resulted in some inconsistent outcomes, struggles to surpass simple baselines, and difficulties in generalizing across diverse datasets and scenarios. 
That is why this survey aims to illustrate these limitations, by: (i) presenting current long document representation issues and solutions proposed in the literature; (ii) based on such solutions, illustrating a comprehensive analysis of their application in ALDC and their effectiveness; and (iii) discussing current evaluation strategies in ALDC with particular reference to suitable baselines and actual long‐document benchmark datasets.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143930751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Review on Information Fusion‐Based Data Mining for Improving Complex Anomaly Detection","authors":"Sorin‐Claudiu Moldovan, Laszlo Barna Iantovics","doi":"10.1002/widm.70017","DOIUrl":"https://doi.org/10.1002/widm.70017","url":null,"abstract":"Anomaly predicated upon multiple distributed hybrid sensors frequently uses hybrid approaches, integrating techniques derived from statistical analysis, probability, data mining, machine learning, deep learning, and signal denoising. Many of these methods are based on the analysis of irregularities, data continuity, correlation, and data consistency, aiming to discern anomalous patterns from normal behavior. By leveraging these techniques information fusion aims to enhance situational awareness, detect potential threats or abnormalities, and improve decision‐making processes in complex environments. It addresses uncertainties by integrating data from diverse sources, thereby enhancing performance, and reducing dependency on individual sensors. This study examines applications based on single and multiple sensor data, revealing common strategies, identifying strengths and weaknesses, and potential solutions for detecting and diagnosing anomalies by analyzing low, large, and complex data derived from the context of homogeneous or heterogeneous systems. Information fusion techniques are evaluated for their performance on various levels of algorithm complexity. This in‐depth bibliographic study involved searching top indexing databases such as Web of Science and Scopus. The inclusion criteria were articles published between 2012 and 2024. The search capitalized on specific keywords as follows: “sensor malfunction,” “sensor anomaly,” “sensor failure,” “sensor fusion,” and “anomaly data mining.” Publications that did not strictly focus on analytical processing for anomaly detection, diagnosis, and prognosis in sensor data were excluded. 
In conclusion, the practice of information fusion promotes transparency by elucidating the process of combining information, thereby enabling the inclusion of multitude of perspectives, and aligning with established best practices in the field. Data deviation remains the primary criterion for detecting anomalies using mostly deep learning and extensively hybrid techniques. Nevertheless, state‐of‐the‐art algorithms based on neural networks still require further contextual interpretation and analysis. Functional safety and safety of intended functionality breaching can lead to decision‐making errors, physical harm, and erosion of trust in autonomous systems. This is due to the lack of interpretability in AI approaches, making it challenging to predict and understand the system's behavior under various conditions.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143930745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Role of Causality in Explainable Artificial Intelligence","authors":"Gianluca Carloni, Andrea Berti, Sara Colantonio","doi":"10.1002/widm.70015","DOIUrl":"https://doi.org/10.1002/widm.70015","url":null,"abstract":"Causality and eXplainable Artificial Intelligence (XAI) have developed as separate fields in computer science, even though the underlying concepts of causation and explanation share common ancient roots. This is further enforced by the lack of review works jointly covering these two fields. In this paper, we investigate the literature to try to understand how and to what extent causality and XAI are intertwined. More precisely, we seek to uncover what kinds of relationships exist between the two concepts and how one can benefit from them, for instance, in building trust in AI systems. As a result, three main perspectives are identified. In the first one, the lack of causality is seen as one of the major limitations of current AI and XAI approaches, and the “optimal” form of explanations is investigated. The second is a pragmatic perspective and considers XAI as a tool to foster scientific exploration for causal inquiry, via the identification of pursue‐worthy experimental manipulations. Finally, the third perspective supports the idea that causality is propaedeutic to XAI in three possible manners: exploiting concepts borrowed from causality to support or improve XAI, utilizing counterfactuals for explainability, and considering accessing a causal model as explaining itself. To complement our analysis, we also provide relevant software solutions used to automate causal tasks. 
We believe our work provides a unified view of the two fields of causality and XAI by highlighting potential domain bridges and uncovering possible limitations.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143920261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}