{"title":"Handling Out-of-Distribution Data: A Survey","authors":"Lakpa Tamang;Mohamed Reda Bouadjenek;Richard Dazeley;Sunil Aryal","doi":"10.1109/TKDE.2025.3592614","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3592614","url":null,"abstract":"In the field of Machine Learning (ML) and data-driven applications, one of the significant challenge is the change in data distribution between the training and deployment stages, commonly known as distribution shift. This paper outlines different mechanisms for handling two main types of distribution shifts: (i) <bold>Covariate shift:</b> where the value of features or covariates change between train and test data, and (ii) <bold>Concept/Semantic-shift:</b> where model experiences shift in the concept learned during training due to emergence of novel classes in the test phase. We sum up our contributions in three folds. First, we formalize distribution shifts, recite on how the conventional method fails to handle them adequately and urge for a model that can simultaneously perform better in all types of distribution shifts. Second, we discuss why handling distribution shifts is important and provide an extensive review of the methods and techniques that have been developed to detect, measure, and mitigate the effects of these shifts. Third, we discuss the current state of distribution shift handling mechanisms and propose future research directions in this area. Overall, we provide a retrospective synopsis of the literature in the distribution shift, focusing on OOD data that had been overlooked in the existing surveys.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5948-5966"},"PeriodicalIF":10.4,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145049812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GeoRecover: Recovery From Poisoning Attacks for LDP-Enabled Spatial Density Aggregation","authors":"Xinyue Sun;Qingqing Ye;Haibo Hu;Jiawei Duan;Hui He;Weizhe Zhang","doi":"10.1109/TKDE.2025.3593289","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3593289","url":null,"abstract":"The spatial density distribution collected and aggregated from users’ trajectory data is vital for location-based services like regional popularity analysis and congestion measurement. However, spatial density aggregation poses privacy concerns since trajectory data usually originate from users. Local differential privacy (LDP) addresses these concerns by allowing users to perturb their data before reporting it. Yet, LDP is vulnerable to poisoning attacks where attackers manipulate data from malicious users. Recent studies attempt to defend against such attacks in LDP-enabled frequency estimation but suffer from inaccurate data recovery due to empirical presets of malicious user proportions and inaccurate malicious data estimation. These issues worsen in spatial density aggregation, as high-dimensional trajectory data help conceal malicious information. In this work, we propose GeoRecover, a method to defend against poisoning attacks in LDP-enabled spatial density aggregation by addressing previous limitations. GeoRecover designs an adaptive model to unify these attacks. Under this model, GeoRecover estimates the proportion of malicious users using statistical differences between genuine and malicious data and learns malicious data statistics through LDP properties. This allows GeoRecover to recover accurate spatial density distribution by subtracting malicious users’ contributions. Evaluations on two real-world datasets show GeoRecover outperforms state-of-the-art methods in recovery accuracy, defense capability, and practical performance.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5919-5933"},"PeriodicalIF":10.4,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145051031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instance-Dependent Incomplete Multi-Label Feature Selection by Fuzzy Tolerance Relation and Fuzzy Mutual Implication Granularity","authors":"Jianhua Dai;Wenxiang Chen;Yuhua Qian;Witold Pedrycz","doi":"10.1109/TKDE.2025.3591461","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3591461","url":null,"abstract":"Multi-label feature selection is an effective approach to mitigate the high-dimensional feature problem in multi-label learning. Most existing multi-label feature selection methods either assume that the data is complete, or that either the features or the labels are incomplete. So far, there are few studies on multi-label data with missing features and labels. In many cases, missing features in instances of multi-label data often lead to missing labels, which is ignored by existing studies. We define this type of data as instance-dependent incomplete multi-label data. In this paper, we propose a feature selection method for instance-dependent incomplete multi-label data. Firstly, we use the positive correlations between features to reconstruct the feature space, thereby recovering missing values and enhancing non-missing values. Secondly, we use fuzzy tolerance relation to guide label recovery, and utilize fuzzy mutual implication granularity to impose structural constraint on the projection matrix. Thirdly, we achieve feature selection by eliminating the impact of incomplete instances and imposing sparse regularization on the projection matrix. Finally, we provide a convergent solution for the proposed feature selection framework. Comparative experiments with existing multi-label feature selection methods show that our method can perform effective feature selection on instance-dependent incomplete multi-label data.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5994-6008"},"PeriodicalIF":10.4,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145051029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smoothness-Induced Efficient Incomplete Multi-View Clustering","authors":"Tianchuan Yang;Haiqiang Chen;Haoyan Yang;Man-Sheng Chen;Xiangcheng Li;Youming Sun;Chang-Dong Wang","doi":"10.1109/TKDE.2025.3591500","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3591500","url":null,"abstract":"Efficient incomplete multi-view clustering has received increasing attention due to its ability to handle large-scale and missing data. Although existing methods have promising performance, 1) they typically generate anchors directly from incomplete and noisy raw data, resulting in uncomprehensive anchor coverage and unreliable results; 2) they typically use only sparse regularization to remove noise and overlook outliers; 3) they ignore the inherent consistency of features in a view. To address these issues, we propose a smoothness-induced efficient incomplete multi-view clustering (SEIC) method. SEIC regards available data as natural anchors selected from complete data, and performs matrix decomposition only on them to obtain reliable small-size representation matrices. View-specific representation matrices are constructed as a tensor to capture consensus and guide matrix decomposition. More significantly, we enforce both smoothness and low-rank coupling on the tensor. Smoothness induces continuous variation of the tensor to further eliminate noise and enhance the relation among features. Benefiting from the noise robustness of SEIC, we design an adaptive noise balance parameter that renders SEIC parameter-free. Furthermore, by constructing a sparse anchor graph on the learned tensor, we propose the spectral clustering version SEIC-SC. Experiments on multiple datasets demonstrate the superior performance and efficiency of SEIC and SEIC-SC.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"6173-6188"},"PeriodicalIF":10.4,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145036982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GI-Graph: A Generative Invariant Graph Learning Scheme Towards Out-of-Distribution Generalization","authors":"Sanfeng Zhang;Xinyi Liu;Zihao Qi;Xingchen Yan;Wang Yang","doi":"10.1109/TKDE.2025.3592640","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3592640","url":null,"abstract":"When distribution shifts occur between testing and training graph data, out-of-distribution (OOD) samples undermine the performance of graph neural networks (GNNs). To improve adaptive OOD generalization of GNNs, this paper introduces a novel generative invariant graph learning framework, named GI-Graph. It consists of four modules: subgraph extractor, generative environment subgraph augmentation, generative invariant subgraph learning, and query feedback module. The subgraph extractor decomposes a graph sample into an environment subgraph and an invariant subgraph and improves extraction accuracy through query feedback. GI-Graph uses a diffusion model to generate diverse environment subgraphs, augmenting the OOD data. By combining diffusion models, contrastive learning, and attribute prediction networks, GI-Graph also generates augmented invariant subgraphs with significant identically distributed features and consistency of labels. Experimental results demonstrate that the controllable environment subgraph and invariant subgraph augmentation effectively improve the OOD generalization capability of GI-Graph, especially in capturing invariant features and maintaining category consistency across environments. Additionally, the contrastive learning-based fine-tuning method enables GI-Graph to quickly adapt to evolving environments. This paper verifies the effectiveness of the generative invariant graph learning scheme in graph OOD generalization.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5934-5947"},"PeriodicalIF":10.4,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145050822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey of Text-to-SQL in the Era of LLMs: Where Are We, and Where Are We Going?","authors":"Xinyu Liu;Shuyu Shen;Boyan Li;Peixian Ma;Runzhi Jiang;Yuxin Zhang;Ju Fan;Guoliang Li;Nan Tang;Yuyu Luo","doi":"10.1109/TKDE.2025.3592032","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3592032","url":null,"abstract":"Translating users’ natural language queries (NL) into SQL queries (i.e., Text-to-SQL, <italic>a.k.a.</i> NL2SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of Text-to-SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of Text-to-SQL techniques powered by LLMs, covering its entire lifecycle from the following four aspects: (1) <italic>Model:</i> Text-to-SQL translation techniques that tackle not only NL ambiguity and under-specification, but also properly map NL with database schema and instances; (2) <italic>Data:</i> From the collection of training data, data synthesis due to training data scarcity, to Text-to-SQL benchmarks; (3) <italic>Evaluation:</i> Evaluating Text-to-SQL methods from multiple angles using different metrics and granularities; and (4) <italic>Error Analysis:</i> analyzing Text-to-SQL errors to find the root cause and guiding Text-to-SQL models to evolve. Moreover, we offer a rule of thumb for developing Text-to-SQL solutions. Finally, we discuss the research challenges and open problems of Text-to-SQL in the LLMs era.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5735-5754"},"PeriodicalIF":10.4,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145050794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EGNN: Exploring Structure-Level Neighborhoods in Graphs With Varying Homophily Ratios","authors":"Songwei Zhao;Bo Yu;Sinuo Zhang;Zhejian Yang;Jifeng Hu;Philip S. Yu;Hechang Chen","doi":"10.1109/TKDE.2025.3591771","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3591771","url":null,"abstract":"Graph neural networks (GNNs) have garnered significant attention for their competitive performance on graph-structured data. However, many existing methods are commonly constrained by the homophily assumption, making them overly reliant on the uniform neighbor propagation, which limits their ability to generalize to heterophilous graphs. Although some approaches extend aggregation to multi-hop neighbors, adapting neighborhood sizes on a per-node basis remains a significant challenge. In view of this, we propose an Evolutionary Graph Neural Network (EGNN) with adaptive structure-level aggregation and label smoothing, offering a novel solution to the aforementioned drawback. The core innovation of EGNN lies in assigning each node a <italic>personalized</i> neighborhood structure utilizing <italic>behavior-level</i> crossover and mutation. Specifically, we first adaptively search for the optimal structure-level neighborhoods for nodes within the solution space, leveraging the exploratory capabilities of evolutionary computation. This approach enhances the exchange of information between the target node and surrounding nodes, achieving a smooth vector representation. Subsequently, we adopt the optimal structure obtained through evolutionary search to perform label smoothing, further boosting the robustness of the framework. We conduct experiments on nine real-world networks with different homophily ratios, where outstanding performance demonstrates that the ability of EGNN can match or surpass SOTA baselines.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5852-5865"},"PeriodicalIF":10.4,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145051055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Precise Bayes Regression: Approaching Optimality, Using Multi-Dimensional Space Partitioning Trees","authors":"Amin Vahedian","doi":"10.1109/TKDE.2025.3592074","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3592074","url":null,"abstract":"The Conditional Expectation Function (CEF) is an optimal estimator in real space. Artificial Neural Networks (ANN), as the current state-of-the-art method, lack interpretability. Estimating CEF offers a path to achieve both accuracy and interpretability. Previous attempts to estimate CEF rely on limiting assumptions such as independence and distributional form or perform the expensive nearest neighbor search. We propose Dynamically Ordered Precise Bayes Regression (DO-PBR), a novel method to estimate CEF in discrete space. We prove DO-PBR approaches optimality with increasing number of samples. DO-PBR dynamically learns importance rankings for the predictors, which are region-specific, allowing the importance of a predictor vary across the space. DO-PBR is fully interpretable and makes no assumptions on independence or the distributional form, while requiring minimal parameter setting. In addition, DO-PBR avoids the costly nearest-neighbor search, by using a hierarchy of binary trees. Our experiments confirm our theoretical claims on approaching optimality and show that DO-PBR achieves substantially higher accuracy compared to ANN, when given the same amount of time. Our experiments show that on average, ANN takes 32 times longer to achieve the same level of accuracy as DO-PBR.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"6107-6119"},"PeriodicalIF":10.4,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145036801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IGES-RCI: Improved Greedy Equivalence Search and Recursive Causal Inference for Industrial Equipment Failure Prediction","authors":"Xu Zhao;Weibing Wan;Zhijun Fang","doi":"10.1109/TKDE.2025.3591827","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3591827","url":null,"abstract":"Predicting equipment failures plays a pivotal role in minimizing maintenance costs and boosting production efficiency within the industrial sector. This paper introduces a novel approach that integrates Causal Inference with predictive modeling to enhance prediction accuracy, tackling key challenges such as noise interference, insufficient causal validation, and missing data. We first validate the causal connections identified by the Greedy Equivalence Search algorithm using conditional mutual information to strengthen the reliability of the causal graph. An information bottleneck strategy is then employed to isolate essential causal features, effectively filtering out irrelevant noise and refining the causal structure. Crucially, in the actual prediction phase, we propose a recursive causal inference-based imputation method to handle missing data, leveraging the causal graph to iteratively infer and fill gaps, thereby improving data completeness and prediction accuracy. Experimental results demonstrate that the proposed method significantly outperforms existing approaches, exhibiting superior accuracy and robustness in managing complex industrial datasets.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"5983-5993"},"PeriodicalIF":10.4,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145051056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to Discriminate While Contrasting: Combating False Negative Pairs With Coupled Contrastive Learning for Incomplete Multi-View Clustering","authors":"Yu Ding;Katsuya Hotta;Chunzhi Gu;Ao Li;Jun Yu;Chao Zhang","doi":"10.1109/TKDE.2025.3592126","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3592126","url":null,"abstract":"The task of incomplete multi-view clustering (IMvC) aims to partition multi-view data with a lack of completeness into different clusters. The incompleteness can be typically categorized into the case of instance-missing and view-unaligned MvC. However, prior methods either consider each of them or struggle to pursue consistent latent representations among views. In this paper, we propose two forms of contrastive learning paradigms to jointly handle both cases for IMvC. Specifically, we design an instance-oriented contrastive (IOC) learning strategy to achieve intra-class consistency. As negative samples within different datasets can exhibit diverse distributions, we formulate a parameterized boundary for IOC learning to flexibly deal with such differing data modes. To preserve inter-view consistency, we further devise category-oriented contrastive (COC) learning such that data from different views can be seamlessly integrated into a combined semantic space. We also recover the missing instances with the learned latent representations in a reconstructing manner for realigning the incomplete multi-view data to facilitate clustering. Our approach unifies the solution to both incomplete cases into one formulation. To demonstrate the effectiveness of our model, we conduct four types of MvC tasks on six benchmark multi-view datasets and compare our method against state-of-the-art IMvC methods. Extensive experiments show that our method achieves state-of-the-art performance, quantitatively and qualitatively.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 10","pages":"6046-6060"},"PeriodicalIF":10.4,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145073230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}