{"title":"Training-Free Graph-Based Imputation of Missing Modalities in Multimodal Recommendation","authors":"Daniele Malitesta;Emanuele Rossi;Claudio Pomo;Tommaso Di Noia;Fragkiskos D. Malliaros","doi":"10.1109/TKDE.2026.3667005","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3667005","url":null,"abstract":"Multimodal recommender systems (RSs) represent items in the catalog through multimodal data (e.g., product images and descriptions) that, in some cases, might be noisy or (even worse) missing. In those scenarios, the common practice is to drop items with missing modalities and train the multimodal RSs on a subsample of the original dataset. To date, the problem of missing modalities in multimodal recommendation has received limited attention in the literature and lacks a precise formalisation of the kind available for missing information in traditional machine learning. In this work, we first provide a problem formalisation for missing modalities in multimodal recommendation. Second, by leveraging the user-item graph structure, we re-cast the problem of missing multimodal information as a problem of graph feature interpolation on the item-item co-purchase graph. On this basis, we propose four training-free approaches that propagate the available multimodal features throughout the item-item graph to impute the missing features. Extensive experiments on popular multimodal recommendation datasets demonstrate that our solutions can be seamlessly plugged into any existing multimodal RS and benchmarking framework while still preserving (or even widening) the performance gap between multimodal and traditional RSs. Moreover, we show that our graph-based techniques can perform better than traditional imputations in machine learning under different missing modalities settings. 
Finally, we analyse (for the first time in multimodal RSs) how feature homophily calculated on the item-item graph can influence our graph-based imputations.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3250-3263"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TrashToTreasure: An Informative and Interactive Multi-View Classification Framework","authors":"Guoqing Chao;Mingjie Zhang;Xiru Wang;Jie Wen;Weiping Ding;Dianhui Chu","doi":"10.1109/TKDE.2026.3676286","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3676286","url":null,"abstract":"As a basic machine learning task, Multi-View Classification (MVC) has garnered considerable attention and achieved great success. However, existing MVC methods, especially late-fusion ones, still suffer from two problems: 1) hidden valuable information is not well exploited; 2) there is a lack of interaction before decision making. To address these problems, we propose a novel framework named “TrashToTreasure” that leverages mutual information to effectively exploit hidden valuable information. Specifically, the framework explicitly disentangles multi-view information into “useful” components and “trash” (noisy) components, and further extracts potentially valuable “treasure” information from the “trash” components of all views. Additionally, we design a tailored objective function that facilitates the effective separation of “useful” and “trash” components, as well as the synergistic extraction of “treasure” information. This function guides model optimization through triple mutual information constraints. Experimental results on synthetic data and several real-world data sets verify the effectiveness and superiority of the proposed method. 
The fresh perspective offered by this article may inspire more interesting exploration in this direction.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3264-3276"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncertainty-Aware Online Time Series Multi-Step Forecasting Framework in Cloud Systems","authors":"Jiadong Chen;Yang Luo;Xiuqi Huang;Fuxin Jiang;Yangguang Shi;Tieying Zhang;Xiaofeng Gao","doi":"10.1109/TKDE.2026.3674583","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3674583","url":null,"abstract":"Accurate resource planning in large-scale systems relies on reliable predictions of future workloads, a task inherently challenged by their variability and dynamism. Previous prediction methods are either ineffective at dealing with the changing dynamics of the series, or are black boxes that do not admit effective theoretical analysis. To address these issues, we design an effective ensemble framework, Interval Prediction with Online Chasing (<b>IPOC</b>), tailored for multi-step interval forecasting in real-time systems. Theoretically, by formulating the task as a Dynamic Deterministic Markov Decision Process (Dd-MDP), an advanced theoretical framework is introduced to analyze problem solvability and derive conditions for the existence of feasible solutions. Incorporating the proposed Adaptive Copula Conformal Inference (ACCI) module and a well-designed Chasing Oracle, <b>IPOC</b> captures the changing dynamics and temporal dependencies to enable multi-step forecasting. We organically integrate advanced online learning theories with time series forecasting tasks to construct a forecasting framework that is both theoretically rigorous and practically effective. Theoretical analysis underpins <b>IPOC</b>’s effectiveness, demonstrating sublinear regret and adherence to confidence interval specifications. The chasing regret of the Chasing Oracle is <inline-formula><tex-math>$O(L_{c})$</tex-math></inline-formula>, and the overall regret of <b>IPOC</b> is <inline-formula><tex-math>$O(\\sqrt{L_{c} T \\log |\\mathcal{F}|})$</tex-math></inline-formula>. 
Empirically, <b>IPOC</b> is validated through extensive experiments on five real-world datasets, including public datasets and different types of workload collected from Bytedance Cloud, with comparisons to 25 baselines and 4 forecasting horizons (1/5/10/30). Specifically, <b>IPOC</b> achieves an average reduction of over 20% in RMSE/MAE/SMAPE/<inline-formula><tex-math>$\\rho$</tex-math></inline-formula>-risk compared to baselines across five datasets. Besides, we apply our model to a case study on predictive auto-scaling tasks in actual large-scale cloud systems to validate its utility.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3277-3290"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Which Data Harms My Regression Model: Enhancing Model Performance on Low-Quality Data Through Fast Data Attribution","authors":"Qingkai Sui;Yalin Wang;Chenliang Liu;Diju Liu;Xiaofang Chen;Yongfang Xie","doi":"10.1109/TKDE.2026.3675903","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3675903","url":null,"abstract":"With the rapid advancement of model architectures, the accuracy of industrial predictive modeling now largely hinges on data quality. However, real-world industrial datasets frequently contain low-quality samples that compromise model performance. While existing data preprocessing methods can effectively remove salient outliers, they persistently struggle to detect latent anomalies. To address this challenge, this paper proposes a fast data attribution-based dataset selection method for regression models, termed <inline-formula><tex-math>${\\mathrm{F{\\scriptscriptstyle AST}}DAR}$</tex-math></inline-formula>, which enables the model to identify training samples that are detrimental to its performance and subsequently perform dataset selection. <inline-formula><tex-math>${\\mathrm{F{\\scriptscriptstyle AST}}DAR}$</tex-math></inline-formula> integrates deep network data attribution into the Leave-One-Out (LOO) influence calculation paradigm of linear regression models through model linearization and parameter dimensionality reduction. Considering the synergy among samples, the truncated Monte Carlo method is adopted to estimate marginal influences of each sample, and sample utility is defined for dataset selection. Validation on real-world industrial datasets demonstrates the effectiveness and practicality of our method. 
Experimental results show that models trained on <inline-formula><tex-math>${\\mathrm{F{\\scriptscriptstyle AST}}DAR}$</tex-math></inline-formula>-selected data achieve significant performance improvements on both validation and test sets, outperforming multiple baseline methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3321-3334"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Learning Shift-Invariant Representations for Healthcare Series Classification","authors":"Yuanbo Liu;Xiucheng Li;Xinyang Chen;Hongwei Liu;Zhijun Li","doi":"10.1109/TKDE.2026.3667978","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3667978","url":null,"abstract":"Accurate classification of healthcare time series is critical for clinical decision-making. However, existing models often struggle under real-world data shifts and lack interpretability, two key requirements for reliable medical deployment. To address these challenges, we propose <b>SHINE</b>, a novel end-to-end framework that learns disentangled and shift-invariant representations by modeling the generative process of multivariate healthcare signals. Specifically, SHINE first introduces a genuine data representation learning that disentangles healthcare signals into trend, seasonality, and noise components, reflecting distinct temporal dynamics of healthcare series. Then, we inject several inductive biases into each component to encourage latent representations to be invariant to data shifts and aligned with their corresponding semantic units. 
Extensive experiments on six healthcare benchmarks spanning ECG, EEG, and continuous glucose monitoring (CGM) domains—under a variety of simulated real-world shift scenarios—demonstrate that SHINE consistently outperforms state-of-the-art baselines, providing robust performance and clinically meaningful interpretations grounded in the estimated components.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3222-3233"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VMPQ: An Efficient Protocol for Privacy-Preserving and Verifiable Multi-Predicate Queries Over Time-Series Databases","authors":"Xuan Jing;Fei Xiao;Jianfeng Wang","doi":"10.1109/TKDE.2026.3665631","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3665631","url":null,"abstract":"With the widespread adoption of cloud storage, time-series databases have become indispensable for managing and analyzing sequential data generated on the user side over time (i.e., time-series data), thereby alleviating the computational and storage burden on resource-constrained users. However, critical security and privacy challenges—such as query privacy leakage, data exposure, and threats to storage integrity—remain inadequately addressed by existing solutions. To this end, we propose VMPQ, an efficient protocol for privacy-preserving and verifiable multi-predicate queries over time-series databases. Specifically, we introduce a new cryptographic primitive, verifiable offline/online private information retrieval (V-OO-PIR), which supports sublinear retrieval complexity while simultaneously ensuring both query privacy and result verifiability against untrusted servers. Building on V-OO-PIR, we design a dual-layer security framework that integrates replicated secret sharing (RSS) and secure multiparty computation (MPC): 1) RSS splits time-series data into two shares stored across two non-colluding servers, ensuring data confidentiality and mitigating exposure risks, and 2) MPC performs secure multiplication directly on these shares, enabling efficient evaluation of multi-predicate queries without reconstructing the original data. As a result, VMPQ ensures query privacy by preventing servers from inferring user interests across multiple predicates, while simultaneously guaranteeing data confidentiality and the verifiability of query results. Theoretical analysis confirms the security of VMPQ against malicious adversaries. 
Experimental results demonstrate that VMPQ reduces query latency by up to 5× compared to the state-of-the-art solution Waldo, while also enhancing throughput and preserving high storage efficiency through optimized database encoding.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3306-3320"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training-Free and Unbiased Graph Collaborative Filtering for Personalized Recommendations","authors":"Ziyang Liu;Chaokun Wang;Cheng Wu;Leqi Zheng;Hao Feng;Hang Zhang","doi":"10.1109/TKDE.2026.3669816","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3669816","url":null,"abstract":"With the widespread adoption of collaborative filtering techniques for personalized recommendations, exposure bias has become a significant challenge. <i>Exposure bias</i> refers to the tendency of recommendation models to disproportionately favor items with high exposure over those with low exposure. In graph collaborative filtering that uses graph neural networks (GNNs) for recommendations, exposure bias can be exacerbated due to 1) the reliance on positive feedback during graph construction and 2) the effects of the neighbor aggregation step in GNNs. To tackle this challenge, we propose a novel and efficient framework called FUGCF (training-<b>F</b>ree and <b>U</b>nbiased <b>G</b>raph <b>C</b>ollaborative <b>F</b>iltering) to improve both the accuracy and bias mitigation of graph-based personalized recommendations. FUGCF employs a two-stage calculation strategy: it estimates exposure probabilities in the first stage and then leverages these exposure probabilities to help derive debiased node embeddings in the second stage. Furthermore, we design a training-free estimation method for FUGCF based on closed-form solutions to enhance its computation efficiency. 
The extensive experiments on a synthetic dataset and three real-world datasets demonstrate the effectiveness of FUGCF in reducing exposure bias, improving recommendation accuracy, and optimizing computation efficiency.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3234-3249"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling Densest Multilayer Subgraphs via Greedy Peeling","authors":"Dandan Liu;Zhaonian Zou;Run-An Wang","doi":"10.1109/TKDE.2026.3668969","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3668969","url":null,"abstract":"The densest subgraphs in multilayer (ML) graphs unveil intricate relationships that are missed by simple graph representations, offering profound insights and applications across diverse domains. In this paper, we present a layer-oriented view of existing density measures for ML graphs and highlight their problems in identifying the densest subgraphs under the layer-oriented densities, including inefficiency, poor approximation ratios, and the lack of a unified algorithmic framework. In light of this, we introduce a new family of vertex-oriented density measures called generalized density. The two parameters <inline-formula><tex-math>$q$</tex-math></inline-formula> and <inline-formula><tex-math>$p$</tex-math></inline-formula> allow the generalized density to flexibly adjust its focus in the density evaluation. We investigate the problem of finding the ML subgraph that maximizes the generalized density and show that the problem can be solved using a unified greedy vertex peeling framework with strong approximation guarantees for half of the <inline-formula><tex-math>$(q, p)$</tex-math></inline-formula> parameter space. Specifically, for four regimes of <inline-formula><tex-math>$(q, p)$</tex-math></inline-formula>, we design tailored vertex-peeling strategies that lead to approximation algorithms with provable approximation ratios and precise time complexity bounds. We also develop a highly efficient implementation that reduces the execution time of greedy peeling to near-linear time for two of the four explored regimes of <inline-formula><tex-math>$(q, p)$</tex-math></inline-formula>. 
Extensive experiments on ten real-world ML graphs reveal that our generalized density and greedy peeling algorithms can effectively uncover different types of dense ML subgraphs in large-scale ML graphs.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 5","pages":"3291-3305"},"PeriodicalIF":10.4,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"2025 Reviewers List","authors":"","doi":"10.1109/TKDE.2026.3652658","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3652658","url":null,"abstract":"","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 3","pages":"2108-2121"},"PeriodicalIF":10.4,"publicationDate":"2026-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11395241","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146162185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XiYan-SQL: A Novel Multi-Generator Framework for Text-to-SQL","authors":"Yifu Liu;Yin Zhu;Yingqi Gao;Zhiling Luo;Xiaoxia Li;Xiaorong Shi;Yuntao Hong;Jinyang Gao;Yu Li;Bolin Ding;Jingren Zhou","doi":"10.1109/TKDE.2026.3657851","DOIUrl":"https://doi.org/10.1109/TKDE.2026.3657851","url":null,"abstract":"To leverage the advantages of LLMs in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework that effectively generates and utilizes multiple SQL candidates. It consists of three components: 1) a Schema Filter module that filters and obtains multiple relevant schemas; 2) a multi-generator ensemble approach that generates multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. 
It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"38 4","pages":"2474-2487"},"PeriodicalIF":10.4,"publicationDate":"2026-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147374362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}