{"title":"Discovering Cliques in Attribute Graphs Based on Proportional Fairness","authors":"Yongye Li;Renjie Sun;Chen Chen;Xiaoyang Wang;Ying Zhang;Wenjie Zhang","doi":"10.1109/TKDE.2025.3559994","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3559994","url":null,"abstract":"Community detection is a fundamental problem and has been extensively studied. With the abundance of information in real-world networks, the discovery of communities in attribute graphs is increasingly valuable. However, numerous previous models in attribute graphs neglect the fairness concept, which plays an important role in ensuring that graph analysis is not biased toward specific groups. In this paper, we propose a novel model, named proportional fair clique (PFC). Specifically, given an attribute graph <inline-formula><tex-math>$G=(V,E,A)$</tex-math></inline-formula>, an integer <inline-formula><tex-math>$k$</tex-math></inline-formula> and a threshold <inline-formula><tex-math>$\\lambda \\in [0,1/|A|]$</tex-math></inline-formula>, a subgraph <inline-formula><tex-math>$S$</tex-math></inline-formula> of <inline-formula><tex-math>$G$</tex-math></inline-formula> is a PFC if <inline-formula><tex-math>$(i)$</tex-math></inline-formula> <inline-formula><tex-math>$S$</tex-math></inline-formula> is a clique with size at least <inline-formula><tex-math>$k$</tex-math></inline-formula> and <inline-formula><tex-math>$(ii)$</tex-math></inline-formula> <inline-formula><tex-math>$|S_{a_{i}}|/|S| \\geq \\lambda$</tex-math></inline-formula> for each attribute <inline-formula><tex-math>$a_{i}$</tex-math></inline-formula> in <inline-formula><tex-math>$G$</tex-math></inline-formula>, where <inline-formula><tex-math>$S_{a_{i}}$</tex-math></inline-formula> is the node set in <inline-formula><tex-math>$S$</tex-math></inline-formula> associated with attribute <inline-formula><tex-math>$a_{i}$</tex-math></inline-formula>. We show that the problem of enumerating all the maximal proportional fair cliques (MPFC) is NP-hard. 
A reasonable baseline algorithm is first presented by extending the Bron-Kerbosch framework. To scale for large networks, we propose several optimization strategies to accelerate the computation. Finally, comprehensive experiments are conducted over 6 graphs to demonstrate the efficiency and effectiveness of the proposed techniques and model.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4003-4009"},"PeriodicalIF":8.9,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Load-Balanced Full-Graph GNN Training on Multiple GPUs","authors":"Qiange Wang;Yao Chen;Weng-Fai Wong;Bingsheng He","doi":"10.1109/TKDE.2025.3558641","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3558641","url":null,"abstract":"While full-graph training is effective for graph learning, it typically demands substantial memory resources. Existing multi-GPU training frameworks struggle with scalability because they require retaining data for each layer within GPU memory. In this work, we present <inline-formula><tex-math>$\\mathsf{HongTu}$</tex-math></inline-formula>, a memory-efficient system that supports out-of-memory full-graph GNN training on GPUs. <inline-formula><tex-math>$\\mathsf{HongTu}$</tex-math></inline-formula> offloads vertex data to CPU memory and employs partition parallelism training that splits and assigns large graphs to multiple GPUs. To reduce runtime memory consumption with optimal performance, <inline-formula><tex-math>$\\mathsf{HongTu}$</tex-math></inline-formula> utilizes a hybrid solution combining recomputation, caching, and computation-reordering, enabling efficient layer-wise intermediate data management. To address the increased communication caused by duplicated neighbor access among partitions, <inline-formula><tex-math>$\\mathsf{HongTu}$</tex-math></inline-formula> employs a deduplicated communication framework that converts host-GPU transfers into more efficient inter/intra-GPU data access. Additionally, <inline-formula><tex-math>$\\mathsf{HongTu}$</tex-math></inline-formula> tackles the load-imbalance issues in out-of-memory full-graph training, featuring a multi-objective graph partition algorithm that balances memory consumption and data transfer and maximizes the effectiveness of communication deduplication. 
Experiments on a 4× A100 GPU server show that <inline-formula><tex-math>$\\mathsf{HongTu}$</tex-math></inline-formula> can effectively train graphs with billion edges while reducing host-GPU data communication by 25% to 71%. Compared to the full-graph GNN system running on 16 CPU nodes, <inline-formula><tex-math>$\\mathsf{HongTu}$</tex-math></inline-formula> achieves speedups ranging from 11.4× to 21.3×.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4239-4253"},"PeriodicalIF":8.9,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144232023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tricolore: Multi-Behavior User Profiling for Enhanced Candidate Generation in Recommender Systems","authors":"Xiao Zhou;Zhongxiang Zhao;Hanze Guo","doi":"10.1109/TKDE.2025.3558503","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3558503","url":null,"abstract":"Online platforms aggregate extensive user feedback across diverse behaviors, providing a rich source for enhancing user engagement. Traditional recommender systems, however, typically optimize for a single target behavior and represent user preferences with a single vector, limiting their ability to handle multiple important behaviors or optimization objectives. This conventional approach also struggles to capture the full spectrum of user interests, resulting in a narrow item pool during candidate generation. To address these limitations, we present <italic>Tricolore</italic>, a versatile multi-vector learning framework that uncovers connections between different behavior types for more robust candidate generation. <italic>Tricolore</italic>'s adaptive multi-task structure is also customizable to specific platform needs. To manage the variability in sparsity across behavior types, we incorporate a behavior-wise multi-view fusion module that dynamically enhances learning. Moreover, a popularity-balanced strategy ensures the recommendation list balances accuracy with item popularity, fostering diversity and improving overall performance. Extensive experiments on public datasets demonstrate <italic>Tricolore</italic>'s effectiveness across various recommendation scenarios, from short video platforms to e-commerce. 
By leveraging a shared base embedding strategy, <italic>Tricolore</italic> also significantly improves the performance for cold-start users.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4349-4360"},"PeriodicalIF":8.9,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144232034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-Scale Clustering With Anchor-Based Constrained Laplacian Rank","authors":"Zhenyu Ma;Jingyu Wang;Feiping Nie;Xuelong Li","doi":"10.1109/TKDE.2025.3557718","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3557718","url":null,"abstract":"Graph-based clustering technique has garnered significant attention due to precise information characterization by pairwise graph similarity. Nevertheless, the post-processing step in traditional methods often limits clustering effects because of crucial information loss. Therefore, the Constrained Laplacian Rank (CLR) theory emerges to directly obtain discrete labels from optimally structural graph, achieving desirable outcomes. However, CLR suffers from substantial time overhead, making it infeasible for large-scale data analysis. To overcome this issue, we propose Anchor-based CLR (ACLR), a simple yet effective method for efficient large-scale clustering. The ACLR method comprises four stages: (1) anchors that roughly cover original data are opted to prepare bipartite graph construction; (2) a novel two-step probability transition (TSPT) strategy initializes a small-scale graph with random walk probability among anchors; (3) the main ACLR model alternately optimizes the graph connected structure and directly produces discrete anchor labels, achieving a time complexity independent of the number of samples due to dramatically reduced graph scale; and (4) labels are propagated from anchors to samples using <inline-formula><tex-math>$K$</tex-math></inline-formula>-NN algorithm. 
Extensive experiments demonstrate that ACLR yields superior accuracy and efficiency, particularly when applied to large-scale data.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4144-4158"},"PeriodicalIF":8.9,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncertainty Calibration for Counterfactual Propensity Estimation in Recommendation","authors":"Wenbo Hu;Xin Sun;Qiang Liu;Le Wu;Liang Wang","doi":"10.1109/TKDE.2025.3552658","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3552658","url":null,"abstract":"Post-click conversion rate (CVR) is a reliable indicator of online customers’ preferences, making it crucial for developing recommender systems. A major challenge in predicting CVR is severe selection bias, arising from users’ inherent self-selection behavior and the system’s item selection process. To mitigate this issue, the inverse propensity score (IPS) is employed to weight the prediction error of each observed instance. However, current propensity score estimations are unreliable due to the lack of a quality measure. To address this, we evaluate the quality of propensity scores from the perspective of uncertainty calibration, proposing the use of Expected Calibration Error (ECE) as a measure of propensity-score quality, which quantifies the extent to which predicted probabilities are overconfident by assessing the difference between predicted probabilities and actual observed frequencies. Miscalibrated propensity scores can lead to distorted IPS weights, thereby compromising the debiasing process in CVR prediction. In this paper, we introduce a model-agnostic calibration framework for propensity-based debiasing of CVR predictions. Theoretical analysis on bias and generalization bounds demonstrates the superiority of calibrated propensity estimates over uncalibrated ones. 
Experiments conducted on the Coat, Yahoo and KuaiRand datasets show improved uncertainty calibration, as evidenced by lower ECE values, leading to enhanced CVR prediction outcomes.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3781-3793"},"PeriodicalIF":8.9,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Data-Level Augmentation Framework for Time Series Forecasting With Ambiguously Related Source Data","authors":"Rui Ye;Qun Dai","doi":"10.1109/TKDE.2025.3555530","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3555530","url":null,"abstract":"Many practical time series forecasting (TSF) tasks are plagued by data limitations. To alleviate this challenge, we design a data-level augmentation framework. It involves a time series generation (TSG) module and a source data selection (Sel-src) module. TSG aims to achieve better generation results by considering both the global profile and temporal dynamics of series. However, when only few target data is available, TSG module may tend to simulate the limited target samples, leading to poor generalization performance. A natural idea for this problem is to seek help from related source domain, which can provide additional useful information for TSG module. Here we consider a more complex situation, where the relevance between source and target domains is ambiguous. That is, irrelevant samples may exist in the source domain. Blindly using all the source data may lead to counterproductive results. To meet this challenge, Sel-src module is designed to select effective source samples by Inter-Representation Learning (Inter-RL) and Intra-Representation Learning (Intra-RL). 
Effectiveness of this algorithm is underpinned from two aspects: the quality of the augmented data and the accuracy improvement upon the augmentation.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"3855-3868"},"PeriodicalIF":8.9,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Transactional Stream Processing on Multicore Processors","authors":"Jianjun Zhao;Yancan Mao;Zhonghao Yang;Haikun Liu;Shuhao Zhang","doi":"10.1109/TKDE.2025.3556741","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3556741","url":null,"abstract":"Transactional stream processing engines (TSPEs) are central to modern stream applications handling shared mutable states. However, their full potential, particularly in adaptive scheduling, remains largely unexplored. We present <italic>MorphStream</italic>, a TSPE designed to optimize parallelism and performance for transactional stream processing on multicores. Through a unique three-stage execution paradigm (i.e., <italic>planning</italic>, <italic>scheduling</italic>, and <italic>execution</italic>), <italic>MorphStream</italic> enables adaptive scheduling under varying workload characteristics. Building on this foundation, <italic>MorphStream</italic> is further enhanced with support for non-deterministic state access, employing a stateful task precedence graph to handle undefined read/write sets at runtime while guaranteeing transaction semantics. Additionally, <italic>MorphStream</italic> incorporates a generalized framework for managing window-based operations, enabling efficient tracking and maintenance of overlapping windows using multi-versioned state management. These extensions enhance the system’s ability to process dynamic and irregular workloads. 
Experimental results demonstrate up to 3.4 times higher throughput and 69.1% lower latency compared to state-of-the-art TSPEs, validating its scalability and adaptability in real-world streaming scenarios.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4254-4269"},"PeriodicalIF":8.9,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10949743","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144229476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disentangling Dynamics: Advanced, Scalable and Explainable Imputation for Multivariate Time Series","authors":"Shuai Liu;Xiucheng Li;Yile Chen;Yue Jiang;Gao Cong","doi":"10.1109/TKDE.2025.3558405","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3558405","url":null,"abstract":"Missing values pose a formidable obstacle in multivariate time series analysis. Existing imputation methods rely on entangled representations that struggle to simultaneously capture multiple orthogonal time-series patterns, leading to suboptimal performance and limited interpretability. Meanwhile, requiring the entire data span as input renders these models impractical for long time series. To address these issues, we propose <inline-formula><tex-math>$\\mathsf{TIDER}$</tex-math></inline-formula> and its enhanced version, <inline-formula><tex-math>$\\mathsf{AdaTIDER}$</tex-math></inline-formula>. <inline-formula><tex-math>$\\mathsf{TIDER}$</tex-math></inline-formula> employs low-rank matrix factorization and disentangled temporal representations to model intricate dynamics like trend, seasonality, and local bias. However, <inline-formula><tex-math>$\\mathsf{TIDER}$</tex-math></inline-formula> is limited to single-period modeling and does not explicitly capture dependencies between channels. To overcome these limitations, <inline-formula><tex-math>$\\mathsf{AdaTIDER}$</tex-math></inline-formula> incorporates adaptive cross-channel dependency modeling and multi-period seasonality representations. These advancements enable it to dynamically capture variable relationships and complex multi-period patterns, significantly enhancing imputation accuracy and interpretability, while maintaining <inline-formula><tex-math>$\\mathsf{TIDER}$</tex-math></inline-formula>’s scalability. 
Extensive experiments on real-world datasets validate the superiority of our models in imputation accuracy, scalability, interpretability, and robustness.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4010-4022"},"PeriodicalIF":8.9,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large Language Models are in-Context Molecule Learners","authors":"Jiatong Li;Wei Liu;Zhihao Ding;Wenqi Fan;Yuqiang Li;Qing Li","doi":"10.1109/TKDE.2025.3557697","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3557697","url":null,"abstract":"Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods in adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To resolve the challenges, we propose <bold>I</bold>n-<bold>C</bold>ontext <bold>M</bold>olecule <bold>A</bold>daptation (<bold>ICMA</bold>), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning. Initially, Hybrid Context Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar informative context examples. Additionally, Post-retrieval Re-ranking is composed of Sequence Reversal and Random Walk selection to further improve the quality of retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for better alignment between molecules and texts. 
Experimental results demonstrate that ICMA can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures, showing that LLMs are inherently in-context molecule learners.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4131-4143"},"PeriodicalIF":8.9,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FairCoRe: Fairness-Aware Recommendation Through Counterfactual Representation Learning","authors":"Chenzhong Bin;Wenqiang Liu;Feng Zhang;Liang Chang;Tianlong Gu","doi":"10.1109/TKDE.2025.3557501","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3557501","url":null,"abstract":"Eliminating bias from data representations is crucial to ensure fairness in recommendation. Existing studies primarily focus on weakening the correlation between data representations and sensitive attributes, yet may inadvertently steer the user representations toward another potential bias direction of the target attribute. Furthermore, they often overlook the impact of user preferences on capturing sensitive information, incurring inadequate bias elimination. In this paper, we propose a <bold>Fair</bold> <bold>Co</bold>unterfactual <bold>Re</bold>presentations (<bold>FairCoRe</bold>) learning framework, which aims to ensure the neutrality of representations among all bias directions. First, we intervene on sensitive attributes to construct a counterfactual scenario. Then, two opposing attribute prediction tasks are respectively performed in ground-truth and counterfactual scenarios to encode sensitive information along different bias directions. Second, we design a bias-aware enhancement learning method that quantifies the respective correlation of user preferences and sensitive attributes to enhance sensitive information encoding. Finally, we introduce two mutual information optimization methods that optimize the representations to capture users’ interests and disentangle sensitive factors. Moreover, we propose an attribute neutralization strategy that refines the learned representations, ensuring sensitive attribute neutrality. 
Extensive experiments demonstrate that our method achieves the optimal fairness and competitive accuracy compared to state-of-the-art methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4049-4062"},"PeriodicalIF":8.9,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}