{"title":"Behavior Habits Enhanced Intention Learning for Session Based Recommendation","authors":"Zhida Qin;Wenhao Xue;Haotian He;Haoyao Zhang;Shixiao Yang;Enjun Du;Tianyu Huang;John C.S. Lui","doi":"10.1109/TBDATA.2025.3618463","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3618463","url":null,"abstract":"Multi-behavior Session Based Recommendations (MBSBRs) have achieved remarkable results due to considering behavioral heterogeneity in sessions. Yet most existing works only consider binary or continuous behavior dependencies and aim to predict the next item under the target behavior, neglecting users’ inherent behavior habits, resulting in learning inaccurate intentions. To tackle the above issues, we propose a novel Behavior Habits Enhanced Intention Learning framework for Session Based Recommendation (BHSBR). Specifically, we focus on the next item recommendation and design a global item transition graph to learn the behavior-aware semantic relationships between items, in order to mine the underlying similarity between items beyond the session. In addition, we construct a hypergraph to extract the diverse behavior habits of users and break through the limitations of temporal relationships in the session. Compared to the existing works, our behavior habit learning method learns behavior dependencies at the user level, which could capture the user’s more accurate long-term intentions and reduce the impact of noise behaviors. Extensive experiments on three datasets demonstrate that the performance of our proposed BHSBR is superior to SOTA. 
Further ablation experiments fully illustrate the effectiveness of our various modules.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"236-248"},"PeriodicalIF":5.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Core Decomposition of Temporal Graphs","authors":"Wen Bai;Yufeng Wang;Yuncheng Jiang;Di Wu","doi":"10.1109/TBDATA.2025.3618443","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3618443","url":null,"abstract":"To underscore the significance of the interactive frequency among diverse vertices in each snapshot, prior research has extended the $k$-core of general graphs to the $(k,h)$-core of temporal graphs, in which each vertex has at least $k$ neighbors and is connected by at least $h$ edges to each of these neighbors. Due to the numerous combinations of $k$ and $h$, the quantity of $(k,h)$-cores is substantial, which necessitates considerable time and space for querying and decomposition. As a temporal graph evolves, for instance, with edges being inserted or removed from the previous snapshot, the affected $(k,h)$-cores must also be updated to reflect the latest structure. To address these challenges, we initially develop a novel $(k,h)$-core storage index that exhibits excellent query performance while consuming linear space regarding the graph size. Subsequently, we design an efficient decomposition algorithm to extract $(k,h)$-cores from a snapshot. Following this, we offer two maintenance algorithms to manage temporal graph evolution. Finally, we validate the effectiveness of our proposed methods on actual temporal graphs. 
Experimental results indicate that our methods surpass existing techniques by two orders of magnitude.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"212-223"},"PeriodicalIF":5.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Recommendations With Knowledge-Guided Interest Contrast","authors":"Meng Jian;Ruoxi Li;Yulong Bai;Ge Shi","doi":"10.1109/TBDATA.2025.3618449","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3618449","url":null,"abstract":"In the digital age, the overwhelming amount of information necessitates advanced recommendation systems to deliver personalized content. However, these systems face significant challenges, such as sparse user-item interactions and long-tail bias. Recent studies apply structural learning or self-supervised learning to the interaction graph, which helps alleviate these problems, but the interaction data itself may be far too little to solve them. While knowledge graphs (KGs) offer a promising solution by providing semantic depth to recommendations, their integration often introduces noise from redundant knowledge. Addressing these critical gaps, this study proposes a knowledge-guided interest contrast (KGIC) model to enhance recommendations, which innovatively harmonizes collaborative filtering with semantic insights from the KG. The KGIC model introduces three key innovations: (1) a knowledge filtering mechanism that selectively leverages interest-relevant signals from the knowledge graph to encode interest and avoid redundant knowledge interference; (2) an adaptive graph augmentation strategy that enhances the interaction graph based on semantic-aware interest propagation and interaction intensity estimation; and (3) a self-supervised contrastive learning task that mitigates long-tail bias and sparsity issues by homogenizing the embedding distribution between augmented views. 
The extensive evaluation reveals the superiority of KGIC with knowledge filtering and graph augmentation for recommendation.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"200-211"},"PeriodicalIF":5.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLIP2LE: A Label Enhancement Fair Representation Method via CLIP","authors":"Pu Wang;YinSong Xiong;Zhuoran Zheng","doi":"10.1109/TBDATA.2025.3618450","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3618450","url":null,"abstract":"Label enhancement is a novel label shift strategy that aims to integrate the feature space with the logical label space to obtain a high-quality label distribution. This label distribution can serve as a soft target for algorithmic learning, akin to label smoothing, thereby enhancing the performance of various learning paradigms including multi-label learning, single positive multi-label learning, and partial-label learning. However, limited by dataset type and annotation inaccuracy, the same label enhancement algorithm on different datasets struggles to achieve consistent performance, for reasons derived from the following two insights: 1) Differential Contribution of Feature Space and Logical Label Space: The feature space and logical label space of different datasets contribute differently to generating an accurate label distribution; 2) Presence of Noise and Incorrect Labels: Some datasets contain noise and inaccurately labeled samples, leading to divergent outputs for similar inputs. To address these challenges, we propose leveraging CLIP (Contrastive Language-Image Pre-training) as a foundational strategy, treating the feature space and the logical label space as two distinct modalities. By recoding these modalities before applying the label enhancement algorithm, we aim to achieve a fair and robust representation. In addition, we further explain the rationale behind our motivation in the discussion section. 
Extensive experimental results demonstrate the effectiveness of our approach to help existing label enhancement algorithms improve their performance on several benchmarks.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"224-235"},"PeriodicalIF":5.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guest Editorial Special Issue on Federated Learning for Big Data Applications","authors":"Xiaowen Chu;Wei Wang;Cong Wang;Yang Liu;Rongfei Zeng;Christopher G. Brinton","doi":"10.1109/TBDATA.2024.3417057","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3417057","url":null,"abstract":"","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 5","pages":"2099-2101"},"PeriodicalIF":5.7,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11149636","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144990054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MuGNet-CMI: Multi-Head Hybrid Graph Neural Network for Predicting circRNA-miRNA Interactions With Global High-Order and Local Low-Order Information","authors":"Chen Jiang;Lei Wang;Changqing Yu;Zhuhong You;Xinfei Wang;Mengmeng Wei;Mianshuo Lu","doi":"10.1109/TBDATA.2025.3604175","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3604175","url":null,"abstract":"Circular RNAs (circRNAs) are non-coding RNA molecules that play a crucial role in regulating genes and contributing to disease progression. CircRNAs can function as sponges for microRNAs (miRNAs), thereby regulating gene expression and influencing disease outcomes. Identifying associations between circRNAs and miRNAs through computational methods enhances the understanding of complex disease mechanisms and offers a reliable tool for pre-selecting candidates for experimental validation. Existing models, however, are limited in their ability to capture either global or local node information, so the prediction of circRNA and miRNA interactions is still challenging. To address this problem, we propose a novel framework for predicting circRNA-miRNA interactions (CMIs), known as MuGNet-CMI, which leverages a multi-head hybrid graph neural network together with global high-order and local low-order information. The model employs the MetaPath2Vec algorithm to generate high-quality node embeddings within the circRNA-miRNA heterogeneous matrix. The multi-head dynamic attention mechanism, combined with GraphSAGE, is incorporated to efficiently capture both global high-order and local low-order node information. Additionally, we integrate neural aggregators into the multi-head dynamic attention mechanism to aggregate feature information from the captured nodes. 
Validation using three real datasets demonstrates that MuGNet-CMI delivers good performance in predicting CMIs, offering valuable insights to guide experimental research in gene regulation.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"159-173"},"PeriodicalIF":5.7,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Transport Barycentric Aggregation for Byzantine-Resilient Federated Learning","authors":"K Naveen Kumar;Srinivasa Rao Chalamala;Ajeet Kumar Singh;C Krishna Mohan","doi":"10.1109/TBDATA.2025.3604177","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3604177","url":null,"abstract":"Federated learning (FL) has emerged as a promising solution to enable distributed learning without sharing sensitive data. However, FL is vulnerable to data poisoning attacks, where malicious clients inject malicious data during training to compromise the global model. Existing FL defenses suffer from the assumptions of independent and identically distributed (IID) model updates, asymptotic optimal error rate bounds, and strong convexity in the optimization problem. Hence, we propose a novel framework called Federated Learning Optimal Transport (FLOT) that leverages the Wasserstein barycentric technique to obtain a global model from a set of locally trained non-IID models on client devices. In addition, we introduce a loss function-based rejection (LFR) mechanism to suppress malicious updates and a dynamic weighting scheme to optimize the Wasserstein barycentric aggregation function. We provide the theoretical proof of the Byzantine resilience and convergence of FLOT to highlight its efficacy. We evaluate FLOT on four benchmark datasets: GTSRB, KBTS, CIFAR10, and EMNIST. The experimental results underscore the practical significance of FLOT as an effective defense mechanism against data poisoning attacks in FL while maintaining high accuracy and scalability. 
Also, we observe that FLOT serves as a robust client selection technique under no attack, which demonstrates its effectiveness.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"174-185"},"PeriodicalIF":5.7,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Deduplication Parameters via a Change-Estimation Analytical Model","authors":"Owen Randall;Luke Schultz;Paul Lu","doi":"10.1109/TBDATA.2025.3604171","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3604171","url":null,"abstract":"Variable-sized, content-defined deduplication is a technique to find and eliminate redundant chunks of data for efficient data backups, reduced data transfers, and reduced data-storage overheads. For big datasets, especially with incremental updates over time such as backups and gathered data, deduplication makes data management faster and more efficient. While many existing deduplication systems use default expected chunk lengths such as 4 KB or 8 KB, they are suboptimal. Poorly optimized deduplication systems can significantly increase storage costs and network usage, making large datasets prohibitively expensive to manage. We present the design, implementation, and an empirical validation of our Deduplication Change-Estimation Analytical Model (DCAM) which predicts the performance of sliding window-based deduplication parameters on any given dataset, to be used for parameter optimization. Our empirical evaluation includes workloads based on source code (Linux kernel, Kubernetes, TensorFlow), open-research datasets (CORD-19), and articles (Wikipedia). Validated using both our system and the Destor deduplication system, a DCAM-based search finds deduplication parameters that require up to 3.8× less storage relative to a common baseline. 
DCAM Search optimizes parameters up to 19.8× faster than previously possible, and the sizes of the resulting deduplicated datasets are all within 5.15% of the best results found by searching using actual deduplication.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"135-146"},"PeriodicalIF":5.7,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"scProGraph: A Cell Bagging Strategy for Cell Type Annotation With Gene Interaction-Aware Explainability","authors":"Xinyuan Li;Yue-Chao Li;Hai-Ru You;Xuequn Shang;Leon Wong;Zhi-An Huang;Zhu-Hong You;Yu-An Huang","doi":"10.1109/TBDATA.2025.3604169","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3604169","url":null,"abstract":"The rapid advancement of scRNA-seq has generated massive data for cell type annotation. However, current automated annotation methods remain limited: most approaches separately model either cell-cell similarities or gene-gene relationships, neglecting their synergistic effects, which leads to suboptimal accuracy and poor biological interpretability. To address this, we propose scProGraph, a prototype-guided graph neural network that jointly models cell type classification and functional gene subgraph discovery. By constructing a cell similarity graph and incorporating cell-type prototypes as prior anchors, our method simultaneously optimizes classification boundaries and the interpretability of gene subgraphs. Experiments on seven independent datasets spanning three disease categories demonstrate that scProGraph achieves over 90% accuracy on four datasets and exceeds 80% on six datasets, outperforming state-of-the-art methods. Further analysis reveals that the gene subgraphs extracted by scProGraph for Macrophage, Fibroblast, and Monocyte cover 26.92%, 26.83%, and 22.22% of a protein-protein interaction networks dataset, respectively, validating the biological relevance of the identified gene modules. 
This study not only provides a high-accuracy tool for single-cell annotation but also opens new avenues for discovering novel biomarkers and regulatory mechanisms through gene relationship mining.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"147-158"},"PeriodicalIF":5.7,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDEC: Semantic Deep Embedded Clustering","authors":"Mohammad Wali Ur Rahman;Ric Nevarez;Lamia Tasnim Mim;Salim Hariri","doi":"10.1109/TBDATA.2025.3603433","DOIUrl":"https://doi.org/10.1109/TBDATA.2025.3603433","url":null,"abstract":"The high dimensional and semantically complex nature of textual Big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings is used by SDEC to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. 
These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC's advances in unsupervised text clustering.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"12 1","pages":"119-134"},"PeriodicalIF":5.7,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}