{"title":"UniDyG: A Unified and Effective Representation Learning Approach for Large Dynamic Graphs","authors":"Yuanyuan Xu;Wenjie Zhang;Xuemin Lin;Ying Zhang","doi":"10.1109/TKDE.2025.3566064","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3566064","url":null,"abstract":"Dynamic graphs, which capture time-evolving edges between nodes, are formulated as continuous-time or discrete-time dynamic graphs. They differ in temporal granularity: Continuous-Time Dynamic Graphs (CTDGs) exhibit rapid, localized changes, while Discrete-Time Dynamic Graphs (DTDGs) show gradual, global updates. This difference leads to isolated developments in representation learning for each type. To advance dynamic graph representation learning, recent research attempts to design a unified model capable of handling both CTDGs and DTDGs, achieving promising results. However, such work typically focuses on local dynamic propagation for temporal structure learning in the time domain, failing to accurately capture the underlying structural evolution associated with each temporal granularity and thus compromising model effectiveness. In addition, existing works, whether specific or unified, often overlook the issue of temporal noise, compromising the model’s robustness. To better model both types of dynamic graphs, we propose UniDyG, a unified and effective representation learning approach, which can scale to large dynamic graphs. Specifically, we first propose a novel Fourier Graph Attention (FGAT) mechanism that can model local and global structural correlations based on recent neighbors and complex-number selective aggregation, while theoretically ensuring consistent representations of dynamic graphs over time. Based on approximation theory, we demonstrate that FGAT is well-suited to capture the underlying structures in both CTDGs and DTDGs. We further enhance FGAT to resist temporal noise by designing an energy-gated unit, which adaptively filters out high-frequency noise according to its energy. 
Last, we leverage our proposed FGAT mechanisms for temporal structure learning and employ the frequency-enhanced linear function for node-level dynamic updates, facilitating the generation of high-quality temporal embeddings. Extensive experiments show that our UniDyG achieves an average improvement of 14.4% over sixteen baselines across nine dynamic graphs while exhibiting superior robustness in noisy scenarios.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4373-4388"},"PeriodicalIF":8.9,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144231999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
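The energy-gated unit described in the UniDyG abstract above (adaptively filtering high-frequency noise according to its energy) can be illustrated on a 1-D signal. Below is a minimal NumPy sketch under assumed semantics: keep the strongest spectral components until a fixed share of the total energy is retained. The function name and the `keep_ratio` parameter are illustrative, not from the paper:

```python
import numpy as np

def energy_gated_filter(signal, keep_ratio=0.9):
    """Keep the strongest FFT components until `keep_ratio` of the total
    spectral energy is retained; zero out the rest. A hypothetical reading
    of an 'energy-gated' denoising unit, not the paper's exact design."""
    spec = np.fft.rfft(signal)
    energy = np.abs(spec) ** 2
    order = np.argsort(energy)[::-1]                      # strongest first
    cumulative = np.cumsum(energy[order]) / energy.sum()
    kept = order[: np.searchsorted(cumulative, keep_ratio) + 1]
    gated = np.zeros_like(spec)
    gated[kept] = spec[kept]
    return np.fft.irfft(gated, n=len(signal))

# A clean low-frequency wave plus weak high-frequency noise:
t = np.linspace(0, 1, 256, endpoint=False)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean + 0.3 * np.sin(2 * np.pi * 60 * t)
denoised = energy_gated_filter(noisy, keep_ratio=0.9)
```

Here the 3 Hz carrier holds roughly 92% of the spectral energy, so the gate drops the 60 Hz component and the reconstruction is numerically indistinguishable from the clean signal.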
{"title":"Triangle Topology Enhancement for Multi-View Graph Clustering","authors":"Danyang Wu;Penglei Wang;Jitao Lu;Zhanxuan Hu;Hongming Zhang;Feiping Nie","doi":"10.1109/TKDE.2025.3566387","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3566387","url":null,"abstract":"Most existing multi-view graph clustering models focus on integrating the topological structure of different views directly, which cannot efficiently stimulate the collaboration between multiple views. To alleviate this problem, this paper proposes a Triangle Topology Enhancement (T<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>E) module, which expands two topological structures based on the raw topology of each view, including the self-triangle enhanced topology that highlights the local view information and the cross-view triangle enhanced topology containing the global-local view information. Afterward, this paper designs a novel multi-view graph clustering model, named MGC-T<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>E, to integrate both the raw and derived topological structures and directly induce consistent clustering indicators based on a self-supervised clustering module. 
Experimental results demonstrate that MGC-T<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>E achieves state-of-the-art performance compared with a wide range of current competitors.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4338-4348"},"PeriodicalIF":8.9,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144229475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
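One common way to realize a "self-triangle enhanced" topology like the one described in the abstract above is to weight each edge by the number of triangles it closes. The NumPy sketch below shows that construction; it is a plausible illustration, not necessarily MGC-T^2E's exact formulation:

```python
import numpy as np

def self_triangle_enhance(A):
    """Add to each edge of a binary adjacency matrix the number of
    triangles that edge participates in. A plausible sketch of a
    'self-triangle enhanced' topology, not the paper's exact definition."""
    tri = (A @ A) * A   # (A @ A)[i, j] counts common neighbors; mask to edges
    return A + tri

# Nodes 0-1-2 form a triangle; node 3 hangs off node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
E = self_triangle_enhance(A)
```

Edges inside the triangle get weight 2 while the pendant edge keeps weight 1, so locally dense structure is amplified before any cross-view integration.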
{"title":"Graph Portfolio: High-Frequency Factor Predictors via Heterogeneous Continual GNNs","authors":"Min Hu;Zhizhong Tan;Bin Liu;Guosheng Yin","doi":"10.1109/TKDE.2025.3566111","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3566111","url":null,"abstract":"This study aims to address the challenges of financial price prediction in high-frequency trading (HFT) by introducing a novel continual learning framework based on factor predictors via graph neural networks. The model integrates multi-factor pricing theory with real-time market dynamics, effectively bypassing the limitations of conventional time series forecasting methods, which often lack financial theory guidance and ignore market correlations. We propose three heterogeneous tasks, including price gap regression, changepoint detection, and price moving average regression to trace the short-, intermediate-, and long-term trend factors present in the data. We also account for the cross-sectional correlations inherent in the financial market, where prices of different assets show strong dynamic correlations. To accurately capture these dynamic relationships, we resort to spatio-temporal graph neural network (STGNN) to enhance the predictive power of the model. Our model allows a continual learning strategy to simultaneously consider these tasks (factors). To tackle the catastrophic forgetting in continual learning while considering the heterogeneity of tasks, we propose to calculate parameter importance with mutual information between original observations and the extracted features. Empirical studies on the Chinese futures data and U.S. 
equity data demonstrate the superior performance of the proposed model compared to other state-of-the-art approaches.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4104-4116"},"PeriodicalIF":8.9,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
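The mutual-information-based parameter importance mentioned in the abstract above can be illustrated with a generic histogram estimator of I(X; Y). The estimator, bin count, and synthetic data below are all illustrative stand-ins, not the paper's setup:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X; Y) in nats between two 1-D samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)                       # 'original observations'
informative = x + 0.1 * rng.normal(size=5000)   # feature tightly coupled to x
irrelevant = rng.normal(size=5000)              # independent feature
mi_strong = mutual_information(x, informative)
mi_weak = mutual_information(x, irrelevant)
```

A feature that nearly copies the observations scores high MI while an independent one scores near zero, which is the kind of signal such a scheme uses to rank parameter importance.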
{"title":"G-thinkerQ: A General Subgraph Querying System With a Unified Task-Based Programming Model","authors":"Lyuheng Yuan;Guimu Guo;Da Yan;Saugat Adhikari;Jalal Khalil;Cheng Long;Lei Zou","doi":"10.1109/TKDE.2025.3537964","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3537964","url":null,"abstract":"Given a large graph <inline-formula><tex-math>$G$</tex-math></inline-formula>, a subgraph query <inline-formula><tex-math>$Q$</tex-math></inline-formula> finds the set of all subgraphs of <inline-formula><tex-math>$G$</tex-math></inline-formula> that satisfy certain conditions specified by <inline-formula><tex-math>$Q$</tex-math></inline-formula>. Examples of subgraph queries include finding a community containing designated members to organize an event, and subgraph matching. To overcome the weakness of existing graph-parallel systems that underutilize CPU cores when finding subgraphs, our prior system, G-thinker, was proposed, adopting a novel think-like-a-task (TLAT) parallel programming model. However, G-thinker targets offline analytics and cannot support interactive online querying where users continually submit subgraph queries with different query contents. The challenges here are (i) how to maintain fairness so that queries are answered in the order in which they are received: a later query is processed only if earlier queries cannot saturate the available computation resources; (ii) how to track the progress of active queries (each with many tasks under computation) so that users can be notified promptly as soon as a query completes; and (iii) how to maintain memory boundedness and high task concurrency as in G-thinker. In this article, we propose a novel TLAT programming framework, called G-thinkerQ, for answering online subgraph queries. G-thinkerQ inherits the memory boundedness and high task concurrency of G-thinker by organizing the tasks of each query using a “task capsule” structure, and designs a novel task-capsule list to ensure fairness among queries. 
A novel lineage-based mechanism is also designed to keep track of when the last task of a query is completed. Parallel counterparts of the state-of-the-art algorithms for 4 recent advanced subgraph queries are implemented on G-thinkerQ to demonstrate its CPU-scalability.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3429-3444"},"PeriodicalIF":8.9,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
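The fairness policy described above (a later query runs only when earlier queries cannot fill the available workers) can be sketched with a toy task-capsule list. Names and structure here are illustrative; the real system adds memory-bounded task management and the lineage-based completion tracking mentioned in the abstract:

```python
from collections import deque

class CapsuleList:
    """Toy task-capsule list: each query owns a capsule (queue) of tasks,
    capsules sit in arrival order, and a worker always draws from the
    earliest-arrived query that still has pending tasks."""

    def __init__(self):
        self.capsules = deque()   # (query_id, tasks) in arrival order

    def submit(self, query_id, tasks):
        self.capsules.append((query_id, deque(tasks)))

    def next_task(self):
        for query_id, tasks in self.capsules:
            if tasks:
                return query_id, tasks.popleft()
        return None               # no pending work anywhere

q = CapsuleList()
q.submit("Q1", ["t1a", "t1b"])
q.submit("Q2", ["t2a"])
drained = [q.next_task() for _ in range(3)]
```

Q2's task is reached only after Q1's capsule is drained; with more workers than Q1 has tasks, the spare workers would fall through to Q2, matching the saturation rule.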
{"title":"Learning Causal Representations Based on a GAE Embedded Autoencoder","authors":"Kuang Zhou;Ming Jiang;Bogdan Gabrys;Yong Xu","doi":"10.1109/TKDE.2025.3546607","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3546607","url":null,"abstract":"Traditional machine-learning approaches face limitations when confronted with insufficient data. Transfer learning addresses this by leveraging knowledge from closely related domains. The key in transfer learning is to find a transferable feature representation to enhance cross-domain classification models. However, in some scenarios, some features correlated with samples in the source domain may not be relevant to those in the target domain. Causal inference enables us to uncover the underlying patterns and mechanisms within the data, mitigating the impact of confounding factors. Nevertheless, most existing causal inference algorithms have limitations when applied to high-dimensional datasets with nonlinear causal relationships. In this work, a new causal representation method based on a Graph autoencoder embedded AutoEncoder, named GeAE, is introduced to learn invariant representations across domains. The proposed approach employs a causal structure learning module, similar to a graph autoencoder, to account for nonlinear causal relationships present in the data. Moreover, the cross-entropy loss as well as the causal structure learning loss and the reconstruction loss are incorporated into the objective function of a unified autoencoder. This method allows for the handling of high-dimensional data and can provide effective representations for cross-domain classification tasks. 
Experimental results on generated and real-world datasets demonstrate the effectiveness of GeAE compared with the state-of-the-art methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3472-3484"},"PeriodicalIF":8.9,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
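Graph-autoencoder-based causal structure learners of this kind typically score candidate adjacency matrices with a NOTEARS-style acyclicity penalty. The sketch below shows that standard ingredient (with a truncated-series matrix exponential for self-containment), without claiming it is GeAE's exact loss:

```python
import numpy as np

def acyclicity_penalty(W, terms=20):
    """NOTEARS-style penalty h(W) = tr(exp(W * W)) - d, which is zero
    exactly when the weighted graph W is a DAG. The matrix exponential
    is computed with a truncated power series to stay dependency-free."""
    d = W.shape[0]
    M = W * W                 # elementwise square: sign-free edge weights
    expm = np.eye(d)
    term = np.eye(d)
    for k in range(1, terms):
        term = term @ M / k   # accumulates M^k / k!
        expm = expm + term
    return float(np.trace(expm) - d)

dag = np.array([[0.0, 0.8], [0.0, 0.0]])   # 0 -> 1 only: acyclic
cyc = np.array([[0.0, 0.8], [0.5, 0.0]])   # 0 <-> 1: a 2-cycle
```

In such methods this penalty is added to the reconstruction and classification losses, steering the learned structure toward a DAG.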
{"title":"Valuing Training Data via Causal Inference for In-Context Learning","authors":"Xiaoling Zhou;Wei Ye;Zhemg Lee;Lei Zou;Shikun Zhang","doi":"10.1109/TKDE.2025.3546761","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3546761","url":null,"abstract":"In-context learning (ICL) empowers large pre-trained language models (PLMs) to predict outcomes for unseen inputs without parameter updates. However, the efficacy of ICL heavily relies on the choice of demonstration examples. Randomly selecting from the training set frequently leads to inconsistent performance. Addressing this challenge, this study takes a novel approach by focusing on training data valuation through causal inference. Specifically, we introduce the concept of average marginal effect (AME) to quantify the contribution of individual training samples to ICL performance, encompassing both its generalization and robustness. Drawing inspiration from multiple treatment effects and randomized experiments, we initially sample diverse training subsets to construct prompts and evaluate the ICL performance based on these prompts. Subsequently, we employ Elastic Net regression to collectively estimate the AME values for all training data, considering subset compositions and inference performance. Ultimately, we prioritize samples with the highest values to prompt the inference of the test data. Across various tasks and with seven PLMs ranging in size from 0.8B to 33B, our approach consistently achieves state-of-the-art performance. Particularly, it outperforms Vanilla ICL and the best-performing baseline by an average of 14.1% and 5.2%, respectively. Moreover, prioritizing the most valuable samples for prompting leads to a significant enhancement in performance stability and robustness across various learning scenarios. 
Impressively, the valuable samples exhibit transferability across diverse PLMs and generalize well to out-of-distribution tasks.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3824-3840"},"PeriodicalIF":8.9,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143902653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
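The AME estimation pipeline described above (sample subsets, score prompts, regress scores on membership) can be sketched end-to-end on synthetic data. Plain least squares stands in for the paper's Elastic Net, and all quantities are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_subsets = 8, 200

# Hypothetical per-example contributions that AME tries to recover.
true_value = np.array([0.9, 0.1, 0.0, 0.5, 0.0, 0.3, 0.0, 0.7])

# Randomized 'treatments': which training examples enter each prompt.
membership = rng.integers(0, 2, size=(n_subsets, n_train)).astype(float)

# Observed ICL performance of each prompt: additive contributions + noise.
score = membership @ true_value + 0.01 * rng.normal(size=n_subsets)

# The paper fits an Elastic Net over (membership, score); ordinary least
# squares is used here as a dependency-free stand-in for the same idea.
est, *_ = np.linalg.lstsq(membership, score, rcond=None)
top3 = np.argsort(est)[::-1][:3]   # most valuable examples to prompt with
```

With enough sampled subsets the regression recovers the per-example values, and the highest-valued examples are the ones prioritized for prompting.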
{"title":"Do as I Can, Not as I Get: Topology-Aware Multi-Hop Reasoning on Multi-Modal Knowledge Graphs","authors":"Shangfei Zheng;Hongzhi Yin;Tong Chen;Quoc Viet Hung Nguyen;Wei Chen;Lei Zhao","doi":"10.1109/TKDE.2025.3546686","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3546686","url":null,"abstract":"A multi-modal knowledge graph (MKG) includes triplets that consist of entities and relations and multi-modal auxiliary data. In recent years, multi-hop multi-modal knowledge graph reasoning (MMKGR) based on reinforcement learning (RL) has received extensive attention because it addresses the intrinsic incompleteness of MKG in an interpretable manner. However, its performance is limited by empirically designed rewards and sparse relations. In addition, this method has been designed for the transductive setting where test entities have been seen during training, and it works poorly in the inductive setting where test entities do not appear in the training set. To overcome these issues, we propose <bold>TMR</b> (<bold>T</b>opology-aware <bold>M</b>ulti-hop <bold>R</b>easoning), which can conduct MKG reasoning under inductive and transductive settings. Specifically, TMR mainly consists of two components. (1) The topology-aware inductive representation captures information from the directed relations of unseen entities, and aggregates query-related topology features in an attentive manner to generate the fine-grained entity-independent features. (2) After completing multi-modal feature fusion, the relation-augmented adaptive RL conducts multi-hop reasoning by eliminating manual rewards and dynamically adding actions. Finally, we construct new MKG datasets with different scales for inductive reasoning evaluation. 
Experimental results demonstrate that TMR outperforms state-of-the-art MMKGR methods under both inductive and transductive settings.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2405-2419"},"PeriodicalIF":8.9,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Counting and Utilizing Induced 6-Cycles in Bipartite Networks","authors":"Jason Niu;Jaroslaw Zola;Ahmet Erdem Sarıyüce","doi":"10.1109/TKDE.2025.3546516","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3546516","url":null,"abstract":"Bipartite graphs are a powerful tool for modeling the interactions between two distinct groups. These bipartite relationships often feature small, recurring structural patterns called motifs which are building blocks for community structure. One promising structure is the induced 6-cycle which consists of three nodes on each node set forming a cycle where each node has exactly two edges. In this paper, we study the problem of counting and utilizing induced 6-cycles in large bipartite networks. We first consider two adaptations inspired by previous works for cycle counting in bipartite networks. Then, we introduce a new approach for node triplets which offer a systematic way to count the induced 6-cycles, used in <small>BatchTripletJoin</small>. Our experimental evaluation shows that <small>BatchTripletJoin</small> is significantly faster than the other algorithms while being scalable to large graph sizes and number of cores. On a network with <inline-formula><tex-math>$ 112M$</tex-math></inline-formula> edges, <small>BatchTripletJoin</small> is able to finish the computation in 78 mins by using 52 threads. In addition, we provide a new way to identify anomalous node triplets by comparing and contrasting the butterfly and induced 6-cycle counts of the nodes. 
We showcase several case studies on real-world networks from Amazon Kindle ratings, Steam game reviews, and Yelp ratings.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3386-3398"},"PeriodicalIF":8.9,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
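For intuition, induced 6-cycles can be counted by brute force directly from the definition in the abstract above: a triple of nodes on each side induces a 6-cycle exactly when every chosen node has degree 2 inside the 3x3 bi-adjacency submatrix. The sketch below is only that naive check; it shows the object BatchTripletJoin counts, not its algorithm:

```python
import numpy as np
from itertools import combinations

def count_induced_6cycles(B):
    """Count induced 6-cycles in a bipartite graph with 0/1 bi-adjacency
    matrix B. A 3+3 node choice induces a 6-cycle iff every row and column
    of the 3x3 submatrix sums to exactly 2 (each node keeps exactly two
    edges). Cubic-in-each-side brute force, for illustration only."""
    n_left, n_right = B.shape
    count = 0
    for rows in combinations(range(n_left), 3):
        sub_rows = B[list(rows), :]
        for cols in combinations(range(n_right), 3):
            sub = sub_rows[:, list(cols)]
            if (sub.sum(axis=1) == 2).all() and (sub.sum(axis=0) == 2).all():
                count += 1
    return count

# K_{3,3} minus a perfect matching is exactly one induced 6-cycle;
# the full K_{3,3} contains none (every degree is 3, so no 6-cycle is induced).
one_cycle = 1 - np.eye(3, dtype=int)
```

The row/column-sum test works because a 3x3 0/1 matrix with all line sums equal to 2 is the complement of a permutation matrix, and the corresponding 2-regular bipartite graph on 3+3 nodes must be a single 6-cycle.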
{"title":"PipeFilter: Parallelizable and Space-Efficient Filter for Approximate Membership Query","authors":"Shankui Ji;Yang Du;He Huang;Yu-E Sun;Jia Liu;Yapeng Shu","doi":"10.1109/TKDE.2025.3543881","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3543881","url":null,"abstract":"Approximate membership query data structures (i.e., filters) have ubiquitous applications in databases and data mining. Cuckoo filters are emerging as the alternative to Bloom filters because they support deletions and usually have higher operation throughput and space efficiency. However, their designs are confined to a single-threaded execution paradigm and consequently cannot fully exploit the parallel processing capabilities of modern hardware. This paper presents PipeFilter, a faster and more space-efficient filter that harnesses pipeline parallelism for superior performance. PipeFilter re-architects the Cuckoo filter by partitioning its data structure into several sub-filters, each providing a candidate position for every item. This allows the filter operations, including insertion, lookup, and deletion, to be naturally distributed across several pipeline stages, each overseeing one of the sub-filters, which can further be implemented through multi-threaded execution or pipeline stages of programmable hardware to achieve significantly higher throughput. Meanwhile, PipeFilter excels for single-threaded execution thanks to a combination of unique design features, including <i>block design</i>, <i>path prophet</i>, <i>round robin</i>, and <i>SIMD optimization</i>, such that it achieves performance superior to the SOTAs even when running with a single core. PipeFilter also has a competitive advantage in space utilization because it permits each item to explore more candidate positions. We implement and optimize PipeFilter on four platforms (single-core CPU, multi-core CPU, FPGA, and P4 ASIC). Experimental results demonstrate that PipeFilter surpasses all baseline methods on all four platforms. 
When running with a single core, it showcases a notable 15%<inline-formula><tex-math>$\sim$</tex-math></inline-formula>57% improvement in operation throughput and a high load factor exceeding 99%. When run in parallel on the other platforms, PipeFilter achieves <inline-formula><tex-math>$7\times \sim 800\times$</tex-math></inline-formula> higher throughput than single-threaded execution.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2816-2830"},"PeriodicalIF":8.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
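The sub-filter partitioning idea behind PipeFilter can be sketched in a few lines: each pipeline stage owns a sub-filter that gives every item one candidate bucket, and an item settles in the first stage with room. This toy omits cuckoo-style eviction, the block/path-prophet/round-robin/SIMD machinery, and the hardware pipelines; all names and sizes are illustrative:

```python
import hashlib

class PipeFilterSketch:
    """Toy approximate-membership filter partitioned into pipeline stages,
    each a sub-filter offering one candidate bucket per item."""

    def __init__(self, stages=4, buckets=64, slots=4):
        self.tables = [[[] for _ in range(buckets)] for _ in range(stages)]
        self.buckets, self.slots = buckets, slots

    def _fp_bucket(self, item, stage):
        # Per-stage hash: low byte is an 8-bit fingerprint, rest picks a bucket.
        h = hashlib.blake2b(f"{stage}:{item}".encode(), digest_size=8)
        v = int.from_bytes(h.digest(), "big")
        return v & 0xFF, (v >> 8) % self.buckets

    def insert(self, item):
        for stage in range(len(self.tables)):
            fp, b = self._fp_bucket(item, stage)
            if len(self.tables[stage][b]) < self.slots:
                self.tables[stage][b].append(fp)
                return True
        return False    # all candidate buckets full (no eviction in this toy)

    def lookup(self, item):
        for stage in range(len(self.tables)):
            fp, b = self._fp_bucket(item, stage)
            if fp in self.tables[stage][b]:
                return True
        return False

f = PipeFilterSketch()
all_inserted = all(f.insert(f"key-{i}") for i in range(100))
```

Because each stage's check is independent, lookups for a stream of items can overlap stage-by-stage, which is what multi-threaded or FPGA/P4 pipelines exploit; storing fingerprints rather than keys is what makes the answer approximate.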
{"title":"A Universal Pre-Training and Prompting Framework for General Urban Spatio-Temporal Prediction","authors":"Yuan Yuan;Jingtao Ding;Jie Feng;Depeng Jin;Yong Li","doi":"10.1109/TKDE.2025.3545948","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3545948","url":null,"abstract":"Urban spatio-temporal prediction is crucial for informed decision-making, such as traffic management, resource optimization, and emergency response. Despite remarkable breakthroughs in pretrained natural language models that enable one model to handle diverse tasks, a universal solution for spatio-temporal prediction remains challenging. Existing prediction approaches are typically tailored for specific spatio-temporal scenarios, requiring task-specific model designs and extensive domain-specific training data. In this study, we introduce UniST, a universal model designed for general urban spatio-temporal prediction across a wide range of scenarios. Inspired by large language models, UniST achieves success through: (i) utilizing diverse spatio-temporal data from different scenarios, (ii) effective pre-training to capture complex spatio-temporal dynamics, (iii) knowledge-guided prompts to enhance generalization capabilities. These designs together unlock the potential of building a universal model for various scenarios. 
Extensive experiments on more than 20 spatio-temporal scenarios, including grid-based data and graph-based data, demonstrate UniST’s efficacy in advancing state-of-the-art performance, especially in few-shot and zero-shot prediction.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2212-2225"},"PeriodicalIF":8.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}