"AToM: Adaptive Token Merging for Efficient Acceleration of Vision Transformer"
Jaekang Shin; Myeonggu Kang; Yunki Han; Junyoung Park; Lee-Sup Kim
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1620-1633, 2025. DOI: 10.1109/TC.2025.3540638

Abstract: Recently, Vision Transformers (ViTs) have set a new standard in computer vision (CV), showing unparalleled image processing performance. However, their substantial computational requirements hinder practical deployment, especially on the resource-limited devices common in CV applications. Token merging has emerged as a solution, condensing tokens with similar features to cut computational and memory demands. Yet existing applications on ViTs often miss the mark in token compression, with rigid merging strategies and a lack of in-depth analysis of ViT merging characteristics. To overcome these issues, this paper introduces Adaptive Token Merging (AToM), a comprehensive algorithm-architecture co-design for accelerating ViTs. The AToM algorithm employs an image-adaptive, fine-grained merging strategy, significantly boosting computational efficiency. We also optimize the merging and unmerging processes to minimize overhead, employing techniques like First-Come-First-Merge mapping and Linear Distance Calculation. On the hardware side, the AToM architecture is tailor-made to exploit the AToM algorithm's benefits, with specialized engines for efficient merge and unmerge operations. Our pipeline architecture ensures end-to-end ViT processing, minimizing the latency and memory overhead of the AToM algorithm. Across various hardware platforms including CPU, EdgeGPU, and GPU, AToM achieves average end-to-end speedups of 10.9×, 7.7×, and 5.4×, alongside energy savings of 24.9×, 1.8×, and 16.7×. Moreover, AToM offers 1.2×-1.9× higher effective throughput compared to existing transformer accelerators.

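The abstract above describes condensing tokens with similar features. As an illustrative aside (not the AToM algorithm itself, and with hypothetical names), a minimal greedy merge of the single most similar token pair by cosine similarity could look like this in plain Python:

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_most_similar(tokens):
    """Merge the two most similar tokens into their element-wise mean.

    Illustrative token merging: the token count drops by one, trading a
    little feature fidelity for less attention/FFN compute downstream.
    """
    best = None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if best is None or s > best[0]:
                best = (s, i, j)
    _, i, j = best
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(len(merge_most_similar(tokens)))  # 3 tokens reduced to 2
```

Repeating such a step per layer, and remembering which tokens were merged so they can later be unmerged for dense outputs, is the general shape of merge/unmerge pipelines; AToM's image-adaptive strategy and First-Come-First-Merge mapping are not reproduced here.
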
"Efficient and Secure Storage Verification in Cloud-Assisted Industrial IoT Networks"
Haiyang Yu; Hui Zhang; Zhen Yang; Yuwen Chen; Huan Liu
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1702-1716, 2025. DOI: 10.1109/TC.2025.3540661

Abstract: The rapid development of the Industrial IoT (IIoT) has caused an explosion of industrial data, which opens up promising possibilities for data analysis in IIoT networks. Due to their limited computation and storage capacity, IIoT devices outsource the collected data to remote cloud servers. Unfortunately, cloud storage services are not as reliable as they claim, and the loss of physical control over cloud data makes ensuring its integrity a significant challenge. Existing schemes are designed to check data integrity in the cloud, but the problem remains open because they require IIoT devices to devote substantial computation resources, which is especially unfriendly to resource-constrained devices. In this paper, we propose an efficient storage verification approach for cloud-assisted industrial IoT platforms that adopts a homomorphic hash function combined with a polynomial commitment. The proposed approach can efficiently generate verification tags and verify the integrity of data in the industrial cloud platform for IIoT devices. Moreover, the proposed scheme can be extended to support privacy-enhanced verification and dynamic updates. We prove the security of the proposed approach under the random oracle model. Extensive experiments demonstrate the superior performance of our approach for resource-constrained devices in comparison with the state of the art.

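The abstract mentions a homomorphic hash function for verification-tag generation. As a toy illustration only (tiny, insecure demo parameters; not the paper's construction), a multiplicative homomorphic hash lets a verifier check an aggregate of data blocks against a product of their tags:

```python
# Toy multiplicative homomorphic hash: H(m) = g^m mod p, messages in Z_q.
# Parameters are tiny demo values, NOT secure, and not the paper's scheme.
p, q, g = 23, 11, 4  # g generates a subgroup of order q modulo p

def H(m):
    return pow(g, m % q, p)

m1, m2 = 7, 9
# Homomorphic property: the hash of a sum equals the product of hashes,
# so one check on aggregated data can validate many blocks at once.
assert H(m1 + m2) == (H(m1) * H(m2)) % p
print(H(m1), H(m2), H(m1 + m2))  # prints: 8 13 12
```

This property is what allows verification cost on the device side to stay roughly constant as the number of outsourced blocks grows.
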
"Aurora: Leaderless State-Machine Replication With High Throughput"
Hao Lu; Jian Liu; Kui Ren
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1690-1701, 2025. DOI: 10.1109/TC.2025.3540656

Abstract: State-machine replication (SMR) allows a deterministic state machine to be replicated across a set of replicas and to handle clients' requests as a single machine. Most existing SMR protocols are leader-based, requiring a leader to order requests and coordinate the protocol. This design places a disproportionately high load on the leader, inevitably impairing scalability. If the leader fails, a complex and bug-prone fail-over protocol is needed to switch to a new leader, and an adversary can exploit the fail-over protocol to slow the system down. In this paper, we propose a crash-fault-tolerant SMR named Aurora with the following properties:
• Leaderless: it does not require a leader, completely eliminating the fail-over protocol.
• Scalable: it can scale up to 11 replicas.
• Robust: it behaves well even under poor network connections.
We provide a full-fledged implementation of Aurora and systematically evaluate its performance. Our benchmark results show that Aurora achieves a throughput of around two million transactions per second (TPS), up to 8.7× higher than the state-of-the-art leaderless SMR.

"KDN-Based Adaptive Computation Offloading and Resource Allocation Strategy Optimization: Maximizing User Satisfaction"
Kaiqi Yang; Qiang He; Xingwei Wang; Zhi Liu; Yufei Liu; Min Huang; Liang Zhao
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1743-1757, 2025. DOI: 10.1109/TC.2025.3541142

Abstract: In large-scale dynamic network environments, optimizing the computation offloading and resource allocation strategy is key to improving resource utilization and meeting the diverse demands of user equipment (UE). However, traditional strategies for providing personalized computing services face several challenges: dynamic changes in the environment and in UE demands, the inefficiency and high cost of real-time data collection, and the unpredictability of resource status, which makes long-term UE satisfaction impossible to guarantee. To address these challenges, we propose a Knowledge-Defined Networking (KDN)-based Adaptive Edge Resource Allocation Optimization (KARO) architecture that facilitates real-time data collection and analysis of environmental conditions. Additionally, we implement an environmental resource change perception module in KARO to assess current and future resource utilization trends. Based on the real-time state and resource urgency, we develop a deep reinforcement learning-based Adaptive Long-term Computation Offloading and Resource Allocation (AL-CORA) strategy optimization algorithm. This algorithm adapts to environmental resource urgency, autonomously balancing UE satisfaction against task execution cost. Experimental results indicate that AL-CORA effectively improves long-term UE satisfaction and task execution success rates under limited computation resources.

"A Low-Rate DoS Attack Mitigation Scheme Based on Port and Traffic State in SDN"
Dan Tang; Rui Dai; Chenguang Zuo; Jingwen Chen; Keqin Li; Zheng Qin
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1758-1770, 2025. DOI: 10.1109/TC.2025.3541143

Abstract: Low-rate Denial of Service (DoS) attacks can significantly compromise network availability and are difficult to detect and mitigate due to their stealthy exploitation of flaws in congestion control mechanisms. Software-Defined Networking (SDN) is a revolutionary architecture that decouples network control from packet forwarding, emerging as a promising solution for defending against low-rate DoS attacks. In this paper, we propose Trident, a low-rate DoS attack mitigation scheme based on port and traffic state in SDN. Specifically, we design a multi-step strategy to monitor switch states. First, Trident identifies switches suspected of suffering from low-rate DoS attacks through port state detection. Then, it monitors the traffic state of switches with abnormal port states. Once a switch is identified as suffering from an attack, Trident analyzes the flow information to pinpoint the malicious flow. Finally, Trident issues rules to the switch's flow table to block the malicious flow, effectively mitigating the attack. We prototype Trident on the Mininet platform and conduct experiments using a real-world topology to evaluate its performance. The experiments show that Trident can accurately and robustly detect low-rate DoS attacks, respond quickly to mitigate them, and maintain low overhead.

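Port-state detection, the first step described above, amounts to flagging ports whose traffic deviates sharply from recent history. As a hedged sketch (a generic EWMA burst detector, not Trident's actual detection logic):

```python
def detect_bursts(samples, alpha=0.3, threshold=3.0):
    """Flag sample indices that exceed threshold x the running EWMA.

    Generic burst detector: periodic short high-rate pulses, the telltale
    shape of low-rate DoS traffic, stand out against the smoothed mean.
    """
    ewma = samples[0]
    suspicious = []
    for i, x in enumerate(samples[1:], start=1):
        if ewma > 0 and x > threshold * ewma:
            suspicious.append(i)  # burst relative to recent history
        ewma = alpha * x + (1 - alpha) * ewma  # update smoothed history
    return suspicious

# Steady background traffic with two short bursts at indices 4 and 8.
rates = [10, 11, 9, 10, 80, 10, 9, 11, 85, 10]
print(detect_bursts(rates))  # -> [4, 8]
```

A real SDN deployment would run such checks on per-port counters polled from switches, escalating flagged ports to the finer-grained traffic-state analysis the abstract describes.
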
"PQNTRU: Acceleration of NTRU-Based Schemes via Customized Post-Quantum Processor"
Zewen Ye; Junhao Huang; Tianshun Huang; Yudan Bai; Jinze Li; Hao Zhang; Guangyan Li; Donglong Chen; Ray C. C. Cheung; Kejie Huang
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1649-1662, 2025. DOI: 10.1109/TC.2025.3540647

Abstract: Post-quantum cryptography (PQC) has rapidly evolved in response to the emergence of quantum computers, with the US National Institute of Standards and Technology (NIST) selecting four finalist algorithms for PQC standardization in 2022, including the Falcon digital signature scheme. Hawk is currently the only lattice-based candidate among the NIST Round 2 additional signatures. Falcon and Hawk are based on the NTRU lattice, offering compact signatures and fast generation and verification suitable for deployment on resource-constrained Internet-of-Things (IoT) devices. Despite the popularity of ML-DSA and ML-KEM, research on NTRU-based schemes has been limited due to their complex algorithms and operations. Falcon and Hawk's performance remains constrained by the lack of parallel execution in crucial operations like the Number Theoretic Transform (NTT) and Fast Fourier Transform (FFT), with data dependency being a significant bottleneck. This paper enhances the NTRU-based schemes Falcon and Hawk through hardware/software co-design on a customized Single-Instruction-Multiple-Data (SIMD) processor, proposing new SIMD hardware units and instructions to expedite these schemes, along with software optimizations to boost performance. Our NTT optimization includes a novel layer-merging technique for the SIMD architecture to reduce memory accesses, and the use of modular algorithms (signed Montgomery and improved Plantard) targets various modulus data widths to enhance performance. We explore applying layer merging to accelerate fixed-point FFT at the SIMD instruction level and devise a dual-issue parser to streamline assembly code organization and maximize dual-issue utilization. A system-on-chip (SoC) architecture is devised to improve the practical application of the processor in real-world scenarios. Evaluation on 28 nm technology and a field-programmable gate array (FPGA) platform shows that our design and optimizations can increase the performance of Hawk signature generation and verification by over 7×.

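The record above centers on modular arithmetic for the NTT. As a plain-Python illustration of the underlying idea (unsigned Montgomery reduction; the paper's signed Montgomery and improved Plantard variants tuned for SIMD are not reproduced here):

```python
Q = 12289               # an NTT-friendly prime (used, e.g., by Falcon)
R = 1 << 16             # Montgomery radix: R > Q, gcd(R, Q) = 1
Q_INV = pow(-Q, -1, R)  # precomputed -Q^-1 mod R (Python 3.8+)

def mont_reduce(t):
    """Return t * R^-1 mod Q for 0 <= t < Q*R, with no division by Q."""
    m = (t * Q_INV) % R            # cheap low-half multiply
    return ((t + m * Q) >> 16) % Q  # exact shift replaces the division

def mont_mul(a_m, b_m):
    # Multiply two values kept in Montgomery form (x' = x * R mod Q).
    return mont_reduce(a_m * b_m)

a, b = 1234, 5678
a_m, b_m = (a * R) % Q, (b * R) % Q
# Reducing once more leaves Montgomery form and recovers a*b mod Q.
assert mont_reduce(mont_mul(a_m, b_m)) == (a * b) % Q
```

Keeping operands in Montgomery form means every butterfly in an NTT pays only multiplies, adds, and shifts, never a slow integer division, which is one reason such reductions map well onto SIMD pipelines.
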
"29-Billion Atoms Molecular Dynamics Simulation With Ab Initio Accuracy on 35 Million Cores of New Sunway Supercomputer"
Xun Wang; Xiangyu Meng; Zhuoqiang Guo; Mingzhen Li; Lijun Liu; Mingfan Li; Qian Xiao; Tong Zhao; Ninghui Sun; Guangming Tan; Weile Jia
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1634-1648, 2025. DOI: 10.1109/TC.2025.3540646

Abstract: Physical phenomena such as bond breaking and phase transitions require molecular dynamics (MD) with ab initio accuracy, involving up to billions of atoms and over-nanosecond timescales. Previous state-of-the-art work has demonstrated that neural network molecular dynamics (NNMD), such as deep potential molecular dynamics (DeePMD), can successfully extend the temporal and spatial scales of MD with ab initio accuracy on both ARM and GPU platforms. However, the DeePMD-kit package is currently unable to fully exploit the computational potential of the new Sunway supercomputer due to its unique many-core architecture, memory hierarchy, and low-precision capability. In this paper, we re-design DeePMD-kit to harness the massive computing power of the new Sunway, enabling MD with over ten billion atoms. We first design a large-scale parallelization scheme to exploit the massive parallelism of the new Sunway. Then we devise specialized optimizations for the time-consuming operators. Finally, we design a novel mixed-precision method for DeePMD-kit's customized operators to leverage the low-precision computing power of the new Sunway. The optimized DeePMD-kit achieves 67.6× / 56.5× speedups for water / copper systems on the new Sunway. Meanwhile, it can perform a 29-billion-atom simulation of the water system on 35 million cores (i.e., 90,000 computing nodes, around 84% of the whole supercomputer) with a peak performance of 57.1 PFLOPS, which is 7.9× larger and 1.2× faster than state-of-the-art results. This paves the way for investigating more realistic scenarios, such as studying the mechanical properties of metals, semiconductor devices, batteries, and other materials and physical systems.

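The mixed-precision idea above (low-precision compute with carefully placed high-precision accumulation) can be demonstrated in miniature with Python's struct module. This is a conceptual sketch, not the Sunway kernels:

```python
import struct

def f32(x):
    # Round a Python float (binary64) to the nearest binary32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

# A float32 accumulator silently drops addends smaller than half an ulp
# of the running sum; a float64 accumulator over float32 inputs keeps them.
big, tiny, n = 100.0, f32(1e-8), 1000

s32 = f32(big)
s64 = big
for _ in range(n):
    s32 = f32(s32 + tiny)  # 1e-8 < ulp(100)/2 in float32: sum never moves
    s64 = s64 + tiny       # float64 accumulator retains every term

print(s32, s64)  # s32 is still exactly 100.0; s64 is about 100.00001
```

Mixed-precision NNMD designs exploit exactly this asymmetry: bulk operator arithmetic runs in fast low precision while sensitive reductions stay in high precision to preserve accuracy.
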
"Defying Multi-Model Forgetting in One-Shot Neural Architecture Search Using Orthogonal Gradient Learning"
Lianbo Ma; Yuee Zhou; Ye Ma; Guo Yu; Qing Li; Qiang He; Yan Pei
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1678-1689, 2025. DOI: 10.1109/TC.2025.3540650

Abstract: One-shot neural architecture search (NAS) trains an over-parameterized network (termed the supernet) that assembles all candidate architectures as its subnets, using weight sharing to reduce the computational budget. However, supernet training suffers from multi-model forgetting: some weights of a previously well-trained architecture are overwritten by those of a newly sampled architecture that shares overlapping structures with the old one. To overcome this issue, we propose an orthogonal gradient learning (OGL)-guided supernet training paradigm, whose novelty lies in updating the weights of the overlapped structures of the current architecture in the direction orthogonal to the gradient space of the overlapped structures of all previously trained architectures. Moreover, a new approach to calculating the projection is designed to effectively find the basis vectors of the gradient space and acquire the orthogonal direction. We have theoretically and experimentally proved the effectiveness of the proposed paradigm in overcoming multi-model forgetting. Besides, we apply the proposed paradigm to two one-shot NAS baselines, and experimental results demonstrate that our approach mitigates multi-model forgetting and enhances the predictive ability of the supernet with remarkable efficiency on popular test datasets.

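The core update described above, stepping orthogonally to the gradient space of previously trained architectures, reduces to removing components along a basis. A toy projection (hypothetical names; not the paper's projection-calculation method):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(g, basis):
    """Remove from gradient g its components along an orthonormal basis.

    What remains is orthogonal to every previous direction, so a step
    along it cannot undo progress already made along those directions.
    """
    out = list(g)
    for b in basis:
        c = dot(out, b)
        out = [x - c * bi for x, bi in zip(out, b)]
    return out

# One previous gradient direction (already normalized): the x-axis.
g_perp = project_orthogonal([3.0, 4.0], [[1.0, 0.0]])
print(g_perp)  # -> [0.0, 4.0]: the shared component is gone
```

The practical difficulty the paper addresses is building and maintaining a usable basis for that gradient space as many architectures are sampled, which this two-dimensional sketch sidesteps entirely.
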
"Blockchain-Based Privacy-Preserving Deduplication and Integrity Auditing in Cloud Storage"
Qingyang Zhang; Shuai Qian; Jie Cui; Hong Zhong; Fengqun Wang; Debiao He
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1717-1729, 2025. DOI: 10.1109/TC.2025.3540670

Abstract: Ensuring cloud data security and reducing cloud storage costs have become particularly important. Many schemes expose user file ownership privacy when deduplicating authentication tags and during integrity auditing. Moreover, key management becomes more difficult as the number of files increases. Also, many audit schemes rely on third-party auditors (TPAs), but finding a fully trustworthy TPA is challenging. Therefore, we propose a blockchain-based integrity audit scheme supporting data deduplication. It protects file tag privacy during deduplication of ciphertexts and authentication tags, safeguards audit proof privacy, and effectively protects user file ownership privacy. To reduce key management costs, we introduce identity-based broadcast encryption (IBBE), which does not require interaction with key servers, eliminating additional communication costs. Additionally, we use smart contracts for integrity auditing, eliminating the need for a fully trusted TPA. We evaluate the proposed scheme through security and theoretical analyses and a series of experiments, demonstrating its efficiency and practicality.

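A standard building block behind privacy-preserving deduplication of ciphertexts is convergent (message-locked) encryption: deriving the key from the file content itself so identical plaintexts yield identical, dedupable ciphertexts. The abstract does not spell out this construction, so treat the sketch below as background rather than the paper's scheme (which builds on IBBE):

```python
import hashlib

def keystream(key, n):
    # Expand a key into n pseudorandom bytes via counter-mode hashing.
    out = b''
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, 'big')).digest()
        counter += 1
    return out[:n]

def convergent_encrypt(data):
    # The key depends only on the content, so equal files encrypt equally
    # and the cloud can deduplicate without ever seeing the plaintext.
    key = hashlib.sha256(data).digest()
    ct = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
    return key, ct

k1, c1 = convergent_encrypt(b'same file contents')
k2, c2 = convergent_encrypt(b'same file contents')
assert c1 == c2  # identical plaintexts -> identical ciphertexts (dedupable)
# XOR with the same keystream decrypts:
assert bytes(a ^ b for a, b in zip(c1, keystream(k1, len(c1)))) == b'same file contents'
```

The determinism that enables deduplication is also the source of the ownership-privacy leaks the paper targets, since matching ciphertexts reveal that two users hold the same file.
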
"RelHDx: Hyperdimensional Computing for Learning on Graphs With FeFET Acceleration"
Jaeyoung Kang; Minxuan Zhou; Weihong Xu; Tajana Rosing
IEEE Transactions on Computers, vol. 74, no. 5, pp. 1730-1742, 2025. DOI: 10.1109/TC.2025.3541141

Abstract: Graph neural networks (GNNs) are a powerful machine learning (ML) method for analyzing graph data. GNN training has compute- and memory-intensive phases along with irregular data movement, which makes in-memory acceleration challenging. We present a hyperdimensional computing (HDC)-based graph ML framework called RelHDx that aggregates node features and graph structure while representing node and edge information in high-dimensional space. RelHDx enables single-pass training and inference with simple arithmetic operations, resulting in efficient designs for graph-based ML tasks: node classification and link prediction. We accelerate RelHDx using a scalable processing-in-memory (PIM) architecture based on emerging ferroelectric FET (FeFET) technology. Our accelerator uses data allocation optimization and an operation scheduler to address the irregularity of graphs and maximize performance. Evaluation results show that RelHDx offers accuracy comparable to popular GNN-based algorithms while running up to 63.8× faster on a GPU. Our FeFET-based accelerator, RelHDx-PIM, is 32× faster for node classification and 65.4× faster for link prediction than a GPU. Furthermore, RelHDx-PIM improves energy efficiency by four orders of magnitude over a GPU. Compared to the state-of-the-art in-memory-processing-based GNN accelerator, PIM-GCN [1], RelHDx-PIM is 10× faster and 986× more energy-efficient on average.

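The HDC primitives behind frameworks like the one above, bipolar hypervectors combined by element-wise binding and majority bundling, fit in a few lines. This is a generic HDC sketch, not RelHDx's encoder:

```python
import random

D = 2048
random.seed(0)

def hv():
    # Random bipolar hypervector; any two are nearly orthogonal.
    return [random.choice((-1, 1)) for _ in range(D)]

def bind(a, b):
    return [x * y for x, y in zip(a, b)]       # element-wise multiply

def bundle(vs):
    return [1 if sum(col) >= 0 else -1 for col in zip(*vs)]  # majority

def sim(a, b):
    return sum(x * y for x, y in zip(a, b)) / D  # normalized dot product

node, role = hv(), hv()
# Bundle one role-bound node together with two unrelated hypervectors.
memory = bundle([bind(role, node), hv(), hv()])
# Binding again with the role (its own inverse) approximately recovers it.
probe = bind(role, memory)
print(sim(probe, node), sim(hv(), node))  # recovered >> random baseline
```

Single-pass training in such frameworks amounts to bundling encoded examples per class and classifying by similarity, which is why simple arithmetic and PIM acceleration fit HDC so well.
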