{"title":"TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-Up Cluster Design With High Bandwidth Main Memory Link","authors":"Yichao Zhang;Marco Bertuletti;Chi Zhang;Samuel Riedel;Diyou Shen;Bowen Wang;Alessandro Vanelli-Coralli;Luca Benini","doi":"10.1109/TC.2025.3603692","DOIUrl":"https://doi.org/10.1109/TC.2025.3603692","url":null,"abstract":"Shared L1-memory clusters of streamlined instruction processors (processing elements - PEs) are commonly used as building blocks in modern, massively parallel computing architectures (e.g. GP-GPUs). <i>Scaling out</i> these architectures by increasing the number of clusters incurs computational and power overhead, caused by the requirement to split and merge large data structures in chunks and move chunks across memory hierarchies via the high-latency global interconnect. <i>Scaling up</i> the cluster reduces buffering, copy, and synchronization overheads. However, the complexity of a fully connected cores-to-L1-memory crossbar grows quadratically with Processing Element (PE)-count, posing a major physical implementation challenge. We present TeraPool, a physically implementable, <inline-formula><tex-math>${\\boldsymbol >} 1000$</tex-math></inline-formula> floating-point-capable RISC-V PEs scaled-up cluster design, sharing a Multi-MegaByte <inline-formula><tex-math>${\\boldsymbol >} 4000$</tex-math></inline-formula>-banked L1 memory via a low latency hierarchical interconnect (1-7/9/11 cycles, depending on target frequency). Implemented in 12 nm FinFET technology, TeraPool achieves near-gigahertz frequencies (910 MHz) typical, 0.80 V/25 <inline-formula><tex-math>$^{\\boldsymbol{\\circ}}$</tex-math></inline-formula>C. The energy-efficient hierarchical PE-to-L1-memory interconnect consumes only 9-13.5 pJ for memory bank accesses, just 0.74-1.1<inline-formula><tex-math>${\\boldsymbol \\times}$</tex-math></inline-formula> the cost of a FP32 FMA. 
A high-bandwidth main memory link is designed to manage data transfers in/out of the shared L1, sustaining transfers at the full bandwidth of an HBM2E main memory. At 910 MHz, the cluster delivers up to 1.89 single precision TFLOP/s peak performance and up to 200 GFLOP/s/W energy efficiency (at a high IPC/PE of 0.8 on average) in benchmark kernels, demonstrating the feasibility of scaling a shared-L1 cluster to a thousand PEs, four times the PE count of the largest clusters reported in literature.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3667-3681"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An On-Board Executable Pareto-Based Iterated Local Search Algorithm for Embedded Multi-Core Processor Task Scheduling","authors":"Qinglin Zhao;Lixin Zhang;Qi Pan;Kunbo Cui;Mingqi Zhao;Fuze Tian;Bin Hu","doi":"10.1109/TC.2025.3603699","DOIUrl":"https://doi.org/10.1109/TC.2025.3603699","url":null,"abstract":"The advancement of wearable electronic technology has facilitated the integration of smart wearable devices into artificial intelligence (AI)-driven medical assisted diagnosis. Embedded multi-core processors (MPs) have gradually emerged as pivotal hardware components for smart wearable medical diagnostic devices due to their high performance and flexibility. However, embedded MPs face the challenge of balancing performance, power consumption, and load-balancing. In response, we introduce a Pareto-based iterated local search (PILS) algorithm for task scheduling, which systematically optimizes multiple objectives, alongside a task list model to reduce the dimension of the decision space and enhance scheduling performance. In addition, we present a two-stage discretization scheme to ensure that the proposed algorithm offers meaningful guidance throughout the scheduling process. 
Simulation and on-board testing results show that the proposed algorithm effectively optimizes energy consumption, task execution time, and load-balancing in embedded MPs task scheduling, indicating the potential of the proposed algorithm in enhancing the performance of smart wearable medical diagnostic devices powered by embedded MPs.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3696-3709"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FSA-Hash: Flow-Size-Aware Sketch Hashing for Software Switches","authors":"Fuliang Li;Kejun Guo;Yiming Lv;Jiaxing Shen;Yuting Liu;Xingwei Wang;Jiannong Cao","doi":"10.1109/TC.2025.3603716","DOIUrl":"https://doi.org/10.1109/TC.2025.3603716","url":null,"abstract":"In modern data centers and enterprise networks, software switches have become critical components for achieving flexible and efficient network management. Due to resource constraints in software switches, sketches have emerged as a promising approach for network traffic measurement. However, their accuracy is often impacted by hash collisions. Existing hash functions treat all collisions equally, failing to account for the differing impacts of collisions involving elephant flows versus mouse flows. We propose FSA-Hash, a novel flow-size-aware hashing scheme that separates elephant flows from each other and from mouse flows, minimizing the most detrimental collisions. FSA-Hash is designed based on two insights: separating elephant flows from mouse flows avoids overestimating mouse flows, while separating elephant flows from each other enables accurate heavy-hitter detection. We implement FSA-Hash using machine learning models trained on network traffic data (LFSA-Hash), and also design a lightweight online variant (OLFSA-Hash) that learns the hash model solely from sketch queries on the software switch, obviating traffic collection overheads. Evaluations across four sketches and two tasks demonstrate FSA-Hash’s superior accuracy over standard hash functions. 
Moreover, OLFSA-Hash closely matches LFSA-Hash’s performance, making it an attractive option for adaptively refining the hash model without monitoring traffic.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3736-3749"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCC: Synchronization Congestion Control for Multi-Tenant Learning Over Geo-Distributed Clouds","authors":"Chengxi Gao;Fuliang Li;Kejiang Ye;Yang Wang;Pengfei Wang;Xingwei Wang;Chengzhong Xu","doi":"10.1109/TC.2025.3604486","DOIUrl":"https://doi.org/10.1109/TC.2025.3604486","url":null,"abstract":"Distributed machine learning over geo-distributed clouds enables joint training on data located in different regions, alleviating the burden of transferring large volumes of training datasets, which greatly saves bandwidth. However, the limited capacity of WAN links slows inter-cloud communication, which significantly decelerates the synchronization of distributed machine learning over geo-distributed clouds. In addition, multi-tenancy in clouds results in multiple training tasks running simultaneously, whose synchronizations consistently compete with each other for the limited WAN bandwidth, which further degrades the training performance of each task. While existing works optimize synchronizations through techniques like gradient compression, multi-resource interleaving and so on, none of them targets the synchronization congestion caused specifically by multi-tenant learning, which results in inferior training performance. To solve these problems, we propose a simple but effective scheme, SCC, for fast and efficient multi-tenant learning via synchronization congestion control. SCC monitors the cross-cloud network conditions and evaluates the synchronization congestion level based on the round-trip transmission time for each synchronization. Then SCC alleviates synchronization congestion by controlling the synchronization frequency according to the synchronization congestion level in a probabilistic way. 
Extensive experiments are conducted on our testbed consisting of 16 NVIDIA V100 GPUs to evaluate the performance of SCC, and comparison results show that SCC can reduce the average training completion time and makespan by up to 28.6% and 43.2%, respectively, compared with SAP-SGD <xref>[1]</xref>. Targeted experiments are conducted to demonstrate the effectiveness and robustness of SCC.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3911-3924"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRRQ: Privacy-Preserving Resilient RkNN Query Over Encrypted Outsourced Multiattribute Data","authors":"Jing Wang;Haiyong Bao;Na Ruan;Qinglei Kong;Cheng Huang;Hong-Ning Dai","doi":"10.1109/TC.2025.3603688","DOIUrl":"https://doi.org/10.1109/TC.2025.3603688","url":null,"abstract":"Traditional reverse k-nearest neighbor (RkNN) query schemes typically assume that users are available online in real-time for interactive key reception, overlooking scenarios where users might be offline. Moreover, existing privacy-preserving RkNN query schemes primarily focus on user features or spatial data, neglecting the significance of user reputation values. To address these limitations, we propose a privacy-preserving resilient RkNN query scheme over encrypted outsourced multi-attribute data (PRRQ). Specifically, to mitigate the challenges posed by resilient online presence (i.e., non-real-time online) of users for interactive key reception, we incorporate a non-interactive key exchange (NIKE) protocol and the Diffie-Hellman two-party key exchange algorithm to propose a multi-party NIKE algorithm (2K-NIKE), facilitating non-interactive key reception for multiple users. Considering the privacy leakage issues, PRRQ encodes original multi-attribute data (i.e., spatial, feature, and reputation values) alongside query requests based on formalized criteria. Additionally, we integrate the proposed 2K-NIKE and the improved symmetric homomorphic encryption (iSHE) algorithms to encrypt them. Furthermore, catering to the requirements of ciphertext-based RkNN queries, we propose a private RkNN query eligibility-checking (PREC) algorithm and a private reputation-verifying (PRRV) algorithm, which validate the compliance of encrypted outsourced multi-attribute data with query requests. Security analysis demonstrates that PRRQ achieves simulation-based security under an <i>honest-but-curious</i> model. 
Experimental results show that PRRQ offers superior computational efficiency compared to comparative schemes.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3652-3666"},"PeriodicalIF":3.8,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing In-Network Computing Deployment via Collaboration Across Planes","authors":"Xiaoquan Zhang;Lin Cui;WaiMing Lau;Fung Po Tso;Yuhui Deng;Weijia Jia","doi":"10.1109/TC.2025.3603730","DOIUrl":"https://doi.org/10.1109/TC.2025.3603730","url":null,"abstract":"The new paradigm of In-network computing (INC) permits service computation to be executed within network paths, rather than solely on dedicated servers. Although the programmable data plane has showcased notable performance advantages for INC application deployments, its effectiveness is constrained by resource limitations, potentially impeding the expressiveness and scalability of these deployments. Conversely, delegating computational tasks to the control plane, supported by general-purpose servers with abundant resources, offers increased flexibility. Nonetheless, this strategy compromises efficiency to a considerable extent, particularly when the system operates under heavy load. To simultaneously exploit the efficiency of the data plane and the flexibility of the control plane, we propose <i>Carlo</i>, a cross-plane collaborative optimization framework to support the network-wide deployment of multiple INC applications across both the control and data planes. <i>Carlo</i> first analyzes resource requirements of various INC applications across different planes. It then establishes mathematical models for cross-plane resource allocation and automatically generates solutions using proposed algorithms. We have implemented the prototype of <i>Carlo</i> on Intel Tofino ASIC switches and DPDK. 
Experimental results demonstrate that <i>Carlo</i> can effectively trade off between computation time and deployment performance while avoiding performance degradation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3805-3817"},"PeriodicalIF":3.8,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Encrypted Deduplication Based on Location-Hiding Secret Sharing of Data Keys","authors":"Guanxiong Ha;Yuchen Chen;Chunfu Jia;Keyan Chen;Rongxi Wang;Qiaowen Jia","doi":"10.1109/TC.2025.3603710","DOIUrl":"https://doi.org/10.1109/TC.2025.3603710","url":null,"abstract":"Encrypted deduplication is attractive because it can provide high storage efficiency while protecting data privacy. Most existing schemes achieve encrypted deduplication against brute-force attacks (BFAs) based on server-aided encryption. Unfortunately, the centralized key server in server-aided encryption can potentially become a single point of failure. To this end, distributed server-aided encryption is presented, which splits a system-level master key into multiple shares and distributes them across several key servers. However, it is hard to improve security and scalability with this method simultaneously. This paper presents a secure and scalable encrypted deduplication scheme ScalaDep. ScalaDep achieves a new design paradigm centered on location-hiding secret sharing of data keys. As the number of deployed key servers increases, the attack cost of adversaries increases while the number of requests handled by each key server decreases, enhancing both scalability and security. Furthermore, we propose a two-phase duplicate detection method for our paradigm, which utilizes short hashes and key identifiers to achieve secure duplicate detection against BFAs. Additionally, based on the allreduce algorithm, ScalaDep enables all key servers to collaboratively record the number of client requests and resist online BFAs by enforcing rate limiting. 
Security analysis and performance evaluation demonstrate the security and efficiency of ScalaDep.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3710-3721"},"PeriodicalIF":3.8,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Caravan: Incentive-Driven Account Migration via Transaction Aggregation in Sharded Blockchain","authors":"Yu Tao;Shouchen Zhou;Lu Zhou;Zhe Liu","doi":"10.1109/TC.2025.3603672","DOIUrl":"https://doi.org/10.1109/TC.2025.3603672","url":null,"abstract":"Blockchain sharding is a promising solution for scalability but struggles to reach the expected performance due to the high ratio of cross-shard transactions. Account migration has emerged as a critical approach to optimizing shard performance. However, existing migration solutions suffer from inefficient handling of queued withdrawal transactions from a migrating account and an inadequate priority mechanism for migration transactions, resulting in prolonged transaction makespan and reduced system throughput. This paper proposes Caravan, a novel blockchain sharding system for optimizing account migration. First, Caravan proposes a transaction aggregation-based migration scheme to efficiently handle withdrawal congestion post-migration. It incorporates a multi-level Merkle tree and cross-shard synchronization protocol to ensure cross-shard security. Second, Caravan presents an economic incentive-driven priority mechanism that motivates miners to perform transaction aggregation and prioritize migration transactions by increasing the associated revenue. Furthermore, its gas recycling strategy enables users to finance migration costs without awareness or extra expenses. Finally, we develop the Caravan prototype, deploy it on Alibaba Cloud, and experiment with real Ethereum transactions. The results show that compared to the state-of-the-art account migration schemes, Caravan significantly mitigates the transaction surge caused by migration, achieving up to a 3.2× throughput improvement and a 65% reduction in transaction confirmation latency. 
Moreover, users share considerable migration costs without extra expenses, significantly reducing system costs.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3609-3622"},"PeriodicalIF":3.8,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thermal Elasticity-Aware Host Resource Provision for Carbon Efficiency on Virtualized Servers","authors":"Da Zhang;Haojun Xia;Xiaotong Wang;Yanchang Feng;Haohao Liu;Bibo Tu","doi":"10.1109/TC.2025.3603698","DOIUrl":"https://doi.org/10.1109/TC.2025.3603698","url":null,"abstract":"Servers in modern data centers face increasing challenges from energy inefficiency and thermal-related outages, both of which significantly contribute to their overall carbon footprint. These challenges often arise from a lack of coordination between computational resource provisioning and thermal management capabilities. This paper introduces the concept of thermal elasticity, a system’s intrinsic ability to absorb thermal stress without requiring additional cooling, as a guiding metric for sustainable thermal management. Building on this, we propose a collaborative in-band and out-of-band resource provisioning framework that adjusts CPU allocation based on real-time thermal feedback. By leveraging a machine learning model and runtime monitoring, the framework dynamically provisions CPU clusters to virtual machines co-located on the same host. Evaluations on real servers with multiple workloads show that our method reduces peak power consumption by 5.2% to 9.6% and lowers peak temperatures by 4<inline-formula><tex-math>${^{\\boldsymbol{\\circ}}}$</tex-math></inline-formula>C to 6.5<inline-formula><tex-math>${^{\\boldsymbol{\\circ}}}$</tex-math></inline-formula>C (up to 40<inline-formula><tex-math>${^{\\boldsymbol{\\circ}}}$</tex-math></inline-formula>C in extreme cases). Carbon emissions are also reduced by 7% to 37% during SPEC benchmark runs. 
These results highlight the framework’s potential to alleviate stress on power and cooling infrastructure, thereby enhancing energy efficiency, reducing carbon footprint, and improving service continuity during thermal challenges.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3682-3695"},"PeriodicalIF":3.8,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EABE-PUFPH: Efficient Attribute-Based Encryption With Reliable Policy Updating Under Full Policy Hiding","authors":"Chenghao Gu;Jiguo Li;Yichen Zhang;Yang Lu;Jian Shen","doi":"10.1109/TC.2025.3603717","DOIUrl":"https://doi.org/10.1109/TC.2025.3603717","url":null,"abstract":"Ciphertext-policy attribute-based encryption (CP-ABE) has garnered significant attention for enabling fine-grained access control over encrypted data in cloud environments. However, in traditional CP-ABE schemes, access policies are transmitted in plaintext, which can lead to sensitive information leakage. To mitigate this risk, hiding access policies has become essential. When access policies are fully hidden, realizing efficient and accurate decryption and dynamic policy updating has become an urgent challenge. To tackle these challenges, we present an efficient attribute-based encryption with reliable policy updating under full policy hiding (EABE-PUFPH) scheme, which effectively integrates full policy hiding with policy updating capabilities. Furthermore, we conduct a rigorous security analysis and performance evaluation of the EABE-PUFPH scheme. Evaluation results show that the EABE-PUFPH scheme achieves fully hidden access policies without affecting decryption efficiency, and its efficiency surpasses other similar schemes that achieve full policy hiding.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3750-3762"},"PeriodicalIF":3.8,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}