{"title":"IOPS: A Unified SpMM Accelerator Based on Inner-Outer-Hybrid Product","authors":"Wenhao Sun;Wendi Sun;Song Chen;Yi Kang","doi":"10.1109/TC.2025.3558013","DOIUrl":"https://doi.org/10.1109/TC.2025.3558013","url":null,"abstract":"Sparse matrix multiplication (SpMM) is widely applied in numerous domains, such as graph processing and machine learning. However, inner product (IP) induces redundant zero-element computing for mismatched nonzero operands, while outer product (OP) lacks input reuse across processing elements (PEs). Besides, current accelerators focus on either sparse-sparse matrix multiplication (SSMM) or sparse-dense matrix multiplication (SDMM), rarely performing efficiently for both. To compensate for the shortcomings of IP and OP, we propose an inner-outer-hybrid product (IOHP) method, which reuses the input matrix among PEs with IP and removes zero-element calculations with OP in each PE. Based on IOHP, we co-design an accelerator with a unified computing flow, called IOPS, to efficiently process both SSMM and SDMM. It divides the SpMM into three stages: encoding, partial sum (psum) calculation, and address mapping, where the input matrices can be reused among PEs after encoding (IP) and zero elements can be skipped in the latter two stages (OP). Furthermore, an adaptive partition strategy is proposed to tile the input matrices based on their sparsity ratios, effectively utilizing the on-chip storage and reducing DRAM access. 
Compared with SpArch, we achieve <inline-formula><tex-math>$1.2\\boldsymbol{\\times}$</tex-math></inline-formula>~<inline-formula><tex-math>$4.3\\boldsymbol{\\times}$</tex-math></inline-formula> performance and <inline-formula><tex-math>$1.3\\boldsymbol{\\times}$</tex-math></inline-formula>~<inline-formula><tex-math>$4.8\\boldsymbol{\\times}$</tex-math></inline-formula> energy efficiency, with <inline-formula><tex-math>$1.4\\boldsymbol{\\times}$</tex-math></inline-formula>~<inline-formula><tex-math>$2.1\\boldsymbol{\\times}$</tex-math></inline-formula> DRAM access saving.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2210-2222"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
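The trade-off between inner product and outer product described in the IOPS abstract can be illustrated with a minimal software sketch. This is not the IOPS hardware dataflow: the dictionary-based sparse layout and the helper names (`to_rows`, `to_cols`, `spmm_inner`, `spmm_outer`) are assumptions made only for illustration. The inner-product version counts index lookups that find no matching nonzero operand, while the outer-product version only ever touches nonzero pairs.

```python
from collections import defaultdict

def to_rows(dense):
    """Row-major sparse view: {row: {col: value}}, nonzeros only."""
    return {i: {j: v for j, v in enumerate(row) if v != 0}
            for i, row in enumerate(dense)}

def to_cols(dense):
    """Column-major sparse view: {col: {row: value}}, nonzeros only."""
    cols = defaultdict(dict)
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                cols[j][i] = v
    return dict(cols)

def spmm_inner(a, b):
    """Inner product: one dot product per output element C[i][j].

    Returns the sparse result and the number of mismatched lookups,
    i.e. nonzeros of A whose partner in B turned out to be zero.
    """
    a_rows, b_cols = to_rows(a), to_cols(b)
    c, mismatches = {}, 0
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = 0
            for k, a_val in a_rows.get(i, {}).items():
                b_val = b_cols.get(j, {}).get(k)
                if b_val is None:
                    mismatches += 1  # work spent on a zero operand
                else:
                    acc += a_val * b_val
            if acc != 0:
                c[(i, j)] = acc
    return c, mismatches

def spmm_outer(a, b):
    """Outer product: column k of A times row k of B, nonzeros only."""
    a_cols, b_rows = to_cols(a), to_rows(b)
    c = defaultdict(int)
    for k, a_col in a_cols.items():
        for i, a_val in a_col.items():
            for j, b_val in b_rows.get(k, {}).items():
                c[(i, j)] += a_val * b_val  # partial sum (psum)
    return {key: v for key, v in c.items() if v != 0}

A = [[1, 0, 2],
     [0, 3, 0]]
B = [[0, 4],
     [5, 0],
     [6, 0]]
c_ip, wasted = spmm_inner(A, B)
c_op = spmm_outer(A, B)
```

Both functions produce the same product; the difference is that the inner-product loop spends lookups on operand pairs that do not match, whereas the outer-product loop generates partial sums that must then be merged by output address.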
Tu Dinh Ngoc;Boris Teabe;Georges Da Costa;Daniel Hagimont
{"title":"Virtual NVMe-Based Storage Function Framework With Fast I/O Request State Management","authors":"Tu Dinh Ngoc;Boris Teabe;Georges Da Costa;Daniel Hagimont","doi":"10.1109/TC.2025.3558033","DOIUrl":"https://doi.org/10.1109/TC.2025.3558033","url":null,"abstract":"Current cloud environments provide numerous storage functions to virtual machines such as disk encryption, snapshotting, compression and so on. These functions are implemented using software stacks inside the hypervisor's kernel, emulator, or as a userspace polling driver like SPDK. However, each stack brings its own limitations: Linux's kernel I/O stack cannot easily integrate proprietary technologies such as Intel SGX, while SPDK requires significant changes in software development and tooling yet lacks the rich feature set of existing solutions like Linux LVM. To remedy these limitations, we introduce NVMetro, a high-performance storage framework for virtual machines based on the NVMe protocol. NVMetro provides multiple I/O paths that can be dynamically combined to fit the needs of each storage function. It links these paths together with an eBPF-based I/O router/classifier framework, as well as a userspace software stack for out-of-kernel I/O processing. We implemented three different storage functions with NVMetro and evaluated them under various workloads. 
Our results show that NVMetro approaches the performance of kernel-bypass solutions like SPDK while maintaining the compatibility and ease of use of in-kernel storage stacks.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2253-2266"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhihong Deng;Chunming Tang;Taotao Li;Zhikang Zeng;Parhat Abla;Debiao He
{"title":"SFPoW: Constructing Secure and Flexible Proof-of-Work Sidechains for Cross-Chain Interoperability With Wrapped Assets","authors":"Zhihong Deng;Chunming Tang;Taotao Li;Zhikang Zeng;Parhat Abla;Debiao He","doi":"10.1109/TC.2025.3558040","DOIUrl":"https://doi.org/10.1109/TC.2025.3558040","url":null,"abstract":"Sidechain techniques enhance blockchain scalability and interoperability, enabling decentralized exchanges and cross-chain operations for wrapped digital assets. However, existing PoW sidechains face challenges, including centralization, high communication costs, and incomplete PoW-based security proofs. This paper introduces <inline-formula><tex-math>$\\mathtt{SFPoW}$</tex-math></inline-formula>, a <u>S</u>ecure and <u>F</u>lexible <u>Proof-of-Work</u> sidechain construction for cross-chain interoperability with wrapped assets. <inline-formula><tex-math>$\\mathtt{SFPoW}$</tex-math></inline-formula> facilitates decentralized asset transfers and token swaps across nearly all PoW-based cryptocurrencies without requiring soft or hard forks or fixed PoW targets. It establishes a decentralized, fair validation set within the sidechain, improving adaptability and reducing competition and confirmation periods. A pluggable cross-chain proof generation method is proposed, effectively filtering lazy nodes, incentivizing active participation, and minimizing on-chain verification overhead to a proof size of 200.5 bytes. Through mining behavior analysis and cryptographic reductions, <inline-formula><tex-math>$\\mathtt{SFPoW}$</tex-math></inline-formula> is shown to satisfy weak and strong atomicity. 
Experiments on the Ronin blockchain and an Ethereum testnet demonstrate a round-trip cost of $6.55 and latency of 372.1–382.6 seconds, confirming its practicality and efficiency.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2278-2292"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francesco Antognazza;Alessandro Barenghi;Gerardo Pelosi
{"title":"An Efficient and Unified RTL Accelerator Design for HQC-128, HQC-192, and HQC-256","authors":"Francesco Antognazza;Alessandro Barenghi;Gerardo Pelosi","doi":"10.1109/TC.2025.3558044","DOIUrl":"https://doi.org/10.1109/TC.2025.3558044","url":null,"abstract":"In the Post-Quantum Cryptography (PQC) standardization process held by the National Institute of Standards and Technology (NIST), the final round of evaluation of the asymmetric cryptographic schemes <monospace>Classic McEliece</monospace>, <monospace>BIKE</monospace> and <monospace>HQC</monospace> will elect the alternative Key Encapsulation Mechanism (KEM) to the FIPS <inline-formula><tex-math>$203$</tex-math></inline-formula> standard <monospace>CRYSTALS-Kyber</monospace>. In this work we present two configurations of an RTL hardware design of the <monospace>HQC</monospace> candidate: one optimized for devices working exclusively with client-server style protocols, and one unified accelerator compatible with all KEM operations, i.e., Key Generation, Encapsulation, and Decapsulation. Our designs are compatible with all the parameter sets defined by the <monospace>HQC</monospace> specification, providing security margins equivalent to those of <monospace>AES-128</monospace>, <monospace>AES-192</monospace>, and <monospace>AES-256</monospace> based on a selection made at runtime. 
We provide an extensive comparison with the current state-of-the-art RTL hardware designs for Artix-<inline-formula><tex-math>$7$</tex-math></inline-formula> FPGAs of the schemes in the PQC process, introducing a new metric to evaluate area utilization, historically a challenging task for such devices made of heterogeneous resources. We determine that <monospace>HQC</monospace> has by far the best figures among the code-based candidates in terms of latency, occupied area, and efficiency, and is even comparable with the lattice-based <monospace>CRYSTALS-Kyber</monospace> when using the parameters with the lowest security margin.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2306-2320"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RSQC: Recursive Sparse QUBO Construction for Quantum Annealing Machines","authors":"Jianwen Luo;Yuhao Shu;Yajun Ha","doi":"10.1109/TC.2025.3557965","DOIUrl":"https://doi.org/10.1109/TC.2025.3557965","url":null,"abstract":"Quantum annealing algorithms have shown commercial potential in solving some instances of combinatorial optimization problems. However, existing mapping for general optimization problems into a compatible format for quantum annealing yields dense topology and complicated weighting, which limits the size of solvable problems on practical quantum annealing machines. To address this issue, we propose a novel mapping framework with three new techniques. First, to address the issue from general constraints, we introduce a recursive methodology to map constraints into interconnected Boolean gates and small algebraic cliques, which yields sparse topology and hardware-friendly biases/interactions. Second, to better address frequently-used constraints, we introduce a specialized penalty set based on this methodology with detailed optimizations. Third, to address the issue from the objective, we reformulate the complicated objective into a single multi-bit variable and apply binary search to its range, which turns each search step into a constraint-only problem. 
Compared with the state-of-the-art, experimental results and analysis over an exhaustive scan for operand bit-widths from 1 to 64 show that: (1) the growth order of the number of physical qubits with regard to operand bit-widths is reduced from <inline-formula><tex-math>$O(w^{2})$</tex-math></inline-formula> to <inline-formula><tex-math>$O(w)$</tex-math></inline-formula>, while the number is reduced by a factor of <inline-formula><tex-math>$10^{-1}$</tex-math></inline-formula> in the best case; (2) the dynamic range of biases/interactions is reduced from <inline-formula><tex-math>$O(2^{2w})$</tex-math></inline-formula> to <inline-formula><tex-math>$\\lt 32$</tex-math></inline-formula>; (3) the graph minor embedding run time is reduced by a factor of <inline-formula><tex-math>$10^{-2}$</tex-math></inline-formula> in the best case. For the same optimization problem, our framework reduces the requirement of the number of physical qubits and machine precision, and shortens the time from problem to machine.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"2114-2128"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143929785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
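The third technique in the RSQC abstract, binary search over the objective's range so that each step is a constraint-only problem, can be sketched generically as follows. This is only an illustration under stated assumptions: the toy selection problem, the function names (`feasible`, `minimize_by_binary_search`), and the brute-force feasibility check standing in for the quantum annealer are all hypothetical and do not come from the paper. Costs are assumed to be non-negative integers.

```python
from itertools import product

def feasible(weights, costs, min_weight, bound):
    """Constraint-only subproblem.

    Is there x in {0,1}^n with total weight >= min_weight and
    total cost <= bound? Brute force stands in for the annealer.
    """
    for x in product((0, 1), repeat=len(weights)):
        total_w = sum(w * xi for w, xi in zip(weights, x))
        total_c = sum(c * xi for c, xi in zip(costs, x))
        if total_w >= min_weight and total_c <= bound:
            return True
    return False

def minimize_by_binary_search(weights, costs, min_weight):
    """Find the minimum total cost by bisecting the objective's range.

    Feasibility is monotone in the bound (a looser bound never makes
    a feasible instance infeasible), so bisection is valid.
    Returns None when no assignment satisfies the weight constraint.
    """
    lo, hi = 0, sum(costs)
    if not feasible(weights, costs, min_weight, hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(weights, costs, min_weight, mid):
            hi = mid      # a solution with cost <= mid exists
        else:
            lo = mid + 1  # every solution costs more than mid
    return lo

best = minimize_by_binary_search([3, 4, 5], [2, 3, 4], min_weight=7)
```

The number of feasibility queries grows with the bit-width of the objective range rather than with the range itself, which is why the objective can be treated as a single multi-bit variable.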
{"title":"Pike: Two-Phase BFT With Linearity and Flexible View Change","authors":"Xiao Sui;Qichang Liu;Sisi Duan;Haibin Zhang","doi":"10.1109/TC.2025.3573597","DOIUrl":"https://doi.org/10.1109/TC.2025.3573597","url":null,"abstract":"As the first Byzantine fault-tolerant (BFT) protocol with linear communication complexity, HotStuff (PODC 2019) has received significant attention. HotStuff has three round-trips for both normal case operations and view change protocols. Follow-up studies attempt to reduce the number of phases for HotStuff. However, most studies give up on one thing in return for another. This paper extends our previous work Marlin (DSN 2022) to Pike, another BFT protocol with two phases and linear communication complexity. Both Pike and Marlin use the same cryptographic tools as in HotStuff and introduce no additional assumptions. Marlin has a more efficient view change (i.e., leader election) protocol but a more complicated data structure. Pike further simplifies the data structure at the cost of longer view changes in extreme cases. We implement Pike, Marlin, HotStuff, and HotStuff-2, showing that both Pike and Marlin outperform HotStuff in normal case operations.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2772-2784"},"PeriodicalIF":3.6,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuankai Xu;Yinchen Ni;Tiancheng He;Ruiqi Sun;Yier Jin;An Zou
{"title":"Real-Time Scheduling and Analysis of Fixed-Priority Tasks on a Basic Heterogeneous Architecture With Multiple CPUs and Many PEs","authors":"Yuankai Xu;Yinchen Ni;Tiancheng He;Ruiqi Sun;Yier Jin;An Zou","doi":"10.1109/TC.2025.3573602","DOIUrl":"https://doi.org/10.1109/TC.2025.3573602","url":null,"abstract":"While accelerator-based heterogeneous architectures have gained traction in accelerating AI tasks, effectively managing them under stringent timing constraints remains a challenge. Although many scheduling and response time analysis approaches have been proposed for multi-core or heterogeneous multi-core (i.e., big.LITTLE cores) processors, applying them directly to accelerator-based heterogeneous architectures with multiple CPUs and numerous processing elements (PEs) often results in significant pessimism. This paper introduces real-time scheduling and comprehensive response time analysis, from a unit-level micro view to a job-level macro view, for general accelerator-based heterogeneous architectures, greatly enhancing schedulability and utilization rates. We begin by establishing a general task execution pattern on heterogeneous architectures that integrate multiple CPU cores and various PEs. Subsequently, we present a real-time scheduling strategy and corresponding response time analysis based on this task execution pattern from micro to macro views. In extensive experiments conducted on GEMM and AI workloads, our proposed scheduling and response time analysis significantly outperform state-of-the-art scheduling algorithms, improving schedulability by 10.3% to 52.9%. Furthermore, experiments on NVIDIA GPU systems indicate a potential pessimism reduction of up to 30.7%. 
As we target general heterogeneous architectures, our approach can be readily applied to off-the-shelf accelerator-based heterogeneous computing systems, ensuring adherence to deadlines and enhancing schedulability.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2785-2798"},"PeriodicalIF":3.6,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yilan Zhu;Honghui You;Wei Zhang;Jiming Xu;Qian Lou;Shoumeng Yan;Lei Ju
{"title":"DAHE: Parameter-Adaptive and Memory Efficient FPGA Acceleration of Homomorphic Encryption","authors":"Yilan Zhu;Honghui You;Wei Zhang;Jiming Xu;Qian Lou;Shoumeng Yan;Lei Ju","doi":"10.1109/TC.2025.3569159","DOIUrl":"https://doi.org/10.1109/TC.2025.3569159","url":null,"abstract":"While homomorphic encryption (HE) has been well-recognized as a promising data privacy protection technique, there are many challenges to the real-world deployment of HE applications. In this work, we propose a design flow for parameter-adaptive and memory-efficient FPGA acceleration of homomorphic encryption. In the framework, we explore the correlations between HE parameter selection to meet various design objectives and the huge design space due to underlying FPGA hardware resource allocation. Particularly, we demonstrate that adaptive management of the FPGA memory hierarchy is crucial to supporting diverse cryptosystem parameter selection for application-level security, accuracy, and performance requirements. We propose a resource-efficient and flexible micro-architectural design for HE operations, where data access patterns in various pipeline execution stages are optimized for high memory bandwidth utilization. Furthermore, a memory-aware performance model is built for automatic design space exploration for cryptosystem parameter selection and hardware resource provisioning. Experimental results show 1.50X and 1.16X speedup for the NTT and Rotation operations w.r.t. the state-of-the-art FPGA implementation. 
Meanwhile, the proposed framework generates flexible and high-performance accelerator code for real HE application kernels with different cryptosystem parameters on a wide range of FPGA devices.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2687-2701"},"PeriodicalIF":3.6,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ji Li;Qiang He;Xingwei Wang;Ammar Hawbani;Keping Yu;Yuanguo Bi;Liang Zhao
{"title":"UAV-Assisted Microservice Mobile Edge Computing Architecture: Addressing Post-Disaster Emergency Medical Rescue","authors":"Ji Li;Qiang He;Xingwei Wang;Ammar Hawbani;Keping Yu;Yuanguo Bi;Liang Zhao","doi":"10.1109/TC.2025.3566913","DOIUrl":"https://doi.org/10.1109/TC.2025.3566913","url":null,"abstract":"In post-disaster emergency medical rescue operations, rapidly establishing an adaptive and flexible edge computing (EC) network, balancing data offloading with energy consumption, and ensuring the stable operation of the network have become urgent priorities. To address these challenges, we propose an unmanned aerial vehicle (UAV)-assisted microservice mobile edge computing (MEC) architecture. The architecture can be rapidly deployed to provide temporary network coverage and EC services in disaster-stricken areas. A transformer-based resource management (TBRM) approach is used to optimize data offloading efficiency and reduce energy consumption, thereby maximizing the service time of the architecture. To enhance the security and reliability of the architecture, four microservices are designed to manage the full UAV lifecycle, and UAV identity authentication is implemented through dual digital signature certificates. 
Large-scale simulation experiments have demonstrated the effectiveness of the architecture in complex rescue scenarios, providing strong technical support for post-disaster medical rescue efforts.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2635-2648"},"PeriodicalIF":3.6,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight and Holistic-Scalable Serverless Secure Container Runtime for High-Density Deployment and High-Concurrency Startup","authors":"Zijun Li;Chenyang Wu;Chuhao Xu;Quan Chen;Shuo Quan;Bin Zha;Qiang Wang;Weidong Han;Jie Wu;Minyi Guo","doi":"10.1109/TC.2025.3566912","DOIUrl":"https://doi.org/10.1109/TC.2025.3566912","url":null,"abstract":"The secure container, which hosts a single container in a micro virtual machine (microVM), is now used in serverless computing because the containers are isolated through the microVMs. High-density container deployment and high-concurrency container startup are in high demand to improve both resource utilization and user experience, as user functions are fine-grained in serverless platforms. Our investigation shows that the entire software stack, comprising the cgroups in the host operating system, the guest operating system, and the container <i>rootfs</i> for the function workload, together results in low deployment density and slow startup performance at high concurrency. We propose a lightweight and holistic-scalable secure container runtime, named <b>RunD-V</b>, to resolve the above problems in serverless computing. RunD-V proposes a guest-to-host runtime template for microVM scaling-out, and a CR-bind feature in the guest kernel for microVM scaling-up. Using the guest-to-host runtime template, over 200 secure containers can be launched within 1 s on a node equipped with 104 vCPUs. It also enables more than 2,500 secure containers to be deployed on a node with 384 GB of memory. 
The vertical scaling mechanism CR-bind further enhances both startup concurrency and deployment density.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2621-2634"},"PeriodicalIF":3.6,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}