{"title":"Efficient and Fast High-Performance Library Generation for Deep Learning Accelerators","authors":"Jun Bi;Yuanbo Wen;Xiaqing Li;Yongwei Zhao;Yuxuan Guo;Enshuai Zhou;Xing Hu;Zidong Du;Ling Li;Huaping Chen;Tianshi Chen;Qi Guo","doi":"10.1109/TC.2024.3475575","DOIUrl":"https://doi.org/10.1109/TC.2024.3475575","url":null,"abstract":"The widespread adoption of deep learning accelerators (DLAs) underscores their pivotal role in improving the performance and energy efficiency of neural networks. To fully leverage the capabilities of these accelerators, exploration-based library generation approaches have been widely used to substantially reduce software development overhead. However, these approaches have been challenged by issues related to sub-optimal optimization results and excessive optimization overheads. In this paper, we propose Heron to generate high-performance libraries of DLAs in an efficient and fast way. The key is automatically enforcing massive constraints through the entire program generation process and guiding the exploration with an accurate pre-trained cost model. Heron represents the search space as a constrained satisfaction problem (CSP) and explores the space via evolving the CSPs. Thus, the sophisticated constraints of the search space are strictly preserved during the entire exploration process. The exploration algorithm has the flexibility to engage in space exploration using either online-trained models or pre-trained models. Experimental results demonstrate that Heron averagely achieves 2.71$\\times$ speedup over three state-of-the-art automatic generation approaches. Also, compared to vendor-provided hand-tuned libraries, Heron achieves a 2.00$\\times$ speedup on average. 
When employing a pre-trained model, Heron achieves 11.6$\\times$ compilation time speedup, incurring a minor impact on execution time.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"155-169"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Service Function Chain Placement Over Heterogeneous Devices in Deviceless Edge Computing Environments","authors":"Yaodong Huang;Tingting Yao;Zelin Lin;Xiaojun Shang;Yukun Yuan;Laizhong Cui;Yuanyuan Yang","doi":"10.1109/TC.2024.3475590","DOIUrl":"https://doi.org/10.1109/TC.2024.3475590","url":null,"abstract":"Heterogeneous devices in edge computing bring challenges as well as opportunities for edge computing to utilize powerful and heterogeneous hardware for a variety of complex tasks. In this paper, we propose a service function chain placement strategy considering the heterogeneity of devices in deviceless edge computing environments. The service function chain system utilizes lightweight virtualization technologies to manage resources, considering the heterogeneity of devices to support various complex tasks, and offer low latency services to user requests. We propose an optimal service function chain placement problem minimizing the service delay and formulate it into a quasi-convex problem. We implement different edge applications that can be served by function chains and conduct extensive experiments over real heterogeneous edge devices. 
Results from the experiments and simulations show that our proposed service function chain scheme is applicable in edge environments, and perform well over services latency, resource utilization as well as the power consumption of edge devices.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"222-236"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Control Flow in Spatial Architectures: Insights Into Control Flow Plane Design","authors":"Jinyi Deng;Xinru Tang;Jiahao Zhang;Yuxuan Li;Linyun Zhang;Fengbin Tu;Shaojun Wei;Yang Hu;Shouyi Yin","doi":"10.1109/TC.2024.3475582","DOIUrl":"https://doi.org/10.1109/TC.2024.3475582","url":null,"abstract":"Spatial architecture is a high-performance paradigm that employs control flow graphs and data flow graphs as computation model, and producer/consumer models as execution model. However, existing spatial architectures struggle with control flow handling challenges. Upon thoroughly characterizing their PE execution models, we observe that they lack autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability. This degrades its performance in intensive control programs. To tackle the existing control flow handling challenges, Marionette, a spatial architecture with an explicit-designed control flow plane, is proposed. We elaborately develop a full stack of Marionette architecture, from ISA, compiler, simulator to RTL. Marionette's flexible Control Flow Plane enables autonomous, peer-to-peer, and temporally loosely-coupled control flow management. Its Proactive PE Configuration ensures computation-overlapped and timely configuration to promote Branch Divergence handling capability. Besides, Marionette's Agile PE Assignment improves pipeline performance of imperfect loops. 
Compared to state-of-the-art spatial architectures, the experimental results demonstrate that Marionette outperforms Softbrain, TIA, REVEL, and RipTide by geomean 2.88$\\times$, 3.38$\\times$, 1.55$\\times$, and 2.66$\\times$ in a variety of challenging intensive control programs.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"185-199"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Modular Multiplication Using Variable Length Algorithms","authors":"Shahab Mirzaei-Teshnizi;Parviz Keshavarzi","doi":"10.1109/TC.2024.3475574","DOIUrl":"https://doi.org/10.1109/TC.2024.3475574","url":null,"abstract":"This paper presents two improved modular multiplication algorithms: variable length Interleaved modular multiplication (VLIM) algorithm and parallel modular multiplication (P_MM) method using variable length algorithms to achieve high throughput rates. The new Interleaved modular multiplication algorithm applies the zero counting and partitioning algorithm to a multiplier’s non-adjacent form (NAF). It divides this input into sections with variable-radix. The sections include a digit of zero sequences and a non-zero digit (-1 or 1) in the most valuable place. Therefore, in addition to reducing the number of required clock pulses, high-radix partial multiplication $\\mathbf{X}^{(\\mathbf{i})}\\cdot\\mathbf{Y}$ is simplified and performed as a binary addition or subtraction operation, and multiplication operations for consecutive zero bits are executed in one clock cycle instead of several clock cycles. The proposed parallel modular multiplication algorithm divides the multiplier into two parts. It utilizes (VLIM) and variable length Montgomery modular multiplication (VLM3) methods to compute the modular multiplication for the upper and lower portions in parallel, according to the proximity of their multiplication time. 
The implementation results on a Xilinx Virtex-7 FPGA show that the parallel modular multiplication computes a 2048-bit modular multiplication in 0.903 µs, with a maximum clock frequency of 387 MHz and area × time per bit value equal to 9.14.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"143-154"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ClusPar: A Game-Theoretic Approach for Efficient and Scalable Streaming Edge Partitioning","authors":"Zezhong Ding;Deyu Kong;Zhuoxu Zhang;Xike Xie;Jianliang Xu","doi":"10.1109/TC.2024.3475568","DOIUrl":"https://doi.org/10.1109/TC.2024.3475568","url":null,"abstract":"Streaming edge partitioning plays a crucial role in the distributed processing of large-scale web graphs, such as pagerank. The quality of partitioning is of utmost importance and directly affects the runtime cost of distributed graph processing. However, streaming graph clustering, a key component of mainstream streaming edge partitioning, is vertex-centric. This incurs a mismatch with the edge-centric partitioning strategy, necessitating additional post-processing and several graph traversals to transition from vertex-centric clusters to edge-centric partitions. This transition not only adds extra runtime overhead but also risks a decline in partitioning quality. In this paper, we propose a novel algorithm, called ClusPar, to address the problem of streaming edge partitioning. The ClusPar framework consists of two steps, streaming edge clustering and edge cluster partitioning. Different from prior studies, the first step traverses the input graph in a single pass to generate edge-centric clusters, while the second step applies game theory over these edge-centric clusters to produce partitions. 
Extensive experiments show that ClusPar outperforms the state-of-the-art streaming edge partitioning methods in terms of the partitioning quality, efficiency, and scalability.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"116-130"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated Learning Based DDoS Attacks Detection in Large Scale Software-Defined Network","authors":"Yannis Steve Nsuloun Fotse;Vianney Kengne Tchendji;Mthulisi Velempini","doi":"10.1109/TC.2024.3474180","DOIUrl":"https://doi.org/10.1109/TC.2024.3474180","url":null,"abstract":"Software-Defined Networking (SDN) is an innovative concept that segments the network into three planes: a control plane comprising of one or multiple controllers; a data plane responsible for data transmission; and an application plane which enables the reconfiguration of network functionalities. Nevertheless, this approach has exposed the controller as a prime target for malicious elements to attack it, such as Distributed Denial of Service (DDoS) attacks. Current DDoS defense schemes often increased the controller load and resource consumption. These schemes are typically tailored for single-controller architectures, a significant limitation when considering the scalability requirements of large-scale SDN. To address these limitations, we introduce an efficient Federated Learning approach, named “FedLAD,” designed to counter DDoS attacks in SDN-based large-scale networks, particularly in multi-controller architectures. Federated learning is a decentralized approach to machine learning where models are trained across multiple devices as controllers store local data samples, without exchanging them. The evaluation of the proposed scheme's performance, using InSDN, CICDDoS2019, and CICDoS2017 datasets, shows an accuracy exceeding 98%, a significant improvement compared to related works. Furthermore, the evaluation of the FedLAD protocol with real-time traffic in an SDN context demonstrates its ability to detect DDoS attacks with high accuracy and minimal resource consumption. 
To the best of our knowledge, this work introduces a new technique in applying FL for DDoS attack detection in large-scale SDN.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"101-115"},"PeriodicalIF":3.6,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10705345","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Olive-Like Networking: A Uniformity Driven Robust Topology Generation Scheme for IoT System","authors":"Tie Qiu;Jingchen Sun;Ning Chen;Songwei Zhang;Weisheng Si;Xingwei Wang","doi":"10.1109/TC.2024.3465934","DOIUrl":"https://doi.org/10.1109/TC.2024.3465934","url":null,"abstract":"With the scale of the Internet of Things (IoT) system growing constantly, node failures frequently occur due to device malfunctions or cyberattacks. Existing robust network generation methods utilize heuristic algorithms or neural network approaches to optimize the initial topology. These methods do not explore the core of topology robustness, namely how edges are allocated to each node in the topology. As a result, these methods use massive iterative processes to optimize the initial topology, leading to substantial time overhead when the scale of the topology is large. We examine various robust networks and observe that uniform degree distribution is the core of topology robustness. Consequently, we propose a novel UNIformity driven robusT topologY generation scheme (UNITY) for IoT systems to prevent the node degree from becoming excessively high or low, thereby balancing node degrees. Comprehensive experimental results demonstrate that networks generated with UNITY have an “olive-like” topology consisting of a substantial number of medium-degree nodes and possess strong robustness against both random node failures and targeted attacks. 
This promising result indicates that the UNITY makes a significant advancement in designing robust IoT systems.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"86-100"},"PeriodicalIF":3.6,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FiDRL: Flexible Invocation-Based Deep Reinforcement Learning for DVFS Scheduling in Embedded Systems","authors":"Jingjin Li;Weixiong Jiang;Yuting He;Qingyu Yang;Anqi Gao;Yajun Ha;Ender Özcan;Ruibin Bai;Tianxiang Cui;Heng Yu","doi":"10.1109/TC.2024.3465933","DOIUrl":"https://doi.org/10.1109/TC.2024.3465933","url":null,"abstract":"Deep Reinforcement Learning (DRL)-based Dynamic Voltage Frequency Scaling (DVFS) has shown great promise for energy conservation in embedded systems. While many works were devoted to validating its efficacy or improving its performance, few discuss the feasibility of the DRL agent deployment for embedded computing. State-of-the-art approaches focus on the miniaturization of agents’ inferential networks, such as pruning and quantization, to minimize their energy and resource consumption. However, this spatial-based paradigm still proves inadequate for resource-stringent systems. In this paper, we address the feasibility from a temporal perspective, where FiDRL, a flexible invocation-based DRL model is proposed to judiciously invoke itself to minimize the overall system energy consumption, given that the DRL agent incurs non-negligible energy overhead during invocations. Our approach is three-fold: (1) FiDRL that extends DRL by incorporating the agent's invocation interval into the action space to achieve invocation flexibility; (2) a FiDRL-based DVFS approach for both inter- and intra-task scheduling that minimizes the overall execution energy consumption; and (3) a FiDRL-based DVFS platform design and an on/off-chip hybrid algorithm specialized for training the DRL agent for embedded systems. 
Experiment results show that FiDRL achieves 55.1% agent invocation cost reduction, under 23.3% overall energy reduction, compared to state-of-the-art approaches.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"71-85"},"PeriodicalIF":3.6,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Remora: A Low-Latency DAG-Based BFT Through Optimistic Paths","authors":"Xiaohai Dai;Wei Li;Guanxiong Wang;Jiang Xiao;Haoyang Chen;Shufei Li;Albert Y. Zomaya;Hai Jin","doi":"10.1109/TC.2024.3461309","DOIUrl":"https://doi.org/10.1109/TC.2024.3461309","url":null,"abstract":"Standing as a foundational element within blockchain systems, the Byzantine Fault Tolerant (BFT) consensus has garnered significant attention over the past decade. The introduction of a Directed Acyclic Graph (DAG) structure into BFT consensus design, termed DAG-based BFT, has emerged to bolster throughput. However, prevalent DAG-based protocols grapple with substantial latency issues, suffering from a latency gap compared to non-DAG protocols. For instance, leading-edge DAG-based protocols named GradedDAG and BullShark exhibit a good-case latency of 4 and 6 communication rounds, respectively. In contrast, the non-DAG protocol, exemplified by PBFT, attains a latency of 3 rounds in favorable conditions. To bridge this latency gap, we propose Remora, a novel DAG-based BFT protocol. Remora achieves a reduced latency of 3 rounds by incorporating optimistic paths. At its core, Remora endeavors to commit blocks through the optimistic path initially, facilitating low latency in favorable situations. Conversely, in unfavorable scenarios, Remora seamlessly transitions to a pessimistic path to ensure liveness. 
Various experiments validate Remora's feasibility and efficiency, highlighting its potential as a robust solution in the realm of BFT consensus protocols.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"57-70"},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10680428","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balanced Modular Addition for the Moduli Set $\\{2^{q},2^{q}\\mp 1,2^{2q}+1\\}$ via Moduli-($2^{q}\\mp\\sqrt{-1}$) Adders","authors":"Ghassem Jaberipur;Elham Rahman;Jeong-A Lee","doi":"10.1109/TC.2024.3461235","DOIUrl":"https://doi.org/10.1109/TC.2024.3461235","url":null,"abstract":"Moduli-set $\\tau=\\{2^{q},2^{q}\\pm 1\\}$ is often the base of choice for realization of digital computations via residue number systems. The optimum arithmetic performance in parallel residue channels, is generally achieved via equal bit-width residues (e.g., $q$ in $\\tau$) that usually leads to equal computation speed within all the residue channels. However, the commonly difficult and costly task of reverse conversion (RC) is often eased in the existence of conjugate moduli. For example, $2^{q}\\mp 1\\in\\tau$, lead to the efficient modulo-($2^{2q}-1$) addition, as the bulk of $\\tau$-RC, via the New-CRT reverse conversion method. Nevertheless, for additional dynamic range, $\\tau$ is augmented with other moduli. In particular, $\\phi=\\tau\\cup\\{2^{2q}+1\\}$, leads to efficient RC, where the added modulo is conjugate with the product $2^{2q}-1$ of $2^{q}\\mp 1\\in\\tau$. 
Therefore, the final step of $\\phi$-RC would be fast and low cost/power modulo-($2^{4q}-1$) addition. However, the $2q$-bit channel-width jeopardizes the existing delay-balance in $\\tau$. As a remedial solution, given that $2^{2q}+1=\\left(2^{q}-j\\right)\\left(2^{q}+j\\right)$, with $j=\\sqrt{-1}$, we design and implement modulo-($2^{2q}+1$) adders via two parallel $q$-bit moduli-($2^{q}\\mp j$) adders. The analytical and synthesis based evaluations of the proposed modulo-($2^{q}\\mp j$) adders show that the delay-balance of $\\tau$ is preserved with no cost overhead vs. $\\phi$. 
I","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"316-324"},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}