{"title":"Chiplet-Gym: Optimizing Chiplet-Based AI Accelerator Design With Reinforcement Learning","authors":"Kaniz Mishty;Mehdi Sadi","doi":"10.1109/TC.2024.3457740","DOIUrl":"10.1109/TC.2024.3457740","url":null,"abstract":"Modern Artificial Intelligence (AI) workloads demand computing systems with large silicon area to sustain throughput and competitive performance. However, prohibitive manufacturing costs and yield limitations at advanced tech nodes and die-size reaching the reticle limit restrain us from achieving this. With the recent innovations in advanced packaging technologies, chiplet-based architectures have gained significant attention in the AI hardware domain. However, the vast design space of chiplet-based AI accelerator design and the absence of system and package-level co-design methodology make it difficult for the designer to find the optimum design point regarding Power, Performance, Area, and manufacturing Cost (PPAC). This paper presents Chiplet-Gym, a Reinforcement Learning (RL)-based optimization framework to explore the vast design space of chiplet-based AI accelerators, encompassing the resource allocation, placement, and packaging architecture. We analytically model the PPAC of the chiplet-based AI accelerator and integrate it into an OpenAI gym environment to evaluate the design points. We also explore non-RL-based optimization approaches and combine these two approaches to ensure the robustness of the optimizer. The optimizer-suggested design point achieves \u0000<inline-formula><tex-math>$1.52boldsymbol{times}$</tex-math></inline-formula>\u0000 throughput, \u0000<inline-formula><tex-math>$0.27boldsymbol{times}$</tex-math></inline-formula>\u0000 energy, and \u0000<inline-formula><tex-math>$0.89boldsymbol{times}$</tex-math></inline-formula>\u0000 cost of its monolithic counterpart at iso-area.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"43-56"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging GPU in Homomorphic Encryption: Framework Design and Analysis of BFV Variants","authors":"Shiyu Shen;Hao Yang;Wangchen Dai;Lu Zhou;Zhe Liu;Yunlei Zhao","doi":"10.1109/TC.2024.3457733","DOIUrl":"10.1109/TC.2024.3457733","url":null,"abstract":"Homomorphic Encryption (HE) enhances data security by enabling computations on encrypted data, advancing privacy-focused computations. The BFV scheme, a promising HE scheme, raises considerable performance challenges. Graphics Processing Units (GPUs), with considerable parallel processing abilities, offer an effective solution. In this work, we present an in-depth study on accelerating and comparing BFV variants on GPUs, including Bajard-Eynard-Hasan-Zucca (BEHZ), Halevi-Polyakov-Shoup (HPS), and recent variants. We introduce a universal framework for all variants, propose optimized BEHZ implementation, and first support HPS variants with large parameter sets on GPUs. We also optimize low-level arithmetic and high-level operations, minimizing instructions for modular operations, enhancing hardware utilization for base conversion, and implementing efficient reuse strategies and fusion methods to reduce computational and memory consumption. Leveraging our framework, we offer comprehensive comparative analyses. Performance evaluation shows a 31.9\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over OpenFHE running on a multi-threaded CPU and 39.7% and 29.9% improvement for tensoring and relinearization over the state-of-the-art GPU BEHZ implementation. The leveled HPS variant records up to 4\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over other variants, positioning it as a highly promising alternative for specific applications.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2817-2829"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of Fast Sample Entropy for FPGAs","authors":"Chao Chen;Chengyu Liu;Jianqing Li;Bruno da Silva","doi":"10.1109/TC.2024.3457735","DOIUrl":"10.1109/TC.2024.3457735","url":null,"abstract":"Complexity measurement, essential in diverse fields like finance, biomedicine, climate science, and network traffic, demands real-time computation to mitigate risks and losses. Sample Entropy (SampEn) is an efficacious metric which quantifies the complexity by assessing the similarities among microscale patterns within the time-series data. Unfortunately, the conventional implementation of SampEn is computationally demanding, posing challenges for its application in real-time analysis, particularly for long time series. Field Programmable Gate Arrays (FPGAs) offer a promising solution due to their fast processing and energy efficiency, which can be customized to perform specific signal processing tasks directly in hardware. The presented work focuses on accelerating SampEn analysis on FPGAs for efficient time-series complexity analysis. A refined, fast, Lightweight SampEn architecture (LW SampEn) on FPGA, which is optimized to use sorted sequences to reduce computational complexity, is accelerated for FPGAs. Various sorting algorithms on FPGAs are assessed, and novel dynamic loop strategies and micro-architectures are proposed to tackle SampEn's undetermined search boundaries. Multi-source biomedical signals are used to profile the above design and select a proper architecture, underscoring the importance of customizing FPGA design for specific applications. Our optimized architecture achieves a 7x to 560x speedup over standard baseline architecture, enabling real-time processing of time-sensitive data.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"1-14"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Lagrange Multipliers-Driven Adaptive Offloading for Vehicular Edge Computing","authors":"Liang Zhao;Tianyu Li;Guiying Meng;Ammar Hawbani;Geyong Min;Ahmed Y. Al-Dubai;Albert Y. Zomaya","doi":"10.1109/TC.2024.3457729","DOIUrl":"10.1109/TC.2024.3457729","url":null,"abstract":"Vehicular Edge Computing (VEC) is a transportation-specific version of Mobile Edge Computing (MEC) designed for vehicular scenarios. Task offloading allows vehicles to send computational tasks to nearby Roadside Units (RSUs) in order to reduce the computation cost for the overall system. However, the state-of-the-art solutions have not fully addressed the challenge of large-scale task result feedback with low delay, due to the extremely flexible network structure and complex traffic data. In this paper, we explore the joint task offloading and resource allocation problem with result feedback cost in the VEC. In particular, this study develops a VEC computing offloading scheme, namely, a Lagrange multipliers-based adaptive computing offloading with prediction model, considering multiple RSUs and vehicles within their coverage areas. First, the VEC network architecture employs GAN to establish a prediction model, utilizing the powerful predictive capabilities of GAN to forecast the maximum distance of future trajectories, thereby reducing the decision space for task offloading. Subsequently, we propose a real-time adaptive model and adjust the parameters in different scenarios to accommodate the dynamic characteristic of the VEC network. Finally, we apply Lagrange Multiplier-based Non-Uniform Genetic Algorithm (LM-NUGA) to make task offloading decision. Effectively, this algorithm provides reliable and efficient computing services. The results from simulation indicate that our proposed scheme efficiently reduces the computation cost for the whole VEC system. This paves the way for a new generation of disruptive and reliable offloading schemes.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2868-2881"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Implementation of Unsigned Approximate Hybrid Square Rooters for Error-Resilient Applications","authors":"Lalit Bandil;Bal Chand Nagar","doi":"10.1109/TC.2024.3457731","DOIUrl":"10.1109/TC.2024.3457731","url":null,"abstract":"In this paper, the authors proposed an approximate hybrid square rooter (AHSQR). It is the combination of array and logarithmic-based square rooter (SQR) to create a balance between accuracy and hardware performance. An array-based SQR is utilized as an exact SQR (ESQR) to obtain the MSBs of output for high precision, while a logarithmic SQR is used to estimate the remaining output digits to enhance design metrics. A modified AHSQR (MAHSQR) is also proposed to retain accuracy at increasing degrees of approximation by computing the square root of LSBs using the ESQR unit. This reduces the mean relative error distance by up to 31% and the normalized mean error distance by up to 26%. Various accuracy metrics and hardware characteristics are evaluated and analyzed for 16-bit unsigned exact, state-of-the-art, and proposed SQRs. The proposed SQRs are designed using Verilog and implemented using Artix7 FPGA. The results show that the proposed SQRs performances are improved compared to the state-of-the-art methods by being approximately 70% smaller, 2.5 times faster, and consuming only 25% of the power of the ESQR. Applications of the proposed SQRs as a Sobel edge detector, and K-means clustering for image processing, and an envelope detector for communication systems are also included.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2734-2746"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAPE: Criticality-Aware Performance and Energy Optimization Policy for NCFET-Based Caches","authors":"Divya Praneetha Ravipati;Ramanuj Goel;Victor M. van Santen;Hussam Amrouch;Preeti Ranjan Panda","doi":"10.1109/TC.2024.3457734","DOIUrl":"10.1109/TC.2024.3457734","url":null,"abstract":"Caches are crucial yet power-hungry components in present-day computing systems. With the Negative Capacitance Fin Field-Effect Transistor (NCFET) gaining significant attention due to its internal voltage amplification, allowing for better operation at lower voltages (stronger ON-current and reduced leakage current), the introduction of NCFET technology in caches can reduce power consumption without loss in performance. Apart from the benefits offered by the technology, we leverage the unique characteristics offered by NCFETs and propose a dynamic voltage scaling based criticality-aware performance and energy optimization policy (CAPE) for on-chip caches. We present the first work towards optimizing energy in NCFET-based caches with minimal impact on performance. Compared to operating at a nominal voltage of 0.7 V, CAPE shows improvement in Last-Level Cache (LLC) energy savings by up to 19.2%, while the baseline policies devised for traditional CMOS- (/FinFET-) based caches are ineffective in improving NCFET-based LLC energy savings. Compared to the considered baseline policies, our CAPE policy also demonstrates better LLC energy-delay product (EDP) and throughput savings.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2830-2843"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressed Test Pattern Generation for Deep Neural Networks","authors":"Dina A. Moussa;Michael Hefenbrock;Mehdi Tahoori","doi":"10.1109/TC.2024.3457738","DOIUrl":"10.1109/TC.2024.3457738","url":null,"abstract":"Deep neural networks (DNNs) have emerged as an effective approach in many artificial intelligence tasks. Several specialized accelerators are often used to enhance DNN's performance and lower their energy costs. However, the presence of faults can drastically impair the performance and accuracy of these accelerators. Usually, many test patterns are required for certain types of faults to reach a target fault coverage, which in turn hence increases the testing overhead and storage cost, particularly for in-field testing. For this reason, compression is typically done after test generation step to reduce the storage cost for the generated test patterns. However, compression is more efficient when considered in an earlier stage. This paper generates the test pattern in a compressed form to require less storage. This is done by generating all test patterns as a linear combination of a set of jointly used test patterns (basis), for which only the coefficients need to be stored. The fault coverage achieved by the generated test patterns is compared to that of the adversarial and randomly generated test images. The experimental results showed that our proposed test pattern outperformed and achieved high fault coverage (up to 99.99%) and a high compression ratio (up to 307.2\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000).","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"307-315"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CUSPX: Efficient GPU Implementations of Post-Quantum Signature SPHINCS+","authors":"Ziheng Wang;Xiaoshe Dong;Heng Chen;Yan Kang;Qiang Wang","doi":"10.1109/TC.2024.3457736","DOIUrl":"10.1109/TC.2024.3457736","url":null,"abstract":"Quantum computers pose a serious threat to existing cryptographic systems. While Post-Quantum Cryptography (PQC) offers resilience against quantum attacks, its performance limitations often hinder widespread adoption. Among the three National Institute of Standards and Technology (NIST)-selected general-purpose PQC schemes, SPHINCS\u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000 is particularly susceptible to these limitations. We introduce CUSPX (\u0000<u>CU</u>\u0000DA \u0000<u>SP</u>\u0000HIN\u0000<u>CS</u>\u0000 \u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000), the first large-scale parallel implementation of SPHINCS\u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000 capable of running across 10,000 cores. CUSPX leverages a novel three-level parallelism framework, applying it to \u0000<i>algorithmic parallelism</i>\u0000, \u0000<i>data parallelism</i>\u0000, and \u0000<i>hybrid parallelism</i>\u0000. Notably, CUSPX introduces parallel Merkle tree construction algorithms for arbitrary parallel scales and several load-balancing solutions, further enhancing performance. By treating tasks parallelism as the top level of parallelism, CUSPX provides a four-level parallel scheme that can run with any number of tasks. Evaluated on a single GeForce RTX 3090 using the SPHINCS\u0000<inline-formula><tex-math>${}^{+}$</tex-math></inline-formula>\u0000-SHA-256-128s-simple parameter set, CUSPX achieves a single task's signature generation latency of 0.67 ms, demonstrating a 5,105\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over a single-thread version and an 18.50\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over the previous fastest implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"15-28"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Component Dependencies Based Network-on-Chip Test","authors":"Letian Huang;Tianjin Zhao;Ziren Wang;Junkai Zhan;Junshi Wang;Xiaohang Wang","doi":"10.1109/TC.2024.3457732","DOIUrl":"10.1109/TC.2024.3457732","url":null,"abstract":"On-line test of NoC is essential for its reliability. This paper proposed an integral test solution for on-line test of NoC to reduce the test cost and improve the reliability of NOC. The test solution includes a new partitioning method, as well as a test method and a test schedule which are based on the proposed partitioning method. The new partitioning method partitions the NoC into a new type of basis unit under test (UUT) named as interdependent components based unit under test (iDC-UUT), which applies component test methods. The iDC-UUT have very low level of functional interdependency and simple physical connection, which results in small test overhead and high test coverage. The proposed test method consists of DFT architecture, test wrapper and test vectors, which can speed-up the test procedure and further improve the test coverage. The proposed test schedule reduces the blockage probability of data packets during testing by increasing the degree of test disorder, so as to further reduce the test cost. Experimental results show that the proposed test solution reduces power and area by 12.7% and 22.7% over an existing test solution. The average latency is reduced by 22.6% to 38.4% over the existing test solution.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2805-2816"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FLALM: A Flexible Low Area-Latency Montgomery Modular Multiplication on FPGA","authors":"Yujun Xie;Yuan Liu;Xin Zheng;Bohan Lan;Dengyun Lei;Dehao Xiang;Shuting Cai;Xiaoming Xiong","doi":"10.1109/TC.2024.3457739","DOIUrl":"10.1109/TC.2024.3457739","url":null,"abstract":"Montgomery Modular Multiplication (MMM) is widely used in many public key cryptography systems. This paper presents a Flexible Low Area-Latency MMM (FLALM) implementation, which supports Generic Montgomery Modular Multiplication (GMM) and Square Montgomery Modular Multiplication (SMM) operations. A new SMM schedule for the Finely Integrated Product Scanning (FIPS) GMM algorithm is proposed to accelerate SMM with tiny additional design. Furthermore, a new FIPS dual-schedule is proposed to solve the data hazards of this algorithm. Finally, we explore the trade-off between area and latency, and present the FLALM to accelerate GMM and SMM. The FLALM is implemented on FPGA (Virtex-7 platform). The results show that the area*latency (AL) value of FLALM (wordsize \u0000<inline-formula><tex-math>$w$</tex-math></inline-formula>\u0000=128) is 38.1% and 44.7% better than the previous state-of-art scalable references when performing 1024-bit and 2048-bit GMM, respectively. Moreover, when computing SMM, the advantage of AL value is raised to 73.7% and 86.3% respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"29-42"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}