{"title":"FiDRL: Flexible Invocation-Based Deep Reinforcement Learning for DVFS Scheduling in Embedded Systems","authors":"Jingjin Li;Weixiong Jiang;Yuting He;Qingyu Yang;Anqi Gao;Yajun Ha;Ender Özcan;Ruibin Bai;Tianxiang Cui;Heng Yu","doi":"10.1109/TC.2024.3465933","DOIUrl":"https://doi.org/10.1109/TC.2024.3465933","url":null,"abstract":"Deep Reinforcement Learning (DRL)-based Dynamic Voltage Frequency Scaling (DVFS) has shown great promise for energy conservation in embedded systems. While many works were devoted to validating its efficacy or improving its performance, few discuss the feasibility of the DRL agent deployment for embedded computing. State-of-the-art approaches focus on the miniaturization of agents’ inferential networks, such as pruning and quantization, to minimize their energy and resource consumption. However, this spatial-based paradigm still proves inadequate for resource-stringent systems. In this paper, we address the feasibility from a temporal perspective, where FiDRL, a flexible invocation-based DRL model is proposed to judiciously invoke itself to minimize the overall system energy consumption, given that the DRL agent incurs non-negligible energy overhead during invocations. Our approach is three-fold: (1) FiDRL that extends DRL by incorporating the agent's invocation interval into the action space to achieve invocation flexibility; (2) a FiDRL-based DVFS approach for both inter- and intra-task scheduling that minimizes the overall execution energy consumption; and (3) a FiDRL-based DVFS platform design and an on/off-chip hybrid algorithm specialized for training the DRL agent for embedded systems. 
Experimental results show that FiDRL achieves a 55.1% reduction in agent invocation cost and a 23.3% reduction in overall energy consumption compared to state-of-the-art approaches.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"71-85"},"PeriodicalIF":3.6,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
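The FiDRL record above describes extending the DRL action space with the agent's own invocation interval, so that each decision also chooses how long to wait before the next (costly) invocation. A minimal toy sketch of that idea follows; the voltage/frequency levels, intervals, and all energy numbers are invented placeholders, not the paper's model:

```python
import random

# Hypothetical sketch: the action space is the Cartesian product of DVFS
# levels and invocation intervals, so each agent decision also fixes how many
# scheduling ticks pass before the agent is invoked again.
VF_LEVELS = [0, 1, 2, 3]      # assumed voltage/frequency settings
INTERVALS = [1, 2, 4, 8]      # assumed invocation intervals (in ticks)
ACTIONS = [(vf, k) for vf in VF_LEVELS for k in INTERVALS]

def run_episode(policy, ticks=32, invoke_cost=5.0):
    """Accumulate task energy plus a per-invocation agent overhead."""
    energy, t = 0.0, 0
    while t < ticks:
        vf, interval = policy()        # one agent invocation
        energy += invoke_cost          # non-negligible agent overhead
        for _ in range(min(interval, ticks - t)):
            energy += 10.0 / (vf + 1)  # toy per-tick task energy model
            t += 1
    return energy

random.seed(0)
total = run_episode(lambda: random.choice(ACTIONS))
```

With this toy model, a policy that always picks the longest interval pays the invocation overhead 4 times over 32 ticks instead of 32 times, which is the trade-off FiDRL's extended action space lets the agent learn.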
{"title":"Remora: A Low-Latency DAG-Based BFT Through Optimistic Paths","authors":"Xiaohai Dai;Wei Li;Guanxiong Wang;Jiang Xiao;Haoyang Chen;Shufei Li;Albert Y. Zomaya;Hai Jin","doi":"10.1109/TC.2024.3461309","DOIUrl":"https://doi.org/10.1109/TC.2024.3461309","url":null,"abstract":"Standing as a foundational element within blockchain systems, the \u0000<i>Byzantine Fault Tolerant</i>\u0000 (BFT) consensus has garnered significant attention over the past decade. The introduction of a \u0000<i>Directed Acyclic Directed</i>\u0000 (DAG) structure into BFT consensus design, termed DAG-based BFT, has emerged to bolster throughput. However, prevalent DAG-based protocols grapple with substantial latency issues, suffering from a latency gap compared to non-DAG protocols. For instance, leading-edge DAG-based protocols named GradedDAG and BullShark exhibit a good-case latency of \u0000<inline-formula><tex-math>$4$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$6$</tex-math></inline-formula>\u0000 communication rounds, respectively. In contrast, the non-DAG protocol, exemplified by PBFT, attains a latency of \u0000<inline-formula><tex-math>$3$</tex-math></inline-formula>\u0000 rounds in favorable conditions. To bridge this latency gap, we propose Remora, a novel DAG-based BFT protocol. Remora achieves a reduced latency of \u0000<inline-formula><tex-math>$3$</tex-math></inline-formula>\u0000 rounds by incorporating optimistic paths. At its core, Remora endeavors to commit blocks through the optimistic path initially, facilitating low latency in favorable situations. Conversely, in unfavorable scenarios, Remora seamlessly transitions to a pessimistic path to ensure liveness. 
Various experiments validate Remora's feasibility and efficiency, highlighting its potential as a robust solution in the realm of BFT consensus protocols.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"57-70"},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10680428","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
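The optimistic/pessimistic dual-path idea in the Remora record above can be caricatured in a few lines. This is a toy sketch with assumed quorum thresholds (unanimous votes for the fast path, the ordinary 2f+1 quorum for the fallback), not the paper's actual protocol rules:

```python
# Toy sketch of a dual-path commit rule (thresholds are assumptions, not
# Remora's actual quorum sizes): commit fast when every replica voted, and
# fall back to a slower-but-live pessimistic path on a standard BFT quorum.
def try_commit(votes, n, f):
    """Decide (committed, path) for one block from its first-round vote count;
    n = 3f + 1 replicas, at most f of them Byzantine."""
    if votes == n:              # unanimous: optimistic (low-latency) path
        return True, "optimistic"
    if votes >= 2 * f + 1:      # ordinary BFT quorum: pessimistic path
        return True, "pessimistic"
    return False, "none"        # not enough votes yet: keep waiting

n, f = 7, 2                     # n = 3f + 1
committed, path = try_commit(7, n, f)
```

The point of the sketch is only the control flow: favorable executions commit on the fast branch, unfavorable ones still terminate via the fallback branch, which is how liveness is preserved.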
{"title":"Balanced Modular Addition for the Moduli Set $ {2^{q},2^{q}mp 1,2^{2q}+1}${2q,2q∓1,22q+1} via Moduli-($ 2^{q}mp sqrt{-1}$2q∓-1) Adders","authors":"Ghassem Jaberipur;Elham Rahman;Jeong-A Lee","doi":"10.1109/TC.2024.3461235","DOIUrl":"https://doi.org/10.1109/TC.2024.3461235","url":null,"abstract":"Moduli-set \u0000<inline-formula><tex-math>$ mathbf{tau}={2^{boldsymbol{q}},2^{boldsymbol{q}}pm 1}$</tex-math></inline-formula>\u0000 is often the base of choice for realization of digital computations via residue number systems. The optimum arithmetic performance in parallel residue channels, is generally achieved via equal bit-width residues (e.g., \u0000<inline-formula><tex-math>$ boldsymbol{q}~ mathbf{i}mathbf{n}~ mathbf{tau}$</tex-math></inline-formula>\u0000) that usually leads to equal computation speed within all the residue channels. However, the commonly difficult and costly task of reverse conversion (RC) is often eased in the existence of conjugate moduli. For example, \u0000<inline-formula><tex-math>$ 2^{boldsymbol{q}}mp 1in mathbf{tau}$</tex-math></inline-formula>\u0000, lead to the efficient modulo-(\u0000<inline-formula><tex-math>$ 2^{2boldsymbol{q}}-1$</tex-math></inline-formula>\u0000) addition, as the bulk of \u0000<inline-formula><tex-math>$ mathbf{tau}$</tex-math></inline-formula>\u0000-RC, via the New-CRT reverse conversion method. Nevertheless, for additional dynamic range, \u0000<inline-formula><tex-math>$ mathbf{tau}$</tex-math></inline-formula>\u0000 is augmented with other moduli. In particular, \u0000<inline-formula><tex-math>$ mathbf{phi}=mathbf{tau}cup {2^{2boldsymbol{q}}+1}$</tex-math></inline-formula>\u0000, leads to efficient RC, where the added modulo is conjugate with the product \u0000<inline-formula><tex-math>$ 2^{2boldsymbol{q}}-1$</tex-math></inline-formula>\u0000 of \u0000<inline-formula><tex-math>$ 2^{boldsymbol{q}}mp 1in mathbf{tau}$</tex-math></inline-formula>\u0000. 
Therefore, the final step of \u0000<inline-formula><tex-math>$ mathbf{phi}$</tex-math></inline-formula>\u0000-RC would be fast and low cost/power modulo-(\u0000<inline-formula><tex-math>$ 2^{4boldsymbol{q}}-1$</tex-math></inline-formula>\u0000) addition. However, the \u0000<inline-formula><tex-math>$ 2boldsymbol{q}$</tex-math></inline-formula>\u0000-bit channel-width jeopardizes the existing delay-balance in \u0000<inline-formula><tex-math>$ mathbf{tau}$</tex-math></inline-formula>\u0000. As a remedial solution, given that \u0000<inline-formula><tex-math>$ 2^{2boldsymbol{q}}+1=left(2^{boldsymbol{q}}-boldsymbol{j}right)left(2^{boldsymbol{q}}+boldsymbol{j}right)$</tex-math></inline-formula>\u0000, with \u0000<inline-formula><tex-math>$ boldsymbol{j}=sqrt{-1}$</tex-math></inline-formula>\u0000, we design and implement modulo-(\u0000<inline-formula><tex-math>$ 2^{2boldsymbol{q}}+1$</tex-math></inline-formula>\u0000) adders via two parallel \u0000<inline-formula><tex-math>$ boldsymbol{q}$</tex-math></inline-formula>\u0000-bit moduli-(\u0000<inline-formula><tex-math>$ 2^{boldsymbol{q}}mp boldsymbol{j}$</tex-math></inline-formula>\u0000) adders. The analytical and synthesis based evaluations of the proposed modulo-(\u0000<inline-formula><tex-math>$ 2^{boldsymbol{q}}mp boldsymbol{j}$</tex-math></inline-formula>\u0000) adders show that the delay-balance of \u0000<inline-formula><tex-math>$ mathbf{tau}$</tex-math></inline-formula>\u0000 is preserved with no cost overhead vs. \u0000<inline-formula><tex-math>$ mathbf{phi}$</tex-math></inline-formula>\u0000. 
I","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"316-324"},"PeriodicalIF":3.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging GPU in Homomorphic Encryption: Framework Design and Analysis of BFV Variants","authors":"Shiyu Shen;Hao Yang;Wangchen Dai;Lu Zhou;Zhe Liu;Yunlei Zhao","doi":"10.1109/TC.2024.3457733","DOIUrl":"10.1109/TC.2024.3457733","url":null,"abstract":"Homomorphic Encryption (HE) enhances data security by enabling computations on encrypted data, advancing privacy-focused computations. The BFV scheme, a promising HE scheme, raises considerable performance challenges. Graphics Processing Units (GPUs), with considerable parallel processing abilities, offer an effective solution. In this work, we present an in-depth study on accelerating and comparing BFV variants on GPUs, including Bajard-Eynard-Hasan-Zucca (BEHZ), Halevi-Polyakov-Shoup (HPS), and recent variants. We introduce a universal framework for all variants, propose optimized BEHZ implementation, and first support HPS variants with large parameter sets on GPUs. We also optimize low-level arithmetic and high-level operations, minimizing instructions for modular operations, enhancing hardware utilization for base conversion, and implementing efficient reuse strategies and fusion methods to reduce computational and memory consumption. Leveraging our framework, we offer comprehensive comparative analyses. Performance evaluation shows a 31.9\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over OpenFHE running on a multi-threaded CPU and 39.7% and 29.9% improvement for tensoring and relinearization over the state-of-the-art GPU BEHZ implementation. 
The leveled HPS variant records up to a 4× speedup over the other variants, positioning it as a highly promising alternative for specific applications.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2817-2829"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chiplet-Gym: Optimizing Chiplet-Based AI Accelerator Design With Reinforcement Learning","authors":"Kaniz Mishty;Mehdi Sadi","doi":"10.1109/TC.2024.3457740","DOIUrl":"10.1109/TC.2024.3457740","url":null,"abstract":"Modern Artificial Intelligence (AI) workloads demand computing systems with large silicon area to sustain throughput and competitive performance. However, prohibitive manufacturing costs and yield limitations at advanced tech nodes and die-size reaching the reticle limit restrain us from achieving this. With the recent innovations in advanced packaging technologies, chiplet-based architectures have gained significant attention in the AI hardware domain. However, the vast design space of chiplet-based AI accelerator design and the absence of system and package-level co-design methodology make it difficult for the designer to find the optimum design point regarding Power, Performance, Area, and manufacturing Cost (PPAC). This paper presents Chiplet-Gym, a Reinforcement Learning (RL)-based optimization framework to explore the vast design space of chiplet-based AI accelerators, encompassing the resource allocation, placement, and packaging architecture. We analytically model the PPAC of the chiplet-based AI accelerator and integrate it into an OpenAI gym environment to evaluate the design points. We also explore non-RL-based optimization approaches and combine these two approaches to ensure the robustness of the optimizer. 
The optimizer-suggested design point achieves 1.52× throughput, 0.27× energy, and 0.89× cost of its monolithic counterpart at iso-area.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"43-56"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
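The Chiplet-Gym record above describes wrapping an analytical PPAC model in a gym-style environment so an RL agent can evaluate design points. The sketch below mimics that structure with a deliberately invented one-knob model (chiplet count only; every cost coefficient is a placeholder, not the paper's PPAC model), and a greedy search stands in for the RL agent:

```python
# Minimal gym-style environment sketch of the Chiplet-Gym idea. All cost
# numbers in _ppac_reward are invented placeholders, not the paper's model.
class ChipletEnv:
    def __init__(self, max_chiplets=8):
        self.max_chiplets = max_chiplets
        self.reset()

    def reset(self):
        self.n_chiplets = 1
        return self.n_chiplets

    def _ppac_reward(self, n):
        throughput = n ** 0.8            # sub-linear scaling (placeholder)
        energy = 1.0 + 0.1 * n           # inter-chiplet link energy (placeholder)
        cost = 0.5 * n + 2.0 / n         # yield improves, packaging cost grows
        return throughput / (energy * cost)

    def step(self, action):              # action: +1 / -1 chiplet
        self.n_chiplets = min(max(self.n_chiplets + action, 1), self.max_chiplets)
        reward = self._ppac_reward(self.n_chiplets)
        return self.n_chiplets, reward, False, {}

env = ChipletEnv()
# Exhaustive scan as a stand-in for the RL agent (feasible for one knob):
best = max(range(1, env.max_chiplets + 1), key=env._ppac_reward)
```

Even this toy model has an interior optimum (too few chiplets hurts yield-driven cost, too many hurts link energy and packaging), which is the kind of trade-off surface the RL agent navigates in the real, much larger design space.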
{"title":"Acceleration of Fast Sample Entropy for FPGAs","authors":"Chao Chen;Chengyu Liu;Jianqing Li;Bruno da Silva","doi":"10.1109/TC.2024.3457735","DOIUrl":"10.1109/TC.2024.3457735","url":null,"abstract":"Complexity measurement, essential in diverse fields like finance, biomedicine, climate science, and network traffic, demands real-time computation to mitigate risks and losses. Sample Entropy (SampEn) is an efficacious metric which quantifies the complexity by assessing the similarities among microscale patterns within the time-series data. Unfortunately, the conventional implementation of SampEn is computationally demanding, posing challenges for its application in real-time analysis, particularly for long time series. Field Programmable Gate Arrays (FPGAs) offer a promising solution due to their fast processing and energy efficiency, which can be customized to perform specific signal processing tasks directly in hardware. The presented work focuses on accelerating SampEn analysis on FPGAs for efficient time-series complexity analysis. A refined, fast, Lightweight SampEn architecture (LW SampEn) on FPGA, which is optimized to use sorted sequences to reduce computational complexity, is accelerated for FPGAs. Various sorting algorithms on FPGAs are assessed, and novel dynamic loop strategies and micro-architectures are proposed to tackle SampEn's undetermined search boundaries. Multi-source biomedical signals are used to profile the above design and select a proper architecture, underscoring the importance of customizing FPGA design for specific applications. 
Our optimized architecture achieves a 7× to 560× speedup over the standard baseline architecture, enabling real-time processing of time-sensitive data.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"1-14"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
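For reference, the quantity the record above accelerates is straightforward to state in software. The following is a plain-Python implementation of the O(N²) SampEn baseline (template matching with a Chebyshev-distance tolerance); the paper's sorted-sequence and FPGA-specific optimizations are deliberately omitted:

```python
import math

# Reference implementation of Sample Entropy: the ratio of length-(m+1) to
# length-m template matches, under a Chebyshev-distance tolerance r. This is
# the quadratic baseline that hardware acceleration targets.
def sampen(series, m=2, r=0.2):
    n = len(series)

    def count_matches(k):
        # Count ordered template pairs of length k within tolerance r.
        c = 0
        for i in range(n - k):
            for j in range(i + 1, n - k + 1):
                if max(abs(series[i + t] - series[j + t]) for t in range(k)) <= r:
                    c += 1
        return c

    b, a = count_matches(m), count_matches(m + 1)
    # Lower values = more regular signal; undefined when no matches exist.
    return math.log(b / a) if a > 0 and b > 0 else float("inf")
```

A perfectly regular series (e.g. a strict alternation) yields a small SampEn, since almost every length-m match extends to a length-(m+1) match; the nested loops make the cost quadratic in the series length, which is exactly why long real-time signals need the FPGA treatment.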
{"title":"Novel Lagrange Multipliers-Driven Adaptive Offloading for Vehicular Edge Computing","authors":"Liang Zhao;Tianyu Li;Guiying Meng;Ammar Hawbani;Geyong Min;Ahmed Y. Al-Dubai;Albert Y. Zomaya","doi":"10.1109/TC.2024.3457729","DOIUrl":"10.1109/TC.2024.3457729","url":null,"abstract":"Vehicular Edge Computing (VEC) is a transportation-specific version of Mobile Edge Computing (MEC) designed for vehicular scenarios. Task offloading allows vehicles to send computational tasks to nearby Roadside Units (RSUs) in order to reduce the computation cost for the overall system. However, the state-of-the-art solutions have not fully addressed the challenge of large-scale task result feedback with low delay, due to the extremely flexible network structure and complex traffic data. In this paper, we explore the joint task offloading and resource allocation problem with result feedback cost in the VEC. In particular, this study develops a VEC computing offloading scheme, namely, a Lagrange multipliers-based adaptive computing offloading with prediction model, considering multiple RSUs and vehicles within their coverage areas. First, the VEC network architecture employs GAN to establish a prediction model, utilizing the powerful predictive capabilities of GAN to forecast the maximum distance of future trajectories, thereby reducing the decision space for task offloading. Subsequently, we propose a real-time adaptive model and adjust the parameters in different scenarios to accommodate the dynamic characteristic of the VEC network. Finally, we apply Lagrange Multiplier-based Non-Uniform Genetic Algorithm (LM-NUGA) to make task offloading decision. Effectively, this algorithm provides reliable and efficient computing services. The results from simulation indicate that our proposed scheme efficiently reduces the computation cost for the whole VEC system. 
This paves the way for a new generation of disruptive and reliable offloading schemes.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2868-2881"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Implementation of Unsigned Approximate Hybrid Square Rooters for Error-Resilient Applications","authors":"Lalit Bandil;Bal Chand Nagar","doi":"10.1109/TC.2024.3457731","DOIUrl":"10.1109/TC.2024.3457731","url":null,"abstract":"In this paper, the authors proposed an approximate hybrid square rooter (AHSQR). It is the combination of array and logarithmic-based square rooter (SQR) to create a balance between accuracy and hardware performance. An array-based SQR is utilized as an exact SQR (ESQR) to obtain the MSBs of output for high precision, while a logarithmic SQR is used to estimate the remaining output digits to enhance design metrics. A modified AHSQR (MAHSQR) is also proposed to retain accuracy at increasing degrees of approximation by computing the square root of LSBs using the ESQR unit. This reduces the mean relative error distance by up to 31% and the normalized mean error distance by up to 26%. Various accuracy metrics and hardware characteristics are evaluated and analyzed for 16-bit unsigned exact, state-of-the-art, and proposed SQRs. The proposed SQRs are designed using Verilog and implemented using Artix7 FPGA. The results show that the proposed SQRs performances are improved compared to the state-of-the-art methods by being approximately 70% smaller, 2.5 times faster, and consuming only 25% of the power of the ESQR. 
Applications of the proposed SQRs in Sobel edge detection and K-means clustering for image processing, and in envelope detection for communication systems, are also included.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2734-2746"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
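The logarithmic half of the hybrid unit in the record above rests on the identity log2(√x) = log2(x)/2 combined with a linear-mantissa (Mitchell-style) approximation of the logarithm. The sketch below illustrates only that approximation in software (the paper's array-based exact SQR for the MSBs, and its hardware structure, are not reproduced):

```python
# Mitchell-style logarithmic square-root approximation: approximate log2(x)
# from the leading-one position plus the remaining bits read as a linear
# fraction, halve it, then invert the same linear-log approximation.
def log_sqrt_approx(x):
    assert x > 0
    k = x.bit_length() - 1             # position of the leading one
    frac = (x - (1 << k)) / (1 << k)   # linear-mantissa approx of log2's fraction
    half_log = (k + frac) / 2          # log2(sqrt(x)) = log2(x) / 2
    ki, kf = int(half_log), half_log - int(half_log)
    return (1 << ki) * (1 + kf)        # antilog via the same linear approximation

# Exact at even powers of two, within a few percent elsewhere:
approx = log_sqrt_approx(2)            # vs. sqrt(2) ~ 1.414
```

Over the 16-bit unsigned range this style of approximation stays within roughly 6% relative error, worst at odd powers of two, which motivates the paper's hybrid: exact array hardware for the MSBs, a cheap logarithmic estimate for the rest.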
{"title":"CAPE: Criticality-Aware Performance and Energy Optimization Policy for NCFET-Based Caches","authors":"Divya Praneetha Ravipati;Ramanuj Goel;Victor M. van Santen;Hussam Amrouch;Preeti Ranjan Panda","doi":"10.1109/TC.2024.3457734","DOIUrl":"10.1109/TC.2024.3457734","url":null,"abstract":"Caches are crucial yet power-hungry components in present-day computing systems. With the Negative Capacitance Fin Field-Effect Transistor (NCFET) gaining significant attention due to its internal voltage amplification, allowing for better operation at lower voltages (stronger ON-current and reduced leakage current), the introduction of NCFET technology in caches can reduce power consumption without loss in performance. Apart from the benefits offered by the technology, we leverage the unique characteristics offered by NCFETs and propose a dynamic voltage scaling based criticality-aware performance and energy optimization policy (CAPE) for on-chip caches. We present the first work towards optimizing energy in NCFET-based caches with minimal impact on performance. Compared to operating at a nominal voltage of 0.7 V, CAPE shows improvement in Last-Level Cache (LLC) energy savings by up to 19.2%, while the baseline policies devised for traditional CMOS- (/FinFET-) based caches are ineffective in improving NCFET-based LLC energy savings. 
Compared to the considered baseline policies, our CAPE policy also demonstrates better LLC energy-delay product (EDP) and throughput savings.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2830-2843"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressed Test Pattern Generation for Deep Neural Networks","authors":"Dina A. Moussa;Michael Hefenbrock;Mehdi Tahoori","doi":"10.1109/TC.2024.3457738","DOIUrl":"10.1109/TC.2024.3457738","url":null,"abstract":"Deep neural networks (DNNs) have emerged as an effective approach in many artificial intelligence tasks. Several specialized accelerators are often used to enhance DNN's performance and lower their energy costs. However, the presence of faults can drastically impair the performance and accuracy of these accelerators. Usually, many test patterns are required for certain types of faults to reach a target fault coverage, which in turn hence increases the testing overhead and storage cost, particularly for in-field testing. For this reason, compression is typically done after test generation step to reduce the storage cost for the generated test patterns. However, compression is more efficient when considered in an earlier stage. This paper generates the test pattern in a compressed form to require less storage. This is done by generating all test patterns as a linear combination of a set of jointly used test patterns (basis), for which only the coefficients need to be stored. The fault coverage achieved by the generated test patterns is compared to that of the adversarial and randomly generated test images. 
The experimental results show that our proposed test patterns outperform these alternatives, achieving high fault coverage (up to 99.99%) and a high compression ratio (up to 307.2×).","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"307-315"},"PeriodicalIF":3.6,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
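The storage scheme in the record above — every test pattern as a linear combination of a small shared basis, with only the coefficients stored — can be shown in miniature. The basis vectors and coefficients below are invented for illustration, not taken from the paper:

```python
# Toy sketch of compressed test-pattern storage: patterns live in the span of
# a small shared basis, so per-pattern storage is just a coefficient vector.
basis = [
    [1.0, 0.0, 1.0, 0.0],   # jointly used basis pattern 0 (invented)
    [0.0, 1.0, 0.0, 1.0],   # jointly used basis pattern 1 (invented)
]

def expand(coeffs):
    """Reconstruct a full test pattern from its stored coefficients."""
    return [sum(c * b[i] for c, b in zip(coeffs, basis))
            for i in range(len(basis[0]))]

stored = [(2.0, 1.0), (0.5, -1.0)]      # 2 coefficients per pattern...
patterns = [expand(c) for c in stored]  # ...expanded to 4-value patterns on demand
```

Here each 4-value pattern costs only 2 stored coefficients (the basis is amortized across all patterns), which is the mechanism behind the large compression ratios reported, generated directly in compressed form rather than compressed after the fact.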