{"title":"Runtime Monitoring of ML-Based Scheduling Algorithms Toward Robust Domain-Specific SoCs","authors":"A. Alper Goksoy;Alish Kanani;Satrajit Chatterjee;Umit Ogras","doi":"10.1109/TCAD.2024.3445815","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3445815","url":null,"abstract":"Machine learning (ML) algorithms are being rapidly adopted to perform dynamic resource management tasks in heterogeneous system on chips. For example, ML-based task schedulers can make quick, high-quality decisions at runtime. Like any ML model, these offline-trained policies depend critically on the representative power of the training data. Hence, their performance may diminish or even catastrophically fail under unknown workloads, especially new applications. This article proposes a novel framework to continuously monitor the system to detect unforeseen scenarios using a gradient-based generalization metric called coherence. The proposed framework accurately determines whether the current policy generalizes to new inputs. If not, it incrementally trains the ML scheduler to ensure the robustness of the task-scheduling decisions. The proposed framework is evaluated thoroughly with a domain-specific SoC and six real-world applications. It can detect whether the trained scheduler generalizes to the current workload with 88.75%–98.39% accuracy. Furthermore, it enables 1.1×–14× faster execution time when the scheduler is incrementally trained. 
Finally, overhead analysis performed on an Nvidia Jetson Xavier NX board shows that the proposed framework can run as a real-time background task.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4202-4213"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FDPUF: Frequency-Domain PUF for Robust Authentication of Edge Devices","authors":"Shubhra Deb Paul;Aritra Dasgupta;Swarup Bhunia","doi":"10.1109/TCAD.2024.3447211","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447211","url":null,"abstract":"Counterfeiting, overproduction, and cloning of integrated circuits (ICs) and associated hardware have emerged as major security concerns in the modern globalized microelectronics supply chain. One way to combat these issues effectively is to deploy hardware authentication techniques that utilize physical unclonable functions (PUFs). PUFs utilize intrinsic variations in hardware that occur during the manufacturing and fabrication process to generate device-specific fingerprints or immutable signatures that cannot be replicated by counterfeits and clones. However, unavoidable factors like environmental noise and harmonics can significantly deteriorate the quality of the PUF signature. Besides, conventional PUF solutions are generally not amenable to in-field authentication of hardware, which has emerged as a critical need for Internet of Things (IoT) edge devices to detect physical attacks on them. In this article, we introduce frequency-domain PUF or FDPUF, a novel PUF that analyzes time-domain current waveforms in the frequency domain to create high-quality authentication signatures that are suitable for in-field authentication. FDPUF decomposes electrical signals into their spectral coefficients, filters out unnecessary low-energy components, reconstructs the waveforms, and generates high-quality digital fingerprints for device authentication purposes. Compared to the existing authentication mechanisms, the higher quality of the signatures through the frequency-domain analysis makes the proposed FDPUF more suitable for protecting the integrity of the edge computing hardware. 
We perform experimental measurements on FPGA and analyze FDPUF properties using the National Institute of Standards and Technology test suite to demonstrate that the FDPUF provides better uniqueness and robustness than its time-domain counterpart while being attractive for in-field authentication.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3479-3490"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GOURD: Tensorizing Streaming Applications to Generate Multi-Instance Compute Platforms","authors":"Patrick Schmid;Paul Palomero Bernardo;Christoph Gerum;Oliver Bringmann","doi":"10.1109/TCAD.2024.3445810","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3445810","url":null,"abstract":"In this article, we rethink the dataflow processing paradigm to a higher level of abstraction to automate the generation of multi-instance compute and memory platforms with interfaces to I/O devices (sensors and actuators). Since the different compute instances (NPUs, CPUs, DSPs, etc.) and I/O devices do not necessarily have compatible interfaces on a dataflow level, an automated translation is required. However, in multidimensional dataflow scenarios, it becomes inherently difficult to reason about buffer sizes and iteration order without knowing the shape of the data access pattern (DAP) that the dataflow follows. To capture this shape and the platform composition, we define a domain-specific representation (DSR) and devise a toolchain to generate a synthesizable platform, including appropriate streaming buffers for platform-specific tensorization of the data between incompatible interfaces. 
This allows platforms, such as sensor edge AI devices, to be easily specified by simply focusing on the shape of the data provided by the sensors and transmitted among compute units, giving the ability to evaluate and generate different dataflow design alternatives with significantly reduced design time.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4166-4177"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10745814","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DREAMx: A Data-Driven Error Estimation Methodology for Adders Composed of Cascaded Approximate Units","authors":"Muhammad Abdullah Hanif;Ayoub Arous;Muhammad Shafique","doi":"10.1109/TCAD.2024.3447209","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447209","url":null,"abstract":"Due to the significance and broad utilization of adders in computing systems, the design of low-power approximate adders (LPAAs) has received a significant amount of attention from the system design community. However, the selection and deployment of appropriate approximate modules require a thorough design space exploration, which is (in general) an extremely time-consuming process. Toward reducing the exploration time, different error estimation techniques have been proposed in the literature for evaluating the quality metrics of approximate adders. However, most of them are based on certain assumptions that limit the usability of such techniques for real-world settings. In this work, we highlight the impact of those assumptions on the quality of error estimates provided by the state-of-the-art techniques and how they limit the use of such techniques for real-world settings. Moreover, we highlight the significance of considering input data characteristics to improve the quality of error estimation. Based on our analysis, we propose a systematic data-driven error estimation methodology, DREAMx, for adders composed of cascaded approximate units, which covers a predominant set of LPAAs. DREAMx in principle factors in the dependence between input bits based on the given input distribution to compute the probability mass function (PMF) of error value at the output of an approximate adder. It achieves improved results compared to the state-of-the-art techniques while offering a substantial decrease in the overall execution(/exploration) time compared to exhaustive simulations. 
Our results further show that there exists a delicate tradeoff between the achievable quality of error estimates and the overall execution time.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3348-3357"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large Data Transfer Optimization for Improved Robustness in Real-Time V2X-Communication","authors":"Alex Bendrick;Nora Sperling;Rolf Ernst","doi":"10.1109/TCAD.2024.3436548","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3436548","url":null,"abstract":"Vehicle-to-everything (V2X) roadmaps envision future applications that require the reliable exchange of large sensor data over a wireless network in real time. Applications include sensor fusion for cooperative perception or remote vehicle control that are subject to stringent real-time and safety constraints. Real-time requirements result from end-to-end latency constraints, while reliability refers to the quest for loss-free sensor data transfer to reach maximum application quality. In wireless networks, both requirements are in conflict, because of the need for error correction. Notably, the established video coding standards are not suitable for this task, as demonstrated in experiments. This article shows that middleware-based backward error correction (BEC) in combination with application controlled selective data transmission is far more effective for this purpose. The mechanisms proposed in this article use application and context knowledge to dynamically adapt the data object volume at high error rates at sustained application resilience. We evaluate popular camera datasets and perception pipelines from the automotive domain and apply two complementary strategies. The results and comparisons show that this approach has great benefits, far beyond the state of the art. 
It also shows that there is no single strategy that outperforms the other in all use cases.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3515-3526"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iFKVS: Lightweight Key-Value Store for Flash-Based Intermittently Computing Devices","authors":"Yen-Hsun Chen;Ting-En Liao;Li-Pin Chang","doi":"10.1109/TCAD.2024.3443698","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443698","url":null,"abstract":"Energy harvesting enables long-running sensing applications on tiny Internet of Things (IoT) devices without a battery installed. To overcome the intermittency of ambient energy sources, system software creates intermittent computation using checkpoints. While the scope of intermittent computation is quickly expanding, there is a strong demand for data storage and local data processing in such IoT devices. When considering data storage options, flash memory is more compelling than other types of nonvolatile memory due to its affordability and availability. We introduce iFKVS, a flash-based key-value store for multisensor IoT devices. In this study, we aim at supporting efficient key-value operations while guaranteeing the correctness of program execution across power interruptions. For indexing of multidimensional sensor data, we propose a quadtree-based structure for the minimization of extra writes from splitting and rebalancing; for checkpointing in flash storage, we propose a rollback-based algorithm that exploits the capabilities of byte-level writing and one-way bit flipping of flash memory. 
Experimental results based on a real energy-driven testbed demonstrate that with the same index structure design, our rollback-based approach obtains a significant reduction of 45% and 84% in the total execution time compared with checkpointing using write-ahead logging (WAL) and copying on write (COW), respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3564-3575"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BERN-NN-IBF: Enhancing Neural Network Bound Propagation Through Implicit Bernstein Form and Optimized Tensor Operations","authors":"Wael Fatnassi;Arthur Feeney;Valen Yamamoto;Aparna Chandramowlishwaran;Yasser Shoukry","doi":"10.1109/TCAD.2024.3447577","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447577","url":null,"abstract":"Neural networks have emerged as powerful tools across various domains, exhibiting remarkable empirical performance that motivated their widespread adoption in safety-critical applications, which, in turn, necessitates rigorous formal verification techniques to ensure their reliability and robustness. Tight bound propagation plays a crucial role in the formal verification process by providing precise bounds that can be used to formulate and verify properties, such as safety, robustness, and fairness. While state-of-the-art tools use linear and convex approximations to compute upper/lower bounds for each neuron’s outputs, recent advances have shown that nonlinear approximations based on Bernstein polynomials lead to tighter bounds but suffer from scalability issues. To that end, this article introduces BERN-NN-IBF, a significant enhancement of the Bernstein-polynomial-based bound propagation algorithms. BERN-NN-IBF offers three main contributions: 1) a memory-efficient encoding of Bernstein polynomials to scale the bound propagation algorithms; 2) optimized tensor operations for the new polynomial encoding to maintain the integrity of the bounds while enhancing computational efficiency; and 3) tighter under-approximations of the ReLU activation function using quadratic polynomials tailored to minimize approximation errors. 
Through comprehensive testing, we demonstrate that BERN-NN-IBF achieves tighter bounds and higher computational efficiency compared to the original BERN-NN and state-of-the-art methods, including linear and convex programming used within the winner of the VNN-COMPETITION.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4334-4345"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near-Free Lifetime Extension for 3-D NAND Flash via Opportunistic Self-Healing","authors":"Tianyu Ren;Qiao Li;Yina Lv;Min Ye;Nan Guan;Chun Jason Xue","doi":"10.1109/TCAD.2024.3447225","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447225","url":null,"abstract":"3-D NAND flash memories are the dominant storage media in modern data centers due to their high performance, large storage capacity, and low-power consumption. However, the lifetime of flash memory has decreased as technology scaling advances. Recent work has revealed that the number of achievable program/erase (P/E) cycles of flash blocks is related to the dwell time (DT) between two adjacent erase operations. A longer DT can lead to higher-achievable P/E cycles and, therefore, a longer lifetime for flash memories. This article found that the achievable P/E cycles would increase when flash blocks endure uneven DT distribution. Based on this observation, this article presents an opportunistic self-healing method to extend the lifetime of flash memory. By maintaining two groups with unequal block counts, namely, Active Group and Healing Group, the proposed method creates an imbalance in erase operation distribution. The Active Group undergoes more frequent erase operations, resulting in shorter DT, while the Healing Group experiences longer DT. Periodically, the roles of the two groups are switched based on the Active Group’s partitioning ratio. This role switching ensures that each block experiences both short and long DT periods, leading to an uneven DT distribution that magnifies the self-healing effect. 
The evaluation shows that the proposed method can improve the flash lifetime by 19.3% and 13.2% on average with near-free overheads, compared with the baseline and the related work, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4226-4237"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AttentionRC: A Novel Approach to Improve Locality Sensitive Hashing Attention on Dual-Addressing Memory","authors":"Chun-Lin Chu;Yun-Chih Chen;Wei Cheng;Ing-Chao Lin;Yuan-Hao Chang","doi":"10.1109/TCAD.2024.3447217","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447217","url":null,"abstract":"Attention is a crucial component of the Transformer architecture and a key factor in its success. However, it suffers from quadratic growth in time and space complexity as input sequence length increases. One popular approach to address this issue is the Reformer model, which uses locality-sensitive hashing (LSH) attention to reduce computational complexity. LSH attention hashes similar tokens in the input sequence to the same bucket and attends tokens only within the same bucket. Meanwhile, a new emerging nonvolatile memory (NVM) architecture, row column NVM (RC-NVM), has been proposed to support row- and column-oriented addressing (i.e., dual addressing). In this work, we present AttentionRC, which takes advantage of RC-NVM to further improve the efficiency of LSH attention. We first propose an LSH-friendly data mapping strategy that improves memory write and read cycles by 60.9% and 4.9%, respectively. Then, we propose a sort-free RC-aware bucket access and a swap strategy that utilizes dual-addressing to reduce 38% of the data access cycles in attention. 
Finally, by taking advantage of dual-addressing, we propose transpose-free attention to eliminate the transpose operations that were previously required by the attention, resulting in a 51% reduction in the matrix multiplication time.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3925-3936"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU Performance Optimization via Intergroup Cache Cooperation","authors":"Guosheng Wang;Yajuan Du;Weiming Huang","doi":"10.1109/TCAD.2024.3443707","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443707","url":null,"abstract":"Modern GPUs have integrated multilevel cache hierarchy to provide high bandwidth and mitigate the memory wall problem. However, the benefit of on-chip cache is far from achieving optimal performance. In this article, we investigate existing cache architecture and find that the cache utilization is imbalanced and there exists serious data duplication among L1 cache groups. In order to exploit the duplicate data, we propose an intergroup cache cooperation (ICC) method to establish the cooperation across L1 cache groups. According to the cooperation scope, we design two schemes of the adjacent cache cooperation (ICC-AGC) and the multiple cache cooperation (ICC-MGC). In ICC-AGC, we design an adjacent cooperative directory table to realize the perception of duplicate data and integrate a lightweight network for communication. In ICC-MGC, a ring bi-directional network is designed to realize the connection among multiple groups. 
And we present a two-way sending mechanism and a dynamic sending mechanism to balance the overhead and efficiency involved in request probing and sending. Evaluation results show that the proposed two ICC methods can reduce the average traffic to L2 cache by 10% and 20%, respectively, and improve overall GPU performance by 19% and 49% on average, respectively, compared with the existing work.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4142-4153"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}