{"title":"BERN-NN-IBF: Enhancing Neural Network Bound Propagation Through Implicit Bernstein Form and Optimized Tensor Operations","authors":"Wael Fatnassi;Arthur Feeney;Valen Yamamoto;Aparna Chandramowlishwaran;Yasser Shoukry","doi":"10.1109/TCAD.2024.3447577","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447577","url":null,"abstract":"Neural networks have emerged as powerful tools across various domains, exhibiting remarkable empirical performance that motivated their widespread adoption in safety-critical applications, which, in turn, necessitates rigorous formal verification techniques to ensure their reliability and robustness. Tight bound propagation plays a crucial role in the formal verification process by providing precise bounds that can be used to formulate and verify properties, such as safety, robustness, and fairness. While state-of-the-art tools use linear and convex approximations to compute upper/lower bounds for each neuron’s outputs, recent advances have shown that nonlinear approximations based on Bernstein polynomials lead to tighter bounds but suffer from scalability issues. To that end, this article introduces BERN-NN-IBF, a significant enhancement of the Bernstein-polynomial-based bound propagation algorithms. BERN-NN-IBF offers three main contributions: 1) a memory-efficient encoding of Bernstein polynomials to scale the bound propagation algorithms; 2) optimized tensor operations for the new polynomial encoding to maintain the integrity of the bounds while enhancing computational efficiency; and 3) tighter under-approximations of the ReLU activation function using quadratic polynomials tailored to minimize approximation errors. 
Through comprehensive testing, we demonstrate that BERN-NN-IBF achieves tighter bounds and higher computational efficiency compared to the original BERN-NN and state-of-the-art methods, including linear and convex programming used within the winner of the VNN-COMPETITION.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4334-4345"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near-Free Lifetime Extension for 3-D nand Flash via Opportunistic Self-Healing","authors":"Tianyu Ren;Qiao Li;Yina Lv;Min Ye;Nan Guan;Chun Jason Xue","doi":"10.1109/TCAD.2024.3447225","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447225","url":null,"abstract":"3-D \u0000<sc>nand</small>\u0000 flash memories are the dominant storage media in modern data centers due to their high performance, large storage capacity, and low-power consumption. However, the lifetime of flash memory has decreased as technology scaling advances. Recent work has revealed that the number of achievable program/erase (P/E) cycles of flash blocks is related to the dwell time (DT) between two adjacent erase operations. A longer DT can lead to higher-achievable P/E cycles and, therefore, a longer lifetime for flash memories. This article found that the achievable P/E cycles would increase when flash blocks endure uneven DT distribution. Based on this observation, this article presents an opportunistic self-healing method to extend the lifetime of flash memory. By maintaining two groups with unequal block counts, namely, Active Group and Healing Group, the proposed method creates an imbalance in erase operation distribution. The Active Group undergoes more frequent erase operations, resulting in shorter DT, while the Healing Group experiences longer DT. Periodically, the roles of the two groups are switched based on the Active Group’s partitioning ratio. This role switching ensures that each block experiences both short and long DT periods, leading to an uneven DT distribution that magnifies the self-healing effect. 
The evaluation shows that the proposed method can improve the flash lifetime by 19.3% and 13.2% on average with near-free overheads, compared with the baseline and the related work, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4226-4237"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AttentionRC: A Novel Approach to Improve Locality Sensitive Hashing Attention on Dual-Addressing Memory","authors":"Chun-Lin Chu;Yun-Chih Chen;Wei Cheng;Ing-Chao Lin;Yuan-Hao Chang","doi":"10.1109/TCAD.2024.3447217","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447217","url":null,"abstract":"Attention is a crucial component of the Transformer architecture and a key factor in its success. However, it suffers from quadratic growth in time and space complexity as input sequence length increases. One popular approach to address this issue is the Reformer model, which uses locality-sensitive hashing (LSH) attention to reduce computational complexity. LSH attention hashes similar tokens in the input sequence to the same bucket and attends tokens only within the same bucket. Meanwhile, a new emerging nonvolatile memory (NVM) architecture, row column NVM (RC-NVM), has been proposed to support row- and column-oriented addressing (i.e., dual addressing). In this work, we present AttentionRC, which takes advantage of RC-NVM to further improve the efficiency of LSH attention. We first propose an LSH-friendly data mapping strategy that improves memory write and read cycles by 60.9% and 4.9%, respectively. Then, we propose a sort-free RC-aware bucket access and a swap strategy that utilizes dual-addressing to reduce 38% of the data access cycles in attention. 
Finally, by taking advantage of dual-addressing, we propose transpose-free attention to eliminate the transpose operations that were previously required by the attention, resulting in a 51% reduction in the matrix multiplication time.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3925-3936"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU Performance Optimization via Intergroup Cache Cooperation","authors":"Guosheng Wang;Yajuan Du;Weiming Huang","doi":"10.1109/TCAD.2024.3443707","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443707","url":null,"abstract":"Modern GPUs have integrated multilevel cache hierarchy to provide high bandwidth and mitigate the memory wall problem. However, the benefit of on-chip cache is far from achieving optimal performance. In this article, we investigate existing cache architecture and find that the cache utilization is imbalanced and there exists serious data duplication among L1 cache groups.In order to exploit the duplicate data, we propose an intergroup cache cooperation (ICC) method to establish the cooperation across L1 cache groups. According the cooperation scope, we design two schemes of the adjacent cache cooperation (ICC-AGC) and the multiple cache cooperation (ICC-MGC). In ICC-AGC, we design an adjacent cooperative directory table to realize the perception of duplicate data and integrate a lightweight network for communication. In ICC-MGC, a ring bi-directional network is designed to realize the connection among multiple groups. 
And we present a two-way sending mechanism and a dynamic sending mechanism to balance the overhead and efficiency involved in request probing and sending.Evaluation results show that the proposed two ICC methods can reduce the average traffic to L2 cache by 10% and 20%, respectively, and improve overall GPU performance by 19% and 49% on average, respectively, compared with the existing work.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4142-4153"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CHEF: A Framework for Deploying Heterogeneous Models on Clusters With Heterogeneous FPGAs","authors":"Yue Tang;Yukai Song;Naveena Elango;Sheena Ratnam Priya;Alex K. Jones;Jinjun Xiong;Peipei Zhou;Jingtong Hu","doi":"10.1109/TCAD.2024.3438994","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3438994","url":null,"abstract":"Deep neural networks (DNNs) are rapidly evolving from streamlined single-modality single-task (SMST) to multimodality multitask (MMMT) with large variations for different layers and complex data dependencies among layers. To support such models, hardware systems also evolved to be heterogeneous. The heterogeneous system comes from the prevailing trend to integrate diverse accelerators into the system for lower latency. FPGAs have high-computation density and communication bandwidth and are configurable to be deployed with different designs of accelerators, which are widely used for various machine-learning applications. However, scaling from SMST to MMMT on heterogeneous FPGAs is challenging since MMMT has much larger layer variations, a massive number of layers, and complex data dependency among different backbones. Previous mapping algorithms are either inefficient or over-simplified which makes them impractical in general scenarios. In this work, we propose CHEF to enable efficient implementation of MMMT models in realistic heterogeneous FPGA clusters, i.e., deploying heterogeneous accelerators on heterogeneous FPGAs (A2F) and mapping the heterogeneous DNNs on the deployed heterogeneous accelerators (M2A). We propose CHEF-A2F, a two-stage accelerators-to-FPGAs deployment approach to co-optimize hardware deployment and accelerator mapping. In addition, we propose CHEF-M2A, which can support general and practical cases compared to previous mapping algorithms. To the best of our knowledge, this is the first attempt to implement MMMT models in real heterogeneous FPGA clusters. 
Experimental results show that the latency obtained with CHEF is near-optimal while the search time is 10\u0000<inline-formula> <tex-math>$000times $ </tex-math></inline-formula>\u0000 less than exhaustively searching the optimal solution.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3937-3948"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NeRF-PIM: PIM Hardware-Software Co-Design of Neural Rendering Networks","authors":"Jaeyoung Heo;Sungjoo Yoo","doi":"10.1109/TCAD.2024.3443712","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443712","url":null,"abstract":"Neural radiance field (NeRF) has emerged as a state-of-the-art technique, offering unprecedented realism in rendering. Despite its advancements, the adoption of NeRF is constrained by high computational cost, leading to slow rendering speed. Voxel-based optimization of NeRF addresses this by reducing the computational cost, but it introduces substantial memory overheads. To address this problem, we propose NeRF-PIM, a hardware-software co-design approach. In order to address the problem of the memory accesses to the large model (of the voxel grid) with poor locality and low compute density, we propose exploiting processing-in-memory (PIM) together with PIM-aware software optimizations in terms of the data layout, redundancy removal, and computation reuse. Our PIM hardware aims to accelerate the trilinear interpolation and dot product operations. Specifically, to address the low utilization of internal bandwidth due to the random accesses to the voxels, we propose a data layout that judiciously exploits the characteristics of the interpolation operation on the voxel grid, which helps remove bank conflicts in voxel accesses and also improves the efficiency of PIM command issue by exploiting the all-bank mode in the existing PIM device. As PIM-aware software optimizations, we also propose occupancy-grid-aware pruning and one-voxel two-sampling (1V2S) methods, which contribute to compute the efficiency improvement (by avoiding the redundant computation on the empty space) and memory traffic reduction (by reusing the per-voxel dot product results). We conduct experiments using an actual baseline HBM-PIM device. 
Our NeRF-PIM demonstrates a speedup of 7.4 and \u0000<inline-formula> <tex-math>$5.0times $ </tex-math></inline-formula>\u0000 compared to the baseline on the two datasets, Synthetic-NeRF and Tanks and Temples, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3900-3912"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HuNT: Exploiting Heterogeneous PIM Devices to Design a 3-D Manycore Architecture for DNN Training","authors":"Chukwufumnanya Ogbogu;Gaurav Narang;Biresh Kumar Joardar;Janardhan Rao Doppa;Krishnendu Chakrabarty;Partha Pratim Pande","doi":"10.1109/TCAD.2024.3444708","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3444708","url":null,"abstract":"Processing-in-memory (PIM) architectures have emerged as an attractive computing paradigm for accelerating deep neural network (DNN) training and inferencing. However, a plethora of PIM devices, e.g., resistive random-access memory, ferroelectric field-effect transistor, phase change memory, MRAM, static random-access memory, exists and each of these devices offers advantages and drawbacks in terms of power, latency, area, and nonidealities. A heterogeneous architecture that combines the benefits of multiple devices in a single platform can enable energy-efficient and high-performance DNN training and inference. 3-D integration enables the design of such a heterogeneous architecture where multiple planar tiers consisting of different PIM devices can be integrated into a single platform. In this work, we propose the HuNT framework, which hunts for (finds) an optimal DNN neural layer mapping, and planar tier configurations for a 3-D heterogeneous architecture. Overall, our experimental results demonstrate that the HuNT-enabled 3-D heterogeneous architecture achieves up to \u0000<inline-formula> <tex-math>$10 {times }$ </tex-math></inline-formula>\u0000 and \u0000<inline-formula> <tex-math>$3.5 {times }$ </tex-math></inline-formula>\u0000 improvement with respect to the homogeneous and existing heterogeneous PIM-based architectures, respectively, in terms of energy-efficiency (TOPS/W). 
Similarly, the proposed HuNT-enabled architecture outperforms existing homogeneous and heterogeneous architectures by up to \u0000<inline-formula> <tex-math>$8 {times }$ </tex-math></inline-formula>\u0000 and \u0000<inline-formula> <tex-math>$2.4times $ </tex-math></inline-formula>\u0000, respectively, in terms of compute-efficiency (TOPS/mm2) without compromising the final DNN accuracy.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3300-3311"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Fuzzing of IoT Messaging Protocols Through Collaborative Packet Generation","authors":"Zhengxiong Luo;Junze Yu;Qingpeng Du;Yanyang Zhao;Feifan Wu;Heyuan Shi;Wanli Chang;Yu Jiang","doi":"10.1109/TCAD.2024.3444705","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3444705","url":null,"abstract":"Internet of Things (IoT) messaging protocols play an important role in facilitating communications between users and IoT devices. Mainstream IoT platforms employ brokers, server-side implementations of IoT messaging protocols, to enable and mediate this user-device communication. Due to the complex nature of managing communications among devices with diverse roles and functionalities, comprehensive testing of the protocol brokers necessitates collaborative parallel fuzzing. However, being unaware of the relationship between test packets generated by different parties, existing parallel fuzzing methods fail to explore the brokers’ diverse processing logic effectively. This article introduces MPF\u0000<sc>uzz</small>\u0000, a parallel fuzzing tool designed to secure IoT messaging protocols through collaborative packet generation. The approach leverages the critical role of certain fields within IoT messaging protocols that specify the logic for message forwarding and processing by protocol brokers. MPF\u0000<sc>uzz</small>\u0000 employs an information synchronization mechanism to synchronize these key fields across different fuzzing instances and introduces a semantic-aware refinement module that optimizes generated test packets by utilizing the shared information and field semantics. This strategy facilitates a collaborative refinement of test packets across otherwise isolated fuzzing instances, thereby boosting the efficiency of parallel fuzzing. We evaluated MPF\u0000<sc>uzz</small>\u0000 on six widely used IoT messaging protocol implementations. 
Compared to two state-of-the-art protocol fuzzers with parallel capabilities, Peach and AFLNet, as well as two representative parallel fuzzers, SPFuzz and AFLTeam, MPF\u0000<sc>uzz</small>\u0000 achieves (6.1%, \u0000<inline-formula> <tex-math>$174.5times $ </tex-math></inline-formula>\u0000), (20.2%, \u0000<inline-formula> <tex-math>$607.2times $ </tex-math></inline-formula>\u0000), (1.9%, \u0000<inline-formula> <tex-math>$4.1times $ </tex-math></inline-formula>\u0000), and (17.4%, \u0000<inline-formula> <tex-math>$570.2times $ </tex-math></inline-formula>\u0000) higher branch coverage and fuzzing speed under the same computing resource. Furthermore, MPF\u0000<sc>uzz</small>\u0000 exposed seven previously unknown vulnerabilities in these extensively tested projects, all of which have been assigned with CVE identifiers.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3431-3442"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Memory-Contention Timing Models With Automated Platform Profiling","authors":"Andrea Stevanato;Matteo Zini;Alessandro Biondi;Bruno Morelli;Alessandro Biasci","doi":"10.1109/TCAD.2024.3449237","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3449237","url":null,"abstract":"Commercial off-the-shelf (COTS) multicore platforms are often used to enable the execution of mixed-criticality real-time applications. In these systems, the memory subsystem is one of the most notable sources of interference and unpredictability, with the memory controller (MC) being a key component orchestrating the data flow between processing units and main memory. The worst-case response times of real-time tasks is indeed particularly affected by memory contention and, in turn, by the MC behavior as well. This article presents FrATM2, a Framework to Automatically learn the Timing Models of the Memory subsystem. The framework automatically generates and executes micro-benchmarks on bare-metal hardware to profile the platform behavior in a large number of memory-contention scenarios. After aggregating and filtering the collected measurements, FrATM2 trains MC models to bound memory-related interference. The MC models can be used to enable response-time analysis. 
The framework was evaluated on an AMD/Xilinx Ultrascale+ SoC, collecting gigabytes of raw experimental data by testing tents of thousands of contention scenarios.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3816-3827"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems society information","authors":"","doi":"10.1109/TCAD.2024.3479789","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3479789","url":null,"abstract":"","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"C2-C2"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10745843","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}