{"title":"FreePrune: An Automatic Pruning Framework Across Various Granularities Based on Training-Free Evaluation","authors":"Miao Tang;Ning Liu;Tao Yang;Haining Fang;Qiu Lin;Yujuan Tan;Xianzhang Chen;Duo Liu;Kan Zhong;Ao Ren","doi":"10.1109/TCAD.2024.3443694","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443694","url":null,"abstract":"Network pruning is an effective technique that reduces the computational costs of networks while maintaining accuracy. However, pruning requires expert knowledge and hyperparameter tuning, such as determining the pruning rate for each layer. Automatic pruning methods address this challenge by proposing an effective training-free metric to quickly evaluate the pruned network without fine-tuning. However, most existing automatic pruning methods only investigate a certain pruning granularity, and it remains unclear whether metrics benefit automatic pruning at different granularities. Neural architecture search also studies training-free metrics to accelerate network generation. Nevertheless, whether they apply to pruning needs further investigation. In this study, we first systematically analyze various advanced training-free metrics for various granularities in pruning, and then we investigate the correlation between the training-free metric score and the after-fine-tuned model accuracy. Based on the analysis, we proposed FreePrune score, a more general metric compatible with all pruning granularities. Aiming at generating high-quality pruned networks and unleashing the power of FreePrune score, we further propose FreePrune, an automatic framework that can rapidly generate and evaluate the candidate networks, leading to a final pruned network with both high accuracy and pruning rate. 
Experiments show that our method achieves high correlation on various pruning granularities and comprehensively improves the accuracy.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4033-4044"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing Neural Networks on Nonvolatile FPGAs With Reprogramming","authors":"Hao Zhang;Jian Zuo;Huichuan Zheng;Sijia Liu;Meihan Luo;Mengying Zhao","doi":"10.1109/TCAD.2024.3443708","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443708","url":null,"abstract":"NV-FPGAs have attracted significant attention in research due to their high density, low leakage power, and reduced error rates. The nonvolatile memory (NVM) crossbar’s compute-in-memory (CiM) capability further enables NV-FPGAs to execute high-efficiency, high-throughput neural network (NN) inference tasks. However, with the rapid increase in network size and considering that the parameter size often exceeds the memory capacity of the field programmable gate array (FPGA), implementing the entire network on a single FPGA chip becomes impractical. In this article, we utilize FPGA’s inherent run time reprogramming feature to implement oversized NNs on NV-FPGAs. This approach splits NN models into multiple tasks for the cyclical execution. Specifically, we propose a performance-driven task adapter (PD-Adapter), which aims to achieve high-performance NN inference by employing the task deployment to optimize settings, such as processing element size and quantity, and the task switching to select the most suitable switching type for each task. We integrate the proposed PD-Adapter into an open-source toolchain and evaluate it. 
Experimental results demonstrate that the PD-Adapter can achieve a run time reduction of 85.37% and 76.12% compared to the baseline and execution-time-first policy, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3961-3972"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Indoor–Outdoor Energy Management for Wearable IoT Devices With Conformal Prediction and Rollout","authors":"Nuzhat Yamin;Ganapati Bhat","doi":"10.1109/TCAD.2024.3448382","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3448382","url":null,"abstract":"Internet of Things (IoT) devices have the potential to enable a wide range of applications, including smart health and agriculture. However, they are limited by their small battery capacities. Utilizing energy harvesting is a promising approach to augment the battery life of IoT devices. However, relying solely on harvested energy is insufficient due to the stochastic nature of ambient sources. Predicting and accounting for uncertainty in the energy harvest (EH) is critical for optimal energy management (EM) in wearable IoT devices. This article proposes a two-step uncertainty-aware EH prediction and management framework for wearable IoT devices. First, the framework employs an energy-efficient conformal prediction (CP) method to predict future EH and construct prediction intervals. Contrasting to prior CP approaches, we propose constructing the prediction intervals using a combination of residuals from previous hours and days. Second, the framework proposes a near-optimal EM approach that utilizes a rollout algorithm. The rollout algorithm efficiently simulates various energy allocation trajectories as a function of predicted EH bounds. Using results from the rollout, the proposed approach constructs energy allocation bounds that maximize application utility (quality of service) with a high probability. Evaluations using real-world energy data from ARAS and Mannheim datasets show that the proposed CP for EH prediction provides 93% coverage probability with an average width of 9.5 J and 1.9 J, respectively. 
Moreover, EM using the rollout algorithm provides energy allocation decisions that are within 1.9–2.9 J of the optimal with minimal overhead.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3370-3381"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Meta-Scanner: Detecting Fault Attacks via Scanning FPGA Designs Metadata","authors":"Hassan Nassar;Jonas Krautter;Lars Bauer;Dennis Gnad;Mehdi Tahoori;Jörg Henkel","doi":"10.1109/TCAD.2024.3443769","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443769","url":null,"abstract":"With the rise of the big data, processing in the cloud has become more significant. One method of accelerating applications in the cloud is to use field programmable gate arrays (FPGAs) to provide the needed acceleration for the user-specific applications. Multitenant FPGAs are a solution to increase efficiency. In this case, multiple cloud users upload their accelerator designs to the same FPGA fabric to use them in the cloud. However, multitenant FPGAs are vulnerable to low-level denial-of-service attacks that induce excessive voltage drops using the legitimate configurations. Through such attacks, the availability of the cloud resources to the nonmalicious tenants can be hugely impacted, leading to downtime and thus financial losses to the cloud service provider. In this article, we propose a tool for the offline classification to identify which FPGA designs can be malicious during operation by analysing the metadata of the bitstream generation step. We generate and test 475 FPGA designs that include 38% malicious designs. We identify and extract five relevant features out of the metadata provided from the bitstream generation step. Using ten-fold cross-validation to train a random forest classifier, we achieve an average accuracy of 97.9%. 
This significantly surpasses the conservative comparison with the state-of-the-art approaches, which stands at 84.0%, as our approach detects stealthy attacks undetectable by the existing methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3443-3454"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"D-Linker: Debloating Shared Libraries by Relinking From Object Files","authors":"Jiatai He;Pengpeng Hou;Jiageng Yu;Ji Qi;Ying Sun;Lijuan Li;Ruilin Zhao;Yanjun Wu","doi":"10.1109/TCAD.2024.3446712","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3446712","url":null,"abstract":"Shared libraries are widely used in software development to execute third-party functions. However, the size and complexity of shared libraries tend to increase with the need to support more features, resulting in bloated shared libraries. This leads to resource waste and security issues as a significant amount of generic functionality is included unnecessarily in most scenarios, especially in embedded systems. To address this issue, previous works attempt to debloat shared libraries through binary rewriting or recompilation. However, these works face a tradeoff between flexibility in usage (needs recompilation and runtime support) and the effectiveness of debloating (binary rewriting achieves insufficient file size reduction). We propose D-Linker, a tool that debloats shared libraries by reducing both code and data sections in link-time at the object level without recompilation. Our key insight is that object-level shared library debloating is especially suitable for embedded systems because it strikes a balance of flexibility and efficiency. D-Linker identifies the required ELF object files of the shared libraries in an application and relinks them to produce a debloated shared library with better-debloating effectiveness by avoiding the data reference analysis. Our approach achieves over 70% of gadgets reduction as a security benefit and an average size reduction of 49.6% for a stripped libc of coreutils. 
The results also indicate that D-Linker improves debloating effectiveness by approximately 30% compared to binary-level shared library debloating and incurs a 5% decrease in code gadgets reduction compared to source-code-level shared library debloating.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3768-3779"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and Analysis of the LatestTime Message Synchronization Policy in ROS","authors":"Chenhao Wu;Ruoxiang Li;Naijun Zhan;Nan Guan","doi":"10.1109/TCAD.2024.3446709","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3446709","url":null,"abstract":"Sensor fusion plays a critical role in modern robotics and autonomous systems. In reality, the sensor data destined for the fusion algorithm may have substantially different sampling times. Without proper management, this could lead to poor sensor fusion quality. Robot operating system (ROS) is the most popular robotic software framework, providing essential mechanisms for synchronizing messages to mitigate timing inconsistencies during sensor fusion. Recently, ROS introduced a new LatestTime message synchronization policy. In this article, we formally model the behavior of the LatestTime policy and analyze its worst-case real-time performance. Our investigation uncovers a defect of the LatestTime policy that may cause infinite latency in publishing subsequent outputs. We propose a solution to address this defect and develop safe and tight upper bounds on worst-case real-time performance, in terms of both the maximal temporal inconsistency of its outputs and the incurred latency. Experiments are conducted to evaluate the precision, safety and robustness of our theoretical results.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3576-3587"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Time-Triggered Scheduling for Nonpreemptive Real-Time DAG Tasks Using 1-Opt Local Search","authors":"Sen Wang;Dong Li;Shao-Yu Huang;Xuanliang Deng;Ashrarul H. Sifat;Jia-Bin Huang;Changhee Jung;Ryan Williams;Haibo Zeng","doi":"10.1109/TCAD.2024.3442985","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3442985","url":null,"abstract":"Modern real-time systems often involve numerous computational tasks characterized by intricate dependency relationships. Within these systems, data propagate through cause–effect chains from one task to another, making it imperative to minimize end-to-end latency to ensure system safety and reliability. In this article, we introduce innovative nonpreemptive scheduling techniques designed to reduce the worst-case end-to-end latency and/or time disparity for task sets modeled with directed acyclic graphs (DAGs). This is challenging because of the noncontinuous and nonconvex characteristics of the objective functions, hindering the direct application of standard optimization frameworks. Customized optimization frameworks aiming at achieving optimal solutions may suffer from scalability issues, while general heuristic algorithms often lack theoretical performance guarantees. To address this challenge, we incorporate the “1-opt” concept from the optimization literature (Essentially, 1-opt means that the quality of a solution cannot be improved if only one single variable can be changed) into the design of our algorithm. We propose a novel optimization algorithm that effectively balances the tradeoff between theoretical guarantees and algorithm scalability. By demonstrating its theoretical performance guarantees, we establish that the algorithm produces 1-opt solutions while maintaining polynomial run-time complexity. 
Through extensive large-scale experiments, we demonstrate that our algorithm can effectively reduce the latency metrics by 20% to 40%, compared to state-of-the-art methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3650-3661"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Page Type-Aware Full-Sequence Program Scheduling via Reinforcement Learning in High Density SSDs","authors":"Jun Li;Zhigang Cai;Balazs Gerofi;Yutaka Ishikawa;Jianwei Liao","doi":"10.1109/TCAD.2024.3444718","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3444718","url":null,"abstract":"Full-sequence program (FSP) can program multiple bits simultaneously, and thus complete a multiple-page write at one time for naturally enhancing write performance of high density 3-D solid-state drives (SSDs). This article proposes an FSP scheduling approach for the 3-D quad-level cell (QLC) SSDs, to further boost their read responsiveness. Considering each FSP operation in QLC SSDs spans four different types of QLC pages having dissimilar read latency, we introduce matching four pages of application data to the suited QLC pages and flush them together with the one-shot program of FSP. To this end, we employ reinforcement learning to classify the (cached) application data into four categories on the basis of their historical access frequency and the associating request size. Thus, the frequently read data can be mapped to the QLC pages having less access latency, meanwhile the other data can be flushed onto the slow QLC pages. Then, we can group four different categories of data pages and flush them together into a four-page unit of 3-D QLC SSDs with an FSP operation. In addition, a proactive rewrite method is also triggered for grouping the hot read data with the cached data to form an FSP unit. 
Through a series of emulation tests on several realistic disk traces, we show that the proposed mechanisms yields notable performance improvement on the read responsiveness.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3696-3707"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEAR: Graph-Evolving Aware Data Arranger to Enhance the Performance of Traversing Evolving Graphs on SCM","authors":"Wen-Yi Wang;Chun-Feng Wu;Yun-Chih Chen;Tei-Wei Kuo;Yuan-Hao Chang","doi":"10.1109/TCAD.2024.3447222","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3447222","url":null,"abstract":"In the era of big data, social network services continuously modify social connections, leading to dynamic and evolving graph data structures. These evolving graphs, vital for representing social relationships, pose significant memory challenges as they grow over time. To address this, storage-class-memory (SCM) emerges as a cost-effective solution alongside DRAM. However, contemporary graph evolution processes often scatter neighboring vertices across multiple pages, causing weak graph spatial locality and high-TLB misses during traversals. This article introduces SCM-Based graph-evolving aware data arranger (GEAR), a joint management middleware optimizing data arrangement on SCMs to enhance graph traversal efficiency. SCM-based GEAR comprises multilevel page allocation, locality-aware data placement, and dual-granularity wear leveling techniques. Multilevel page allocation prevents scattering of neighbor vertices relying on managing each page in a finer-granularity, while locality-aware data placement reserves space for future updates, maintaining strong graph spatial locality. The dual-granularity wear leveler evenly distributes updates across SCM pages with considering graph traversing characteristics. 
Evaluation results demonstrate SCM-based GEAR’s superiority, achieving 23% to 70% reduction in traversal time compared to state-of-the-art frameworks.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3674-3684"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OPIMA: Optical Processing-in-Memory for Convolutional Neural Network Acceleration","authors":"Febin Sunny;Amin Shafiee;Abhishek Balasubramaniam;Mahdi Nikdast;Sudeep Pasricha","doi":"10.1109/TCAD.2024.3446870","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3446870","url":null,"abstract":"Recent advances in machine learning (ML) have spotlighted the pressing need for computing architectures that bridge the gap between memory bandwidth and processing power. The advent of deep neural networks has pushed traditional Von Neumann architectures to their limits due to the high latency and energy consumption costs associated with data movement between the processor and memory for these workloads. One of the solutions to overcome this bottleneck is to perform computation within the main memory through processing-in-memory (PIM), thereby limiting data movement and the costs associated with it. However, dynamic random-access memory-based PIM struggles to achieve high throughput and energy efficiency due to internal data movement bottlenecks and the need for frequent refresh operations. In this work, we introduce OPIMA, a PIM-based ML accelerator, architected within an optical main memory. OPIMA has been designed to leverage the inherent massive parallelism within main memory while performing high-speed, low-energy optical computation to accelerate ML models based on convolutional neural networks. We present a comprehensive analysis of OPIMA to guide design choices and operational mechanisms. In addition, we evaluate the performance and energy consumption of OPIMA, comparing it with conventional electronic computing systems and emerging photonic PIM architectures. 
The experimental results show that OPIMA can achieve $2.98\\times$ higher throughput and $137\\times$ better energy efficiency than the best known prior work.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3888-3899"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}