Xiaotian Zhao;Ruge Xu;Yimin Gao;Vaibhav Verma;Mircea R. Stan;Xinfei Guo
{"title":"Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing","authors":"Xiaotian Zhao;Ruge Xu;Yimin Gao;Vaibhav Verma;Mircea R. Stan;Xinfei Guo","doi":"10.1109/TC.2024.3441860","DOIUrl":"10.1109/TC.2024.3441860","url":null,"abstract":"As one of the prevailing deep neural networks compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur huge computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units supporting INT2-INT8 and INT16 precisions, which feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized with a 14nm technology, the design delivers a speedup of $15.50\\times$ to $47.67\\times$ over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS performance. This work also achieves an energy efficiency at 20.51 TOPS/W, which not only exceeds contemporary state-of-the-art MPQ hardware solutions at the edge, but also marks a significant advancement in the field. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristics search. 
Evaluation results show that this search algorithm achieves $2.2\\% \\sim 6.7\\%$ higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore we expand the search space using a dynamic programming (DP) strategy to perform search with more fine-grained accuracy intervals and support multi-dimensional search. This further improves the inference accuracy by over 1.3% compared to a greedy-based search.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2504-2519"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kui Ye;Shengxin Dai;Bing Guo;Yan Shen;Chuanjie Liu;Kejun Bi;Fei Chen;Yuchuan Hu;Mingjie Zhao
{"title":"A Mutual-Influence-Aware Heuristic Method for Quantum Circuit Mapping","authors":"Kui Ye;Shengxin Dai;Bing Guo;Yan Shen;Chuanjie Liu;Kejun Bi;Fei Chen;Yuchuan Hu;Mingjie Zhao","doi":"10.1109/TC.2024.3441825","DOIUrl":"10.1109/TC.2024.3441825","url":null,"abstract":"Quantum circuit mapping (QCM) is a crucial preprocessing step for executing a logical circuit (LC) on noisy intermediate-scale quantum (NISQ) devices. Balancing the introduction of extra gates and the efficiency of preprocessing poses a significant challenge for the mapping process. To address this challenge, we propose the mutual-influence-aware (MIA) heuristic method by integrating an initial mapping search framework, an initial mapping generator, and a heuristic circuit mapper. Initially, the framework utilizes the generator to obtain a favorable starting point for the initial mapping search. With this starting point, the search process can efficiently discover a promising initial mapping within a few bidirectional iterations. The circuit mapper considers mutual influences of SWAP gates and is invoked once per iteration. Ultimately, the best result from all iterations is considered the QCM outcome. 
The experimental results on extensive benchmark circuits demonstrate that, compared to the iterated local search (ILS) method, which represents the current state-of-the-art, our MIA method introduces a similar number of extra gates while achieving nearly 95 times faster execution.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2855-2867"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Response-Time Analysis of Bundled Gang Tasks Under Partitioned FP Scheduling","authors":"Veronica Rispo;Federico Aromolo;Daniel Casini;Alessandro Biondi","doi":"10.1109/TC.2024.3441823","DOIUrl":"10.1109/TC.2024.3441823","url":null,"abstract":"The study of parallel task models for real-time systems has become fundamental due to the increasing computational demand of modern applications. Recently, gang scheduling has gained attention for improving performance in tightly synchronized parallel applications. Nevertheless, existing studies often overestimate computational demand by assuming a constant number of cores for each task. In contrast, the bundled model accurately represents internal parallelism by means of a string of segments demanding for a variable number of cores. This model is particularly relevant to modern real-time systems, as it allows transforming general parallel tasks into bundled tasks while preserving accurate parallelism. However, it has only been analyzed for global scheduling, which carries analytical pessimism and considerable run-time overheads. This paper introduces two response-time analysis techniques for parallel real-time tasks under partitioned, fixed-priority gang scheduling under the bundled model, together with a set of specialized allocation heuristics. 
Experimental results compare the proposed methods against state-of-the-art approaches.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2534-2547"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10633880","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Generation and Optimization Framework of NoC-Based Neural Network Accelerator Through Reinforcement Learning","authors":"Yongqi Xue;Jinlun Ji;Xinming Yu;Shize Zhou;Siyue Li;Xinyi Li;Tong Cheng;Shiping Li;Kai Chen;Zhonghai Lu;Li Li;Yuxiang Fu","doi":"10.1109/TC.2024.3441822","DOIUrl":"10.1109/TC.2024.3441822","url":null,"abstract":"Choices of dataflows, which are known as intra-core neural network (NN) computation loop nest scheduling and inter-core hardware mapping strategies, play a critical role in the performance and energy efficiency of NoC-based neural network accelerators. Confronted with an enormous dataflow exploration space, this paper proposes an automatic framework for generating and optimizing the full-layer-mappings based on two reinforcement learning algorithms including A2C and PPO. Combining soft and hard constraints, this work transforms the mapping configuration into a sequential decision problem and aims to explore the performance and energy efficient hardware mapping for NoC systems. We evaluate the performance of the proposed framework on 10 experimental neural networks. The results show that compared with the direct-X mapping, the direct-Y mapping, GA-base mapping, and NN-aware mapping, our optimization framework reduces the average execution time of 10 experimental NNs by 9.09%, improves the throughput by 11.27%, reduces the energy by 12.62%, and reduces the time-energy-product (TEP) by 14.49%. 
The results also show that the performance enhancement is related to the coefficient of variation of the neural network to be computed.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2882-2896"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Container Scheduling With Fast Function Startup and Low Memory Cost in Edge Computing","authors":"Zhenzheng Li;Jiong Lou;Jianfei Wu;Jianxiong Guo;Zhiqing Tang;Ping Shen;Weijia Jia;Wei Zhao","doi":"10.1109/TC.2024.3441836","DOIUrl":"10.1109/TC.2024.3441836","url":null,"abstract":"Extending serverless computing to the edge has emerged as a promising approach to support service, but startup containerized serverless functions lead to the cold-start delay. Recent research has introduced container caching methods to alleviate the cold-start delay, including cache as the entire container or the Zygote container. However, container caching incurs memory costs. The system must ensure fast function startup and low memory cost of edge servers, which has been overlooked in the literature. This paper aims to jointly optimize startup delay and memory cost. We formulate an online joint optimization problem that encompasses container scheduling decisions, including invocation distribution, container startup, and container caching. To solve the problem, we propose an online algorithm with a competitive ratio and low computational complexity. The proposed algorithm decomposes the problem into two subproblems and solves them sequentially. Each container is assigned a randomized strategy, and these container-level decisions are merged to constitute overall container caching decisions. Furthermore, a greedy-based subroutine is designed to solve the subproblem associated with invocation distribution and container startup decisions. 
Experiments on the real-world dataset indicate that the algorithm can reduce average startup delay by up to 23% and lower memory costs by up to 15%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2747-2760"},"PeriodicalIF":3.6,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical Higher-Order Correlation Attacks Against Code-Based Masking","authors":"Wei Cheng;Jingdian Ming;Sylvain Guilley;Jean-Luc Danger","doi":"10.1109/TC.2024.3424208","DOIUrl":"10.1109/TC.2024.3424208","url":null,"abstract":"Masking is one of the most well-established methods to thwart side-channel attacks. Many masking schemes have been proposed in the literature, and code-based masking emerges and unifies several masking schemes in a coding-theoretic framework. In this work, we investigate the side-channel resistance of code-based masking from a non-profiling perspective by utilizing correlation-based side-channel attacks. We present a systematic evaluation of correlation attacks with various higher-order (centered) moments and then present the form of optimal correlation attacks. Interestingly, the Pearson correlation coefficient between the hypothetical leakage and the measured traces is connected to the signal-to-noise ratio in higher-order moments, and it turns out to be easy to evaluate rather than launch repeated attacks. We also identify some ineffective higher-order correlation attacks at certain orders when the device leaks under the Hamming weight leakage model. Our theoretical findings are verified through both simulated and real-world measurements.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 10","pages":"2364-2377"},"PeriodicalIF":3.6,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141570093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Su;Xiaoshuang Xing;Xiaolu Cheng;Rui Guo;Chuanwen Luo
{"title":"LPAH: Illustrating Efficient Live Patching With Alignment Holes in Kernel Data","authors":"Chao Su;Xiaoshuang Xing;Xiaolu Cheng;Rui Guo;Chuanwen Luo","doi":"10.1109/TC.2024.3424263","DOIUrl":"10.1109/TC.2024.3424263","url":null,"abstract":"The Linux kernel is regularly updated to enhance security, improve performance, and introduce new functionalities. Traditional updating methods typically require rebooting, leading to service disruptions and potential data loss. Live-patching technology dynamically updates the kernel modules without rebooting, ensuring continuous service availability. However, this technique has its drawbacks. Since live-patching alters the original structure of data types, it can no longer utilize base offsets to access the members, imposing considerable overheads. This paper proposes LPAH (Live Patching with Alignment Holes), a live patching system that leverages the fragmented space generated by compile-time alignment for data types, to enable effective live patching updates for security vulnerability fixes, feature enhancements, and user-defined patching tasks. LPAH capitalizes on the relationship between these alignment holes and data objects. This approach ensures efficient access to extended data members while preserving the original data's integrity. This approach allows other functions to remain unaffected by updates and replacements through explicit type casts. Extensive experimental results show that LPAH offers valid and robust live patching for multiple real vulnerabilities in the Linux kernel, without degrading performance. 
Our method provides an efficient way to install security patches in the Linux kernel, and thus reenforces kernel security.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 10","pages":"2434-2448"},"PeriodicalIF":3.6,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141570092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thai-Hoang Nguyen;Muhammad Imran;Jaehyuk Choi;Joon-Sung Yang
{"title":"HYDRA: A Hybrid Resistance Drift Resilient Architecture for Phase Change Memory-Based Neural Network Accelerators","authors":"Thai-Hoang Nguyen;Muhammad Imran;Jaehyuk Choi;Joon-Sung Yang","doi":"10.1109/TC.2024.3404096","DOIUrl":"10.1109/TC.2024.3404096","url":null,"abstract":"In-memory Computing (IMC) using Phase Change Memory (PCM) has proven to be effective for efficient processing of Deep Neural Networks (DNNs). However, with the use of multi-level cell PCM (MLC-PCM) in NVMs-based accelerators, errors due to resistance drift in MLC-PCM can severely degrade the DNNs accuracy. In this paper, an analysis of the impact of resistance drift errors on accuracy of MLC-PCM based DNN accelerator shows that the drift errors alone can significantly impact the accuracy. This paper proposes Hydra, which is a hybrid resistance drift resilient architecture for MLC-PCM based DNN accelerators which use IMC for efficient computations. Hydra utilizes Tri-level cell PCM, which has a negligible resistance drift error rate, to store the critical bits of DNNs parameters and MLC-PCM (4-level cell), which has a higher error rate (but offers more storage density), for the non-critical bits. 
Experimental results on various DNN architectures, configurations and datasets show that, with the presence of resistance drift errors in PCM, Hydra can maintain the baseline accuracy of DNNs for up to 1 year (resistance drift is time-dependent), whereas conventional drift tolerance techniques lead to a significant accuracy drop in just a few seconds.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 9","pages":"2123-2135"},"PeriodicalIF":3.6,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141503753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novas: Tackling Online Dynamic Video Analytics With Service Adaptation at Mobile Edge Servers","authors":"Liang Zhang;Hongzi Zhu;Wen Fei;Yunzhe Li;Mingjin Zhang;Jiannong Cao;Minyi Guo","doi":"10.1109/TC.2024.3416675","DOIUrl":"10.1109/TC.2024.3416675","url":null,"abstract":"Video analytics at mobile edge servers offers significant benefits like reduced response time and enhanced privacy. However, guaranteeing various quality-of-service (QoS) requirements of dynamic video analysis requests on heterogeneous edge devices remains challenging. In this paper, we propose a scalable online video analytics scheme, called Novas, which automatically makes precise service configuration adjustments upon constant video content changes. Specifically, Novas leverages the filtered confidence sum and a two-window t-test to online detect accuracy fluctuations without ground truth information. In such cases, Novas efficiently estimates the performance of all potential service configurations through a singular value decomposition (SVD)-based collaborative filtering method. Finally, given the NP-hardness of the optimal scheduling problem, a heuristic scheduling strategy that maximizes the minimum remaining resources is devised to schedule the most suitable configurations to servers for execution. We evaluate the effectiveness of Novas through extensive hybrid experiments conducted on a dedicated testbed. Results show that Novas can achieve a substantial over 27$\\times$ improvement in satisfying the accuracy requirements compared with existing methods adopting fixed configurations, while ensuring latency requirements. 
Moreover, Novas improves the goodput of the system by an average of 37.86% compared to existing state-of-the-art scheduling solutions.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 9","pages":"2220-2232"},"PeriodicalIF":3.6,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141939839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Unified and Fully Automated Framework for Wavelet-Based Attacks on Random Delay","authors":"Qianmei Wu;Fan Zhang;Shize Guo;Kun Yang;Haoting Shen","doi":"10.1109/TC.2024.3416682","DOIUrl":"10.1109/TC.2024.3416682","url":null,"abstract":"As a common defense against side-channel attacks, random delay insertion introduces noise into the executive flow of encryption, which increases attack complexity. Accordingly, various techniques are exploited to mitigate the defense effect of such insertions. As an advanced mathematical technique, wavelet analysis is considered to be a more effective technology according to its detailed and comprehensive interpretation of signals. In this paper, we propose a unified and fully automated wavelet-based attack framework (denoted as UWAF), whose data processing is kept within one unified wavelet domain, with three enhanced components: denoising, alignment and key extraction. We put forward a new idea of combining machine learning with wavelet analysis to realize the full automation of the program for attack framework, rendering it possible to search exhaustively for the optimal combination of parameter settings in wavelet transform. Our proposal finds a new setting of wavelet parameters that have not been exploited ever before and achieves the performance enhancement for about 20 times fewer traces required for successful key recovery. UWAF is compared with several mainstream attack frameworks. 
Experimental results show that it outperforms those counterparts, and can be considered as an effective framework-level solution to defeat the countermeasure of random delay insertion.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 9","pages":"2206-2219"},"PeriodicalIF":3.6,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141939831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}