{"title":"Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving","authors":"Xin You;Hailong Yang;Siqi Wang;Tao Peng;Chen Ding;Xinyuan Li;Bangduo Chen;Zhongzhi Luan;Tongxuan Liu;Yong Li;Depei Qian","doi":"10.1109/TC.2024.3449749","DOIUrl":"10.1109/TC.2024.3449749","url":null,"abstract":"Recommendation serving with deep learning models is one of the most valuable services of modern E-commerce companies. In production, to accommodate billions of recommendation queries with stringent service level agreements, high-performant recommendation serving systems play an essential role in meeting such daunting demand. Unfortunately, existing model serving frameworks fail to achieve efficient serving due to unique challenges such as 1) the input format mismatch between service needs and the model's ability and 2) heavy software contentions to concurrently execute the constrained operations. To address the above challenges, we propose <i>RecServe</i>, a high-performant serving system for recommendation with the optimized design of <i>structured features</i> and <i>SessionGroups</i> for recommendation serving. With <i>structured features</i>, <i>RecServe</i> packs single-user-multiple-candidates inputs by semi-automatically transforming computation graphs with annotated input tensors, which can significantly reduce redundant network transmission, data movements, and useless computations. With <i>session group</i>, <i>RecServe</i> further adopts resource isolations for multiple compute streams and cost-aware operator scheduler with critical-path-based schedule policy to enable concurrent kernel execution, further improving serving throughput. 
The experiment results demonstrate that <i>RecServe</i> can achieve maximum performance speedups of 12.3<inline-formula><tex-math>$\\boldsymbol{\\times}$</tex-math></inline-formula> and <inline-formula><tex-math>$22.0\\boldsymbol{\\times}$</tex-math></inline-formula> compared to the state-of-the-art serving system on CPU and GPU platforms, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2474-2487"},"PeriodicalIF":3.6,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Falic: An FPGA-Based Multi-Scalar Multiplication Accelerator for Zero-Knowledge Proof","authors":"Yongkui Yang;Zhenyan Lu;Jingwei Zeng;Xingguo Liu;Xuehai Qian;Zhibin Yu","doi":"10.1109/TC.2024.3449121","DOIUrl":"10.1109/TC.2024.3449121","url":null,"abstract":"In this paper, we propose Falic, a novel FPGA-based accelerator to accelerate multi-scalar multiplication (MSM), the most time-consuming phase of zk-SNARK proof generation. Falic innovates three techniques. First, it leverages globally asynchronous locally synchronous (GALS) strategy to build multiple small and lightweight MSM cores to parallelize the independent inner product computation on different portions of the scalar vector and point vector. Second, each MSM core contains just one large-integer modular multiplier (LIMM) that is multiplexed to perform the point additions (PADDs) generated during MSM. We strike a balance between the throughput and hardware cost by batching the appropriate number of PADDs and selecting the computation graph of PADD with proper parallelism degree. Finally, the performance is further improved by a simple cache structure that enables the computation reuse. We implement Falic on two different FPGAs with different hardware resources, i.e., the Xilinx U200 and Xilinx U250. Compared to the prior FPGA-based accelerator, Falic improves the MSM throughput by <inline-formula><tex-math>$3.9\\boldsymbol{\\times}$</tex-math></inline-formula>. 
Experimental results also show that Falic achieves a throughput speedup of up to <inline-formula><tex-math>$1.62\\boldsymbol{\\times}$</tex-math></inline-formula> and saves as much as <inline-formula><tex-math>$8.5\\boldsymbol{\\times}$</tex-math></inline-formula> energy compared to an RTX 2080Ti GPU.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2791-2804"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HGNAS: Hardware-Aware Graph Neural Architecture Search for Edge Devices","authors":"Ao Zhou;Jianlei Yang;Yingjie Qi;Tong Qiao;Yumeng Shi;Cenlin Duan;Weisheng Zhao;Chunming Hu","doi":"10.1109/TC.2024.3449108","DOIUrl":"10.1109/TC.2024.3449108","url":null,"abstract":"Graph Neural Networks (GNNs) are becoming increasingly popular for graph-based learning tasks such as point cloud processing due to their state-of-the-art (SOTA) performance. Nevertheless, the research community has primarily focused on improving model expressiveness, lacking consideration of how to design efficient GNN models for edge scenarios with real-time requirements and limited resources. Examining existing GNN models reveals varied execution across platforms and frequent Out-Of-Memory (OOM) problems, highlighting the need for hardware-aware GNN design. To address this challenge, this work proposes a novel hardware-aware graph neural architecture search framework tailored for resource constraint edge devices, namely HGNAS. To achieve hardware awareness, HGNAS integrates an efficient GNN hardware performance predictor that evaluates the latency and peak memory usage of GNNs in milliseconds. Meanwhile, we study GNN memory usage during inference and offer a peak memory estimation method, enhancing the robustness of architecture evaluations when combined with predictor outcomes. Furthermore, HGNAS constructs a fine-grained design space to enable the exploration of extreme performance architectures by decoupling the GNN paradigm. In addition, the multi-stage hierarchical search strategy is leveraged to facilitate the navigation of huge candidates, which can reduce the single search time to a few GPU hours. To the best of our knowledge, HGNAS is the first automated GNN design framework for edge devices, and also the first work to achieve hardware awareness of GNNs across different platforms. 
Extensive experiments across various applications and edge devices have proven the superiority of HGNAS. It can achieve up to a <inline-formula><tex-math>$10.6\\boldsymbol{\\times}$</tex-math></inline-formula> speedup and an <inline-formula><tex-math>$82.5\\%$</tex-math></inline-formula> peak memory reduction with negligible accuracy loss compared to DGCNN on ModelNet40.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2693-2707"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Efficient Deep Learning on MCU With Transient Redundancy Elimination","authors":"Jiesong Liu;Feng Zhang;Jiawei Guan;Hsin-Hsuan Sung;Xiaoguang Guo;Saiqin Long;Xiaoyong Du;Xipeng Shen","doi":"10.1109/TC.2024.3449102","DOIUrl":"10.1109/TC.2024.3449102","url":null,"abstract":"Deploying deep neural networks (DNNs) with satisfactory performance in resource-constrained environments is challenging. This is especially true of microcontrollers due to their tight space and computational capabilities. However, there is a growing demand for DNNs on microcontrollers, as executing large DNNs on microcontrollers is critical to reducing energy consumption, increasing performance efficiency, and eliminating privacy concerns. This paper presents a novel and systematic data redundancy elimination method to implement efficient DNNs on microcontrollers through innovations in computation and space optimization. By making the optimization itself a trainable component in the target neural networks, this method maximizes performance benefits while keeping the DNN accuracy stable. Experiments are performed on two microcontroller boards with three popular DNNs, namely CifarNet, ZfNet and SqueezeNet. 
Experiments show that this solution eliminates more than 96% of computations in DNNs and makes them fit well on microcontrollers, yielding 3.4-5<inline-formula><tex-math>$\\times$</tex-math></inline-formula> speedup with little loss of accuracy.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2649-2663"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BiRD: Bi-Directional Input Reuse Dataflow for Enhancing Depthwise Convolution Performance on Systolic Arrays","authors":"Mingeon Park;Seokjin Hwang;Hyungmin Cho","doi":"10.1109/TC.2024.3449103","DOIUrl":"10.1109/TC.2024.3449103","url":null,"abstract":"Depthwise convolution (DWConv) is an effective technique for reducing the size and computational requirements of convolutional neural networks. However, DWConv's input reuse pattern is not easily transformed into dense matrix multiplications, leading to low utilization of processing elements (PEs) on existing systolic arrays. In this paper, we introduce a novel systolic array dataflow mechanism called <i>BiRD</i>, designed to maximize input reuse and boost DWConv performance. BiRD utilizes two directions of input reuse and necessitates only minor modifications to a typical weight-stationary type systolic array. We evaluate BiRD on the Gemmini platform, comparing it with existing dataflow types. The results demonstrate that BiRD achieves significant performance improvements in computation time reduction, while incurring minimal area overhead and improved energy consumption compared to other dataflow types. For example, on a 32<inline-formula><tex-math>$\\times{}$</tex-math></inline-formula>32 systolic array, it results in a 9.8% area overhead, significantly smaller than other dataflow types for DWConv. Compared to matrix multiplication-based DWConv, BiRD achieves a 4.7<inline-formula><tex-math>$\\times{}$</tex-math></inline-formula> performance improvement for DWConv layers of MobileNet-V2, resulting in a 55.8% reduction in total inference computation time and a 44.9% reduction in energy consumption. 
Our results highlight the effectiveness of BiRD in enhancing the performance of DWConv on systolic arrays.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2708-2721"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks","authors":"Beatrice Alessandra Motetti;Matteo Risso;Alessio Burrello;Enrico Macii;Massimo Poncino;Daniele Jahier Pagliari","doi":"10.1109/TC.2024.3449084","DOIUrl":"10.1109/TC.2024.3449084","url":null,"abstract":"The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices. Common approaches to address this issue are pruning and mixed-precision quantization, which lead to latency and memory occupation improvements. These optimization techniques are usually applied independently. We propose a novel methodology to apply them jointly via a lightweight gradient-based search, and in a hardware-aware manner, greatly reducing the time required to generate Pareto-optimal DNNs in terms of accuracy versus cost (i.e., latency or memory). We test our approach on three edge-relevant benchmarks, namely CIFAR-10, Google Speech Commands, and Tiny ImageNet. When targeting the optimization of the memory footprint, we are able to achieve a size reduction of 47.50% and 69.54% at iso-accuracy with the baseline networks with all weights quantized at 8 and 2-bit, respectively. Our method surpasses a previous state-of-the-art approach with up to 56.17% size reduction at iso-accuracy. With respect to the sequential application of state-of-the-art pruning and mixed-precision optimizations, we obtain comparable or superior results, but with a significantly lowered training time. 
In addition, we show how well-tailored cost models can improve the cost versus accuracy trade-offs when targeting specific hardware for deployment.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2619-2633"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Quantum Circuit of AES With Interlacing-Uncompute Structure","authors":"Mengyuan Zhang;Tairong Shi;Wenling Wu;Han Sui","doi":"10.1109/TC.2024.3449094","DOIUrl":"10.1109/TC.2024.3449094","url":null,"abstract":"In the post-quantum era, the security level of encryption algorithms is often evaluated based on the quantum resources required to attack AES. In this work, we make thorough estimations on various performance metrics of the quantum circuit of AES-128/192/256. Firstly, we introduce a generic round structure for in-place implementation of the AES algorithm, maximizing the parallelism between nonlinear components. Specifically, when employed as an encryption oracle, our structure reduces the <inline-formula><tex-math>$T$</tex-math></inline-formula>-depth from <inline-formula><tex-math>$2rd$</tex-math></inline-formula> to <inline-formula><tex-math>$(r+1)d$</tex-math></inline-formula>. Furthermore, by leveraging the properties of block-cyclic matrices, we present an in-place implementation circuit for MixColumn with depth 10, utilizing 105 CNOT gates. In relation to the S-box, we have assessed its minimum circuit width at different <inline-formula><tex-math>$T$</tex-math></inline-formula>-depths and provide multiple versions of circuit implementations for a depth-width trade-off. 
Finally, based on our optimized S-box circuit, we conduct a comprehensive analysis of the implementation complexity of different round structures, where our structure exhibits significant advantages in terms of low <inline-formula><tex-math>$T$</tex-math></inline-formula>-depth.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2563-2575"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SCARF: Securing Chips With a Robust Framework Against Fabrication-Time Hardware Trojans","authors":"Mohammad Eslami;Tara Ghasempouri;Samuel Pagliarini","doi":"10.1109/TC.2024.3449082","DOIUrl":"10.1109/TC.2024.3449082","url":null,"abstract":"The globalization of the semiconductor industry has introduced security challenges to Integrated Circuits (ICs), particularly those related to the threat of Hardware Trojans (HTs) – malicious logic that can be introduced during IC fabrication. While significant efforts are directed towards verifying the correctness and reliability of ICs, their security is often overlooked. In this paper, we propose a comprehensive framework that integrates a suite of methodologies for both front-end and back-end stages of design, aimed at enhancing the security of ICs. Initially, we outline a systematic methodology to transform existing verification assets into potent security checkers by repurposing verification assertions. To further improve security, we introduce an innovative methodology for integrating online monitors during physical synthesis – a back-end insertion providing an additional layer of defense. Experimental results demonstrate a significant increase in security, measured by our introduced metric, Security Coverage (SC), with a marginal rise in area and power consumption, typically under 20%. The insertion of online monitors during physical synthesis enhances security metrics by up to 33.5%. 
This holistic framework offers a comprehensive defense mechanism across the entire spectrum of IC design.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2761-2775"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-Key Attack on Full-Round Shadow Designed for IoT Nodes","authors":"Yuhan Zhang;Wenling Wu;Lei Zhang;Yafei Zheng","doi":"10.1109/TC.2024.3449040","DOIUrl":"10.1109/TC.2024.3449040","url":null,"abstract":"With the rapid advancement of the Internet of Things (IoT), many innovative lightweight block ciphers have been introduced to meet the stringent security demands of IoT devices. Among these, the Shadow cipher stands out for its compactness, making it particularly well-suited for deployment in resource-constrained IoT nodes (IEEE Internet of Things Journal, 2021). This paper demonstrates two real-time attacks on Shadow for the first time: real-time plaintext recovery and key recovery. Firstly, numerous properties of Shadow are discussed, illustrating an equivalent representation of the two-round Shadow and the relationship between the round keys. Secondly, we introduce multiple two-round iterative linear approximations. Employing these approximations enables the derivation of full-round linear distinguishers. Moreover, we have uncovered numerous linear relationships between plaintext and ciphertext. Real-time plaintext recovery is achievable based on these established relationships. On average, it takes 5 seconds to recover the plaintext for a fixed ciphertext of Shadow-32. Thirdly, many properties of the propagation of difference through SIMON-like function are illustrated. According to these properties, various differential distinguishers up to full rounds are presented, allowing real-time key recovery. Specifically, the 64-bit master key of Shadow-32 can be retrieved in around two days on average. 
Experiments verify all our results.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2776-2790"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ROLoad-PMP: Securing Sensitive Operations for Kernels and Bare-Metal Firmware","authors":"Wende Tan;Chenyang Li;Yangyu Chen;Yuan Li;Chao Zhang;Jianping Wu","doi":"10.1109/TC.2024.3449105","DOIUrl":"10.1109/TC.2024.3449105","url":null,"abstract":"A common way for attackers to compromise victim systems is hijacking sensitive operations (e.g., control-flow transfers) with attacker-controlled inputs. Existing solutions in general only protect parts of these targets and have high performance overheads, which are impractical and hard to deploy on systems with limited resources (e.g., IoT devices) or for low-level software like kernels and bare-metal firmware. In this paper, we present a lightweight hardware-software co-design solution ROLoad-PMP to protect sensitive operations from being hijacked for low-level software. First, we propose new instructions, which only load data from read-only memory regions with specific keys, to guarantee the integrity of pointees pointed by (potentially corrupted) data pointers. Then, we provide a program hardening mechanism to protect sensitive operations, by classifying and placing their operands into read-only memory with different keys at compile-time and loading them with ROLoad-PMP-family instructions at runtime. We have implemented an FPGA-based prototype of ROLoad-PMP based on RISC-V, and demonstrated an important defense application, i.e., forward-edge control-flow integrity. Results showed that ROLoad-PMP only costs few extra hardware resources (<inline-formula><tex-math>$< 1.40\\%$</tex-math></inline-formula>). 
Moreover, it enables many lightweight (e.g., with negligible overheads <inline-formula><tex-math>$< 0.853\\%$</tex-math></inline-formula>) defenses, and provides broader and stronger security guarantees than existing hardware solutions, e.g., ARM BTI and Intel CET.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 12","pages":"2722-2733"},"PeriodicalIF":3.6,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}