{"title":"Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators","authors":"Arne Symons;Linyan Mei;Steven Colleman;Pouya Houshmand;Sebastian Karl;Marian Verhelst","doi":"10.1109/TC.2024.3477938","DOIUrl":"https://doi.org/10.1109/TC.2024.3477938","url":null,"abstract":"As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far, these systems exploit the increased parallelism by coarsely mapping a single layer at a time across cores, which incurs frequent costly off-chip memory accesses, or by pipelining batches of inputs, which falls short of meeting the demands of latency-critical applications. To alleviate these bottlenecks, this work explores a new fine-grain mapping paradigm, referred to as layer fusion, on heterogeneous dataflow accelerators through a novel design space exploration framework called Stream. Stream captures a wide variety of heterogeneous dataflow architectures and mapping granularities, and implements a memory- and communication-aware latency and energy analysis validated with three distinct state-of-the-art hardware implementations. As such, it facilitates a holistic exploration of architecture and mapping by strategically allocating the workload through constraint optimization. The findings demonstrate that the integration of layer fusion with heterogeneous dataflow accelerators yields up to 2.2× lower energy-delay product in inference efficiency, addressing both energy consumption and latency concerns. The framework is available open-source at: github.com/kuleuven-micas/stream.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"237-249"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
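To make the layer-fusion trade-off concrete, the toy Python model below contrasts a layer-by-layer schedule (every intermediate tensor spilled to DRAM) with a fused schedule that keeps most intermediates on chip, and reports the resulting energy-delay product. All layer sizes, energy constants, and the on-chip retention fraction are invented for illustration; Stream's actual memory- and communication-aware analysis is far more detailed.

```python
# Toy cost model contrasting layer-by-layer vs. layer-fused scheduling.
# All parameters below are illustrative, not taken from the paper.

DRAM_ENERGY_PJ_PER_BYTE = 100.0   # assumed off-chip access cost
MAC_ENERGY_PJ = 1.0               # assumed energy per MAC operation

# (MAC count, intermediate activation bytes produced) for a small 3-layer net
layers = [(2e6, 5e5), (4e6, 2.5e5), (1e6, 1e5)]

def layer_by_layer(layers, macs_per_cycle=64):
    latency = sum(m / macs_per_cycle for m, _ in layers)
    # every intermediate tensor is written to and read back from DRAM
    dram_bytes = sum(2 * a for _, a in layers[:-1])
    energy = sum(m * MAC_ENERGY_PJ for m, _ in layers) + dram_bytes * DRAM_ENERGY_PJ_PER_BYTE
    return latency, energy

def layer_fused(layers, macs_per_cycle=64, on_chip_fraction=0.9):
    latency = sum(m / macs_per_cycle for m, _ in layers)
    # fused tiles keep most intermediate data in on-chip buffers
    dram_bytes = sum(2 * a * (1 - on_chip_fraction) for _, a in layers[:-1])
    energy = sum(m * MAC_ENERGY_PJ for m, _ in layers) + dram_bytes * DRAM_ENERGY_PJ_PER_BYTE
    return latency, energy

for name, fn in [("layer-by-layer", layer_by_layer), ("layer-fused", layer_fused)]:
    lat, en = fn(layers)
    print(f"{name:15s}  latency={lat:,.0f} cycles  energy={en/1e6:,.1f} uJ  EDP={lat*en:.3e}")
```

In this simplified model the latency is identical for both schedules, so the energy-delay-product gap comes entirely from the avoided off-chip traffic; the paper's gains also come from better parallel utilization across heterogeneous cores.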
{"title":"A Deep Learning-Assisted Template Attack Against Dynamic Frequency Scaling Countermeasures","authors":"Davide Galli;Francesco Lattari;Matteo Matteucci;Davide Zoni","doi":"10.1109/TC.2024.3477997","DOIUrl":"https://doi.org/10.1109/TC.2024.3477997","url":null,"abstract":"In the last decades, machine learning techniques have been extensively used in place of classical template attacks to implement profiled side-channel analysis. This manuscript focuses on the application of machine learning to counteract Dynamic Frequency Scaling defenses. While state-of-the-art attacks have shown promising results against desynchronization countermeasures, a robust attack strategy has yet to be realized. Motivated by the simplicity and effectiveness of template attacks for devices lacking desynchronization countermeasures, this work presents a Deep Learning-assisted Template Attack (DLaTA) methodology specifically designed to target traces highly desynchronized through Dynamic Frequency Scaling. A deep learning-based pre-processing step recovers information obscured by desynchronization, followed by a template attack for key extraction. Specifically, we developed a three-stage deep learning pipeline to resynchronize traces to a uniform reference clock frequency. The experimental results on the AES cryptosystem executed on a RISC-V System-on-Chip reported a Guessing Entropy equal to 1 and a Guessing Distance greater than 0.25. Results demonstrate the method's ability to successfully retrieve secret keys even in the presence of high desynchronization. As an additional contribution, we publicly release our DFS_DESYNCH database (https://github.com/hardware-fab/DLaTA) containing the first set of real-world highly desynchronized power traces from the execution of a software AES cryptosystem.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"293-306"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10713265","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
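As background for the template-attack stage, the sketch below builds Gaussian templates per Hamming-weight class from synthetic, already-synchronized traces and ranks hypotheses by log-likelihood. The leakage model, trace length, and noise level are invented, and the deep-learning resynchronization pipeline that DLaTA adds in front of this step is not reproduced.

```python
# Minimal Gaussian template attack on synthetic, synchronized traces.
import numpy as np

rng = np.random.default_rng(0)

def hamming_weight(x):
    return bin(int(x)).count("1")

def leak(value, n_samples=50, noise=1.0):
    """Synthetic power trace whose 10th sample leaks the Hamming weight."""
    trace = rng.normal(0.0, noise, n_samples)
    trace[10] += hamming_weight(value)
    return trace

# --- profiling phase: one Gaussian template (mean, variance) per class ---
profiling = [(int(v), leak(v)) for v in rng.integers(0, 256, 5000)]
classes = sorted({hamming_weight(v) for v, _ in profiling})
templates = {}
for hw in classes:
    group = np.array([t for v, t in profiling if hamming_weight(v) == hw])
    templates[hw] = (group.mean(axis=0), group.var(axis=0) + 1e-6)

# --- attack phase: rank hypotheses by log-likelihood over attack traces ---
secret = 0x3C
attack_traces = [leak(secret) for _ in range(20)]

scores = {}
for k in range(256):
    hw = hamming_weight(k)
    if hw not in templates:
        continue
    mu, var = templates[hw]
    scores[k] = sum(-0.5 * np.sum((t - mu) ** 2 / var + np.log(var))
                    for t in attack_traces)

best = max(scores, key=scores.get)
print("guessed Hamming weight:", hamming_weight(best), "true:", hamming_weight(secret))
```

With a Hamming-weight leakage model the attack only narrows the candidate down to a weight class; real attacks combine several intermediate values to recover the full key byte.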
{"title":"Balancing Privacy and Accuracy Using Significant Gradient Protection in Federated Learning","authors":"Benteng Zhang;Yingchi Mao;Xiaoming He;Huawei Huang;Jie Wu","doi":"10.1109/TC.2024.3477971","DOIUrl":"https://doi.org/10.1109/TC.2024.3477971","url":null,"abstract":"Previous state-of-the-art studies have demonstrated that adversaries can access sensitive user data through membership inference attacks (MIAs) in Federated Learning (FL). Introducing differential privacy (DP) into the FL framework is an effective way to enhance the privacy of FL. Nevertheless, in differentially private federated learning (DP-FL), local gradients become excessively sparse in certain training rounds. Especially when training with low privacy budgets, there is a risk of introducing excessive noise into clients’ gradients. This issue can lead to a significant degradation in the accuracy of the global model. Thus, balancing user privacy and global model accuracy becomes a challenge in DP-FL. To this end, we propose an approach, known as differential privacy federated aggregation based on significant gradient protection (DP-FedASGP). DP-FedASGP mitigates excessive noise by protecting significant gradients and accelerates the convergence of the global model by calculating dynamic aggregation weights for gradients. Experimental results show that DP-FedASGP achieves privacy protection comparable to DP-FedAvg and cpSGD (communication-private SGD based on gradient quantization) but outperforms DP-FedSNLC (sparse noise based on clipping losses and privacy budget costs) and FedSMP (sparsified model perturbation). Furthermore, the average global test accuracy of DP-FedASGP across four datasets and three models is about 2.62%, 4.71%, 0.45%, and 0.19% higher than the above methods, respectively. These improvements indicate that DP-FedASGP is a promising approach for balancing the privacy and accuracy of DP-FL.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"278-292"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
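For reference, the following minimal sketch shows the clip-and-add-Gaussian-noise client update used by DP-FedAvg, the baseline DP-FedASGP is compared against, plus a plain weighted aggregation step. The clipping norm, noise multiplier, and gradient sizes are illustrative; the significant-gradient selection and dynamic aggregation weights that define DP-FedASGP are not reproduced here.

```python
# DP-FedAvg-style client update: clip the gradient, add calibrated Gaussian noise.
import numpy as np

def dp_client_update(grad, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip a client gradient to clip_norm and add Gaussian noise scaled to it."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise

def aggregate(updates, weights=None):
    """Weighted average of client updates (uniform weights by default)."""
    updates = np.stack(updates)
    if weights is None:
        weights = np.full(len(updates), 1.0 / len(updates))
    return np.tensordot(weights, updates, axes=1)

rng = np.random.default_rng(42)
client_grads = [rng.normal(size=100) for _ in range(10)]
noisy_updates = [dp_client_update(g, rng=rng) for g in client_grads]
print("aggregated update norm:", np.linalg.norm(aggregate(noisy_updates)))
```

The paper's argument is that applying this mechanism uniformly injects too much noise at low privacy budgets; its contribution lies in choosing which gradients to protect and how to weight them during aggregation.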
{"title":"Collaborative Neural Architecture Search for Personalized Federated Learning","authors":"Yi Liu;Song Guo;Jie Zhang;Zicong Hong;Yufeng Zhan;Qihua Zhou","doi":"10.1109/TC.2024.3477945","DOIUrl":"https://doi.org/10.1109/TC.2024.3477945","url":null,"abstract":"Personalized federated learning (pFL) is a promising approach to train customized models for multiple clients over heterogeneous data distributions. However, existing works on pFL often rely on the optimization of model parameters and ignore the demand for personalization of the neural network architecture, which can greatly affect model performance in practice. Therefore, generating personalized models with different neural architectures for different clients is a key issue in implementing pFL in a heterogeneous environment. Motivated by Neural Architecture Search (NAS), a model architecture searching methodology, this paper aims to automate the model design in a collaborative manner while achieving good training performance for each client. Specifically, we reconstruct the centralized searching of NAS into a distributed scheme called Personalized Architecture Search (PAS), where differentiable architecture fine-tuning is achieved via gradient-descent optimization, thus allowing each client to obtain the most appropriate model. Furthermore, to aggregate knowledge from heterogeneous neural architectures, a knowledge distillation-based training framework is proposed to achieve a good trade-off between generalization and personalization in federated learning. Extensive experiments demonstrate that our architecture-level personalization method achieves higher accuracy under non-IID settings without aggravating model complexity relative to state-of-the-art benchmarks.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"250-262"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
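The idea of differentiable architecture fine-tuning can be illustrated with a generic DARTS-style mixed operation, where a softmax over learnable architecture parameters blends candidate operators so that the architecture itself is trained by gradient descent. The operator set, tensor shapes, and joint (rather than alternating) optimization below are simplifications, not the paper's PAS procedure or its distillation-based aggregation.

```python
# Generic DARTS-style mixed operation: architecture weights trained by gradient descent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.Identity(),
        ])
        # one learnable architecture weight (alpha) per candidate operation
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

op = MixedOp(channels=8)
x = torch.randn(2, 8, 16, 16)
target = torch.randn(2, 8, 16, 16)

# for brevity, weights and architecture parameters are updated jointly here
opt = torch.optim.Adam(op.parameters(), lr=1e-2)
for _ in range(50):
    opt.zero_grad()
    loss = F.mse_loss(op(x), target)
    loss.backward()
    opt.step()

print("architecture weights:", F.softmax(op.alpha, dim=0).detach().numpy())
```

After search, each client would keep only the operators with the largest architecture weights, yielding a per-client architecture.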
{"title":"A Heterogeneous and Adaptive Architecture for Decision-Tree-Based ACL Engine on FPGA","authors":"Yao Xin;Chengjun Jia;Wenjun Li;Ori Rottenstreich;Yang Xu;Gaogang Xie;Zhihong Tian;Jun Li","doi":"10.1109/TC.2024.3477955","DOIUrl":"https://doi.org/10.1109/TC.2024.3477955","url":null,"abstract":"Access Control Lists (ACLs) are crucial for ensuring the security and integrity of modern cloud and carrier networks by regulating access to sensitive information and resources. However, previous software and hardware implementations no longer meet the requirements of modern datacenters. The emergence of FPGA-based SmartNICs presents an opportunity to offload ACL functions from the host CPU, leading to improved network performance in datacenter applications. However, previous FPGA-based ACL designs lacked the necessary flexibility to support different rulesets without hardware reconfiguration while maintaining high performance. In this paper, we propose HACL, a heterogeneous and adaptive architecture for a decision-tree-based ACL engine on FPGA. By employing techniques such as tree decomposition and recirculated pipeline scheduling, HACL can accommodate various rulesets without reconfiguring the underlying architecture. To facilitate the efficient mapping of different decision trees to memory and optimize the throughput of a ruleset, we also introduce a heterogeneous framework with a compiler on the CPU platform for HACL. We implement HACL on a typical SmartNIC and evaluate its performance. The results demonstrate that HACL achieves a throughput exceeding 260 Mpps when processing 100K-scale ACL rulesets, with low hardware resource utilization. By integrating more engines, HACL can achieve even higher throughput and support larger rulesets.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"263-277"},"PeriodicalIF":3.6,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
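A software sketch of decision-tree packet classification (a simple equal-size cut on the source-port dimension, with overlapping rules replicated into both children) conveys the kind of data structure HACL maps onto its FPGA pipelines. The rule format, cut heuristic, and thresholds are illustrative; HACL's tree decomposition and recirculated pipeline scheduling are not modeled.

```python
# Toy decision-tree classifier over two-field range rules (source/destination port).
from dataclasses import dataclass

@dataclass
class Rule:
    priority: int
    src_lo: int
    src_hi: int
    dst_lo: int
    dst_hi: int
    action: str

    def matches(self, src, dst):
        return self.src_lo <= src <= self.src_hi and self.dst_lo <= dst <= self.dst_hi

class Node:
    def __init__(self, rules, lo=0, hi=65535, leaf_size=2, depth=0):
        self.rules, self.split, self.children = rules, None, None
        if len(rules) > leaf_size and depth < 16 and lo < hi:
            mid = (lo + hi) // 2
            left = [r for r in rules if r.src_lo <= mid]    # rules overlapping the low half
            right = [r for r in rules if r.src_hi > mid]    # rules overlapping the high half
            if len(left) < len(rules) or len(right) < len(rules):  # only split if it helps
                self.split = mid
                self.children = (Node(left, lo, mid, leaf_size, depth + 1),
                                 Node(right, mid + 1, hi, leaf_size, depth + 1))

    def classify(self, src, dst):
        node = self
        while node.split is not None:
            node = node.children[0] if src <= node.split else node.children[1]
        matching = [r for r in node.rules if r.matches(src, dst)]
        return min(matching, key=lambda r: r.priority).action if matching else "default-deny"

rules = [
    Rule(1, 0, 1023, 80, 80, "permit"),
    Rule(2, 1024, 65535, 0, 65535, "deny"),
    Rule(3, 0, 65535, 443, 443, "permit"),
]
tree = Node(rules)
print(tree.classify(src=500, dst=80), tree.classify(src=2000, dst=22))
```

On hardware, each tree level becomes a pipeline stage, so a lookup completes in a fixed number of memory accesses regardless of ruleset size.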
{"title":"Performance and Environment-Aware Advanced Driving Assistance Systems","authors":"Sreenitha Kasarapu;Sai Manoj Pudukotai Dinakarrao","doi":"10.1109/TC.2024.3475572","DOIUrl":"https://doi.org/10.1109/TC.2024.3475572","url":null,"abstract":"In autonomous and self-driving vehicles, visual perception of the driving environment plays a key role. Vehicles rely on machine learning (ML) techniques such as deep neural networks (DNNs), which are extensively trained on manually annotated databases, to achieve this goal. However, the availability of training data that can represent different environmental conditions can be limited. Furthermore, as different driving terrains require different decisions by the driver, it is tedious and impractical to design a database with all possible scenarios. This work proposes a semi-parametric approach that bypasses the manual annotation required to train vehicle perception systems in autonomous and self-driving vehicles. We present a novel “Performance and Environment-aware Advanced Driving Assistance System” which employs one-shot learning for efficient data generation using user action and response, in addition to synthetic traffic data generated as Pareto-optimal solutions from one-shot objects using a set of generalization functions. Adapting to the driving environment through such optimization adds more robustness and safety features to autonomous driving. We evaluate the proposed framework on environment perception challenges encountered in autonomous driving assistance systems. To accelerate learning and adapt in real time to perceived data, a novel deep learning-based Alternating Direction Method of Multipliers (dlADMM) algorithm is introduced to improve the convergence capabilities of regular machine learning models. This methodology optimizes the training process and makes applying the machine learning model to real-world problems more feasible. We evaluated the proposed technique on AlexNet and MobileNetv2 networks and achieved more than an 18× speedup. By making the proposed technique behavior-aware, we observed performance of up to 99% when detecting traffic signals.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"131-142"},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
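The dlADMM algorithm itself is not reproduced here, but the alternating-update structure it builds on can be illustrated with textbook ADMM applied to the lasso problem, shown below as a primer. Problem sizes, the penalty parameter, and the regularization weight are arbitrary choices for the example.

```python
# Classic ADMM for lasso: minimize 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x = z.
import numpy as np

def soft_threshold(v, k):
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(AtA, Atb + rho * (z - u))   # x-update (least squares)
        z = soft_threshold(x + u, lam / rho)            # z-update (proximal step)
        u = u + x - z                                   # dual variable update
    return z

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20))
true_x = np.zeros(20)
true_x[:3] = [2.0, -1.5, 1.0]
b = A @ true_x + 0.01 * rng.normal(size=50)
print(np.round(admm_lasso(A, b), 2))
```

The deep-learning variant replaces these closed-form subproblems with layer-wise updates of the network parameters, which is where its convergence and speed benefits come from.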
{"title":"Dependability of the K Minimum Values Sketch: Protection and Comparative Analysis","authors":"Jinhua Zhu;Zhen Gao;Pedro Reviriego;Shanshan Liu;Fabrizio Lombardi","doi":"10.1109/TC.2024.3475588","DOIUrl":"https://doi.org/10.1109/TC.2024.3475588","url":null,"abstract":"A basic operation in big data analysis is to find the cardinality estimate; to estimate the cardinality at high speed and with a low memory requirement, data sketches that provide approximate estimates are usually used. The K Minimum Values (KMV) sketch is one of the most popular options; however, soft errors in the memories used by KMV may substantially degrade performance. This paper is the first to consider the impact of soft errors on the KMV sketch and to compare it with HyperLogLog (HLL), another widely used sketch for cardinality estimation. Initially, the operation of KMV in the presence of soft errors in the memory (i.e., its dependability) is studied by a theoretical analysis and by simulation with error injection. The evaluation results show that errors during the construction phase of KMV may cause large deviations in the estimation results. Subsequently, based on the algorithmic features of the KMV sketch, two protection schemes are proposed. The first scheme is based on using a single parity check (SPC) to detect errors and reduce their impact on the cardinality estimate; the second scheme is based on the incremental property of the memory list in KMV. The presented evaluation shows that both schemes can dramatically improve the performance of KMV, and the SPC scheme performs better even though it requires a larger memory footprint and additional overhead in the checking operation. Finally, it is shown that soft errors on the unprotected KMV produce larger worst-case errors than in HLL, but the average impact of errors is lower; also, the protected KMV using the proposed schemes is more dependable than HLL with existing protection techniques.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"210-221"},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
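A minimal, unprotected KMV estimator helps make the dependability discussion concrete: the sketch keeps the k smallest normalized hash values and estimates cardinality as (k-1) divided by the k-th minimum, so a flipped bit in any stored minimum directly perturbs the estimate. The hash choice and k are illustrative, and the SPC and incremental-property protection schemes from the paper are not implemented.

```python
# Basic K Minimum Values (KMV) cardinality sketch, without any error protection.
import hashlib
import heapq

class KMV:
    """Keeps the k smallest normalized hash values of the inserted items."""
    def __init__(self, k=256):
        self.k = k
        self.heap = []      # max-heap via negated values: tracks the k minima
        self.stored = set() # values currently kept, to skip duplicates

    @staticmethod
    def _hash(item):
        h = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64   # uniform in [0, 1)

    def add(self, item):
        v = self._hash(item)
        if v in self.stored:
            return
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -v)
            self.stored.add(v)
        elif v < -self.heap[0]:                       # smaller than current k-th minimum
            evicted = -heapq.heappushpop(self.heap, -v)
            self.stored.discard(evicted)
            self.stored.add(v)

    def estimate(self):
        if len(self.heap) < self.k:
            return float(len(self.heap))              # exact for small sets
        return (self.k - 1) / (-self.heap[0])         # (k-1) / k-th minimum value

kmv = KMV(k=256)
for i in range(100_000):
    kmv.add(f"flow-{i}")
print("estimated cardinality:", round(kmv.estimate()))
```

Because the estimate depends only on the k-th minimum, an error that corrupts that single stored value can shift the result arbitrarily, which is the failure mode the paper's protection schemes target.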
{"title":"Sketch-Based Adaptive Communication Optimization in Federated Learning","authors":"Pan Zhang;Lei Xu;Lin Mei;Chungen Xu","doi":"10.1109/TC.2024.3475578","DOIUrl":"https://doi.org/10.1109/TC.2024.3475578","url":null,"abstract":"In recent years, cross-device federated learning (FL), particularly in the context of Internet of Things (IoT) applications, has demonstrated remarkable potential. Despite significant efforts, empirical evidence suggests that FL algorithms have yet to gain widespread practical adoption. The primary obstacle stems from the inherent bandwidth overhead associated with gradient exchanges between clients and the server, resulting in substantial delays, especially within communication networks. To deal with this problem, various solutions have been proposed with the hope of finding a better balance between efficiency and accuracy. Following this goal, we focus on investigating how to design a lightweight FL algorithm that requires less communication cost while maintaining comparable accuracy. Specifically, we propose a sketch-based FL algorithm that incorporates the incremental singular value decomposition (ISVD) method in a way that has little negative effect on accuracy during training. Moreover, we also provide adaptive gradient error accumulation and error compensation mechanisms to mitigate the accumulated gradient errors caused by sketch compression and improve model accuracy. Our extensive experimentation with various datasets demonstrates the efficacy of the proposed approach. Specifically, our scheme achieves nearly a 93% reduction in communication cost during the training of multi-layer perceptron (MLP) models on the MNIST dataset.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"170-184"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
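The compress/decompress loop and the error-compensation idea can be sketched with a generic count sketch plus error feedback, as below; the paper's combination with incremental SVD and its adaptive accumulation policy are not reproduced, and the sketch dimensions and gradient sparsity are invented for the example.

```python
# Count-sketch gradient compression with error feedback (generic illustration).
import numpy as np

rng = np.random.default_rng(0)
DIM, ROWS, COLS = 10_000, 5, 500          # gradient size vs. sketch size

# fixed hash functions shared by client and server
buckets = rng.integers(0, COLS, size=(ROWS, DIM))
signs = rng.choice([-1.0, 1.0], size=(ROWS, DIM))

def compress(grad):
    sketch = np.zeros((ROWS, COLS))
    for r in range(ROWS):
        np.add.at(sketch[r], buckets[r], signs[r] * grad)   # scatter-add into buckets
    return sketch

def decompress(sketch):
    est = np.stack([signs[r] * sketch[r, buckets[r]] for r in range(ROWS)])
    return np.median(est, axis=0)          # median across rows suppresses collisions

error = np.zeros(DIM)                      # error-feedback accumulator
for step in range(3):
    grad = 0.01 * rng.normal(size=DIM)     # mostly small coordinates ...
    hot = rng.choice(DIM, 50, replace=False)
    grad[hot] += 5.0 * rng.normal(size=50) # ... plus a few heavy hitters
    corrected = grad + error               # add back what previous rounds lost
    recovered = decompress(compress(corrected))
    error = corrected - recovered          # carry the compression error forward
    cos = np.dot(recovered, grad) / (np.linalg.norm(recovered) * np.linalg.norm(grad))
    print(f"round {step}: recovery cosine similarity = {cos:.3f}")
```

Only the small sketch matrix travels over the network each round, which is where the communication savings come from; error feedback keeps the information lost to compression from being discarded permanently.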
{"title":"Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs","authors":"Aodong Chen;Fei Xu;Li Han;Yuan Dong;Li Chen;Zhi Zhou;Fangming Liu","doi":"10.1109/TC.2024.3475589","DOIUrl":"https://doi.org/10.1109/TC.2024.3475589","url":null,"abstract":"GPUs have become the de facto hardware devices for accelerating Deep Neural Network (DNN) inference workloads. However, the conventional sequential execution mode of DNN operators in mainstream deep learning frameworks cannot fully utilize GPU resources, even with operator fusion enabled, due to the increasing complexity of model structures and a greater diversity of operators. Moreover, an inadequate operator launch order in parallelized execution scenarios can lead to GPU resource wastage and unexpected performance interference among operators. In this paper, we propose Opara, a resource- and interference-aware DNN operator parallel scheduling framework to accelerate DNN inference on GPUs. Specifically, Opara first employs CUDA Streams and CUDA Graph to automatically parallelize the execution of multiple operators. To further expedite DNN inference, Opara leverages the resource demands of operators to judiciously adjust the operator launch order on GPUs, overlapping the execution of compute-intensive and memory-intensive operators. We implement and open-source a prototype of Opara based on PyTorch in a non-intrusive manner. Extensive prototype experiments with representative DNN and Transformer-based models demonstrate that Opara outperforms the default sequential CUDA Graph in PyTorch and state-of-the-art operator parallelism systems by up to 1.68× and 1.29×, respectively, with acceptable runtime overhead.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"325-333"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
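The underlying mechanism, launching independent operators on separate CUDA streams so their execution can overlap, can be shown in a few lines of PyTorch. This is only the manual multi-stream pattern; Opara's automatic stream assignment, CUDA Graph capture, and resource-aware launch reordering are not reproduced, and the matrix sizes are arbitrary.

```python
# Running two independent operators on separate CUDA streams in PyTorch.
import torch

def parallel_matmuls(a1, b1, a2, b2):
    if not torch.cuda.is_available():
        return a1 @ b1, a2 @ b2            # CPU fallback: plain sequential execution

    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    cur = torch.cuda.current_stream()
    # make both side streams wait for pending work on the current stream
    s1.wait_stream(cur)
    s2.wait_stream(cur)
    with torch.cuda.stream(s1):
        out1 = a1 @ b1                     # operator 1 launched on stream 1
    with torch.cuda.stream(s2):
        out2 = a2 @ b2                     # operator 2 launched on stream 2
    # the current stream must wait for both results before they are consumed
    cur.wait_stream(s1)
    cur.wait_stream(s2)
    return out1, out2

dev = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=dev)
y = torch.randn(1024, 1024, device=dev)
o1, o2 = parallel_matmuls(x, y, y, x)
print(o1.shape, o2.shape)
```

Whether the two kernels actually overlap depends on their resource demands, which is precisely why Opara reorders launches so that compute-bound and memory-bound operators share the GPU at the same time.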
{"title":"A Numerical Variability Approach to Results Stability Tests and Its Application to Neuroimaging","authors":"Yohan Chatelain;Loïc Tetrel;Christopher J. Markiewicz;Mathias Goncalves;Gregory Kiar;Oscar Esteban;Pierre Bellec;Tristan Glatard","doi":"10.1109/TC.2024.3475586","DOIUrl":"https://doi.org/10.1109/TC.2024.3475586","url":null,"abstract":"Ensuring the long-term reproducibility of data analyses requires results stability tests to verify that analysis results remain within acceptable variation bounds despite inevitable software updates and hardware evolutions. This paper introduces a numerical variability approach for results stability tests, which determines acceptable variation bounds using random rounding of floating-point calculations. By applying the resulting stability test to fMRIPrep, a widely-used neuroimaging tool, we show that the test is sensitive enough to detect subtle updates in image processing methods while remaining specific enough to accept numerical variations within a reference version of the application. This result contributes to enhancing the reliability and reproducibility of data analyses by providing a robust and flexible method for stability testing.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"200-209"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
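The flavor of a variability-based stability test can be sketched by perturbing a toy analysis with random relative noise at roughly unit-roundoff magnitude, deriving acceptance bounds from the resulting spread, and checking a new result against them. Note the assumption: real random rounding instruments the floating-point operations themselves (e.g., via Monte Carlo arithmetic tooling), whereas this sketch only perturbs inputs, and the pipeline, bounds, and the 5-sigma threshold are invented for the example.

```python
# Toy results-stability test: acceptance bounds derived from numerical variability.
import numpy as np

def noisy_pipeline(data, rng, eps=2**-24):
    """Stand-in analysis whose inputs are perturbed at ~unit-roundoff magnitude."""
    perturbed = data * (1.0 + rng.uniform(-eps, eps, size=data.shape))
    return float(np.mean(np.sqrt(np.abs(perturbed))))

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=10_000)

# reference distribution of results under numerical noise for the trusted version
samples = np.array([noisy_pipeline(data, rng) for _ in range(30)])
lo = samples.mean() - 5 * samples.std()
hi = samples.mean() + 5 * samples.std()

def stability_test(new_result):
    """Accept a new result only if it falls within the reference variability bounds."""
    return lo <= new_result <= hi

print("acceptance bounds:", (lo, hi))
print("unchanged pipeline passes:", stability_test(noisy_pipeline(data, rng)))
print("subtly changed pipeline passes:", stability_test(noisy_pipeline(data * 1.001, rng)))
```

The point of deriving the bounds from numerical noise rather than fixing them by hand is that the test then accepts harmless floating-point variation across platforms while still flagging genuine changes in the processing method.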