{"title":"Performance and Environment-Aware Advanced Driving Assistance Systems","authors":"Sreenitha Kasarapu;Sai Manoj Pudukotai Dinakarrao","doi":"10.1109/TC.2024.3475572","DOIUrl":"https://doi.org/10.1109/TC.2024.3475572","url":null,"abstract":"In autonomous and self-driving vehicles, visual perception of the driving environment plays a key role. Vehicles rely on machine learning (ML) techniques such as deep neural networks (DNNs), which are extensively trained on manually annotated databases to achieve this goal. However, the availability of training data that can represent different environmental conditions can be limited. Furthermore, as different driving terrains require different decisions by the driver, it is tedious and impractical to design a database with all possible scenarios. This work proposes a semi-parametric approach that bypasses the manual annotation required to train vehicle perception systems in autonomous and self-driving vehicles. We present a novel “Performance and Environment-aware Advanced Driving Assistance Systems” which employs one-shot learning for efficient data generation using user action and response in addition to the synthetic traffic data generated as Pareto optimal solutions from one-shot objects using a set of generalization functions. Adapting to the driving environments through such optimization adds more robustness and safety features to autonomous driving. We evaluate the proposed framework on environment perception challenges encountered in autonomous driving assistance systems. To accelerate the learning and adapt in real-time to perceived data, a novel deep learning-based Alternating Direction Method of Multipliers (dlADMM) algorithm is introduced to improve the convergence capabilities of regular machine learning models. This methodology optimizes the training process and makes applying the machine learning model to real-world problems more feasible. We evaluated the proposed technique on AlexNet and MobileNetv2 networks and achieved more than 18\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup. By making the proposed technique behavior-aware we observed performance of upto 99% while detecting traffic signals.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"131-142"},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sketch-Based Adaptive Communication Optimization in Federated Learning","authors":"Pan Zhang;Lei Xu;Lin Mei;Chungen Xu","doi":"10.1109/TC.2024.3475578","DOIUrl":"https://doi.org/10.1109/TC.2024.3475578","url":null,"abstract":"In recent years, cross-device federated learning (FL), particularly in the context of Internet of Things (IoT) applications, has demonstrated its remarkable potential. Despite significant efforts, empirical evidence suggests that FL algorithms have yet to gain widespread practical adoption. The primary obstacle stems from the inherent bandwidth overhead associated with gradient exchanges between clients and the server, resulting in substantial delays, especially within communication networks. To deal with the problem, various solutions are proposed with the hope of finding a better balance between efficiency and accuracy. Following this goal, we focus on investigating how to design a lightweight FL algorithm that requires less communication cost while maintaining comparable accuracy. Specifically, we propose a Sketch-based FL algorithm that combines the incremental singular value decomposition (ISVD) method in a way that does not negatively affect accuracy much in the training process. Moreover, we also provide adaptive gradient error accumulation and error compensation mechanisms to mitigate accumulated gradient errors caused by sketch compression and improve the model accuracy. Our extensive experimentation with various datasets demonstrates the efficacy of our proposed approach. Specifically, our scheme achieves nearly a 93% reduction in communication cost during the training of multi-layer perceptron models (MLP) using the MNIST dataset.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"170-184"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs","authors":"Aodong Chen;Fei Xu;Li Han;Yuan Dong;Li Chen;Zhi Zhou;Fangming Liu","doi":"10.1109/TC.2024.3475589","DOIUrl":"https://doi.org/10.1109/TC.2024.3475589","url":null,"abstract":"GPUs have become the \u0000<i>defacto</i>\u0000 hardware devices for accelerating Deep Neural Network (DNN) inference workloads. However, the conventional \u0000<i>sequential execution mode of DNN operators</i>\u0000 in mainstream deep learning frameworks cannot fully utilize GPU resources, even with the operator fusion enabled, due to the increasing complexity of model structures and a greater diversity of operators. Moreover, the \u0000<i>inadequate operator launch order</i>\u0000 in parallelized execution scenarios can lead to GPU resource wastage and unexpected performance interference among operators. In this paper, we propose \u0000<i>Opara</i>\u0000, a resource- and interference-aware DNN \u0000<u>Op</u>\u0000erator \u0000<u>para</u>\u0000llel scheduling framework to accelerate DNN inference on GPUs. Specifically, \u0000<i>Opara</i>\u0000 first employs \u0000<monospace>CUDA Streams</monospace>\u0000 and \u0000<monospace>CUDA Graph</monospace>\u0000 to \u0000<i>parallelize</i>\u0000 the execution of multiple operators automatically. To further expedite DNN inference, \u0000<i>Opara</i>\u0000 leverages the resource demands of operators to judiciously adjust the operator launch order on GPUs, overlapping the execution of compute-intensive and memory-intensive operators. We implement and open source a prototype of \u0000<i>Opara</i>\u0000 based on PyTorch in a \u0000<i>non-intrusive</i>\u0000 manner. Extensive prototype experiments with representative DNN and Transformer-based models demonstrate that \u0000<i>Opara</i>\u0000 outperforms the default sequential \u0000<monospace>CUDA Graph</monospace>\u0000 in PyTorch and the state-of-the-art operator parallelism systems by up to \u0000<inline-formula><tex-math>$1.68boldsymbol{times}$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$1.29boldsymbol{times}$</tex-math></inline-formula>\u0000, respectively, yet with acceptable runtime overhead.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"325-333"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Numerical Variability Approach to Results Stability Tests and Its Application to Neuroimaging","authors":"Yohan Chatelain;Loïc Tetrel;Christopher J. Markiewicz;Mathias Goncalves;Gregory Kiar;Oscar Esteban;Pierre Bellec;Tristan Glatard","doi":"10.1109/TC.2024.3475586","DOIUrl":"https://doi.org/10.1109/TC.2024.3475586","url":null,"abstract":"Ensuring the long-term reproducibility of data analyses requires results stability tests to verify that analysis results remain within acceptable variation bounds despite inevitable software updates and hardware evolutions. This paper introduces a numerical variability approach for results stability tests, which determines acceptable variation bounds using random rounding of floating-point calculations. By applying the resulting stability test to \u0000<italic>fMRIPrep</i>\u0000, a widely-used neuroimaging tool, we show that the test is sensitive enough to detect subtle updates in image processing methods while remaining specific enough to accept numerical variations within a reference version of the application. This result contributes to enhancing the reliability and reproducibility of data analyses by providing a robust and flexible method for stability testing.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"200-209"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and Fast High-Performance Library Generation for Deep Learning Accelerators","authors":"Jun Bi;Yuanbo Wen;Xiaqing Li;Yongwei Zhao;Yuxuan Guo;Enshuai Zhou;Xing Hu;Zidong Du;Ling Li;Huaping Chen;Tianshi Chen;Qi Guo","doi":"10.1109/TC.2024.3475575","DOIUrl":"https://doi.org/10.1109/TC.2024.3475575","url":null,"abstract":"The widespread adoption of deep learning accelerators (DLAs) underscores their pivotal role in improving the performance and energy efficiency of neural networks. To fully leverage the capabilities of these accelerators, exploration-based library generation approaches have been widely used to substantially reduce software development overhead. However, these approaches have been challenged by issues related to sub-optimal optimization results and excessive optimization overheads. In this paper, we propose \u0000<small>Heron</small>\u0000 to generate high-performance libraries of DLAs in an efficient and fast way. The key is automatically enforcing massive constraints through the entire program generation process and guiding the exploration with an accurate pre-trained cost model. \u0000<small>Heron</small>\u0000 represents the search space as a constrained satisfaction problem (CSP) and explores the space via evolving the CSPs. Thus, the sophisticated constraints of the search space are strictly preserved during the entire exploration process. The exploration algorithm has the flexibility to engage in space exploration using either online-trained models or pre-trained models. Experimental results demonstrate that \u0000<small>Heron</small>\u0000 averagely achieves 2.71\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup over three state-of-the-art automatic generation approaches. Also, compared to vendor-provided hand-tuned libraries, \u0000<small>Heron</small>\u0000 achieves a 2.00\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 speedup on average. When employing a pre-trained model, \u0000<small>Heron</small>\u0000 achieves 11.6\u0000<inline-formula><tex-math>$times$</tex-math></inline-formula>\u0000 compilation time speedup, incurring a minor impact on execution time.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"155-169"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Service Function Chain Placement Over Heterogeneous Devices in Deviceless Edge Computing Environments","authors":"Yaodong Huang;Tingting Yao;Zelin Lin;Xiaojun Shang;Yukun Yuan;Laizhong Cui;Yuanyuan Yang","doi":"10.1109/TC.2024.3475590","DOIUrl":"https://doi.org/10.1109/TC.2024.3475590","url":null,"abstract":"Heterogeneous devices in edge computing bring challenges as well as opportunities for edge computing to utilize powerful and heterogeneous hardware for a variety of complex tasks. In this paper, we propose a service function chain placement strategy considering the heterogeneity of devices in deviceless edge computing environments. The service function chain system utilizes lightweight virtualization technologies to manage resources, considering the heterogeneity of devices to support various complex tasks, and offer low latency services to user requests. We propose an optimal service function chain placement problem minimizing the service delay and formulate it into a quasi-convex problem. We implement different edge applications that can be served by function chains and conduct extensive experiments over real heterogeneous edge devices. Results from the experiments and simulations show that our proposed service function chain scheme is applicable in edge environments, and perform well over services latency, resource utilization as well as the power consumption of edge devices.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"222-236"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Control Flow in Spatial Architectures: Insights Into Control Flow Plane Design","authors":"Jinyi Deng;Xinru Tang;Jiahao Zhang;Yuxuan Li;Linyun Zhang;Fengbin Tu;Shaojun Wei;Yang Hu;Shouyi Yin","doi":"10.1109/TC.2024.3475582","DOIUrl":"https://doi.org/10.1109/TC.2024.3475582","url":null,"abstract":"Spatial architecture is a high-performance paradigm that employs control flow graphs and data flow graphs as computation model, and producer/consumer models as execution model. However, existing spatial architectures struggle with control flow handling challenges. Upon thoroughly characterizing their PE execution models, we observe that they lack autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability. This degrades its performance in intensive control programs. To tackle the existing control flow handling challenges, Marionette, a spatial architecture with an explicit-designed control flow plane, is proposed. We elaborately develop a full stack of Marionette architecture, from ISA, compiler, simulator to RTL. Marionette's flexible Control Flow Plane enables autonomous, peer-to-peer, and temporally loosely-coupled control flow management. Its Proactive PE Configuration ensures computation-overlapped and timely configuration to promote Branch Divergence handling capability. Besides, Marionette's Agile PE Assignment improves pipeline performance of imperfect loops. Compared to state-of-the-art spatial architectures, the experimental results demonstrate that Marionette outperforms Softbrain, TIA, REVEL, and RipTide by geomean 2.88\u0000<inline-formula><tex-math>$mathbf{times}$</tex-math></inline-formula>\u0000, 3.38\u0000<inline-formula><tex-math>$mathbf{times}$</tex-math></inline-formula>\u0000, 1.55\u0000<inline-formula><tex-math>$mathbf{times}$</tex-math></inline-formula>\u0000, and 2.66\u0000<inline-formula><tex-math>$mathbf{times}$</tex-math></inline-formula>\u0000 in a variety of challenging intensive control programs.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"185-199"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Modular Multiplication Using Variable Length Algorithms","authors":"Shahab Mirzaei-Teshnizi;Parviz Keshavarzi","doi":"10.1109/TC.2024.3475574","DOIUrl":"https://doi.org/10.1109/TC.2024.3475574","url":null,"abstract":"This paper presents two improved modular multiplication algorithms: variable length Interleaved modular multiplication (VLIM) algorithm and parallel modular multiplication (P_MM) method using variable length algorithms to achieve high throughput rates. The new Interleaved modular multiplication algorithm applies the zero counting and partitioning algorithm to a multiplier’s non-adjacent form (NAF). It divides this input into sections with variable-radix. The sections include a digit of zero sequences and a non-zero digit (-1 or 1) in the most valuable place. Therefore, in addition to reducing the number of required clock pulses, high-radix partial multiplication \u0000<inline-formula><tex-math>$mathbf{X}^{left(mathbf{i}right)}cdot mathbf{Y}$</tex-math></inline-formula>\u0000 is simplified and performed as a binary addition or subtraction operation, and multiplication operations for consecutive zero bits are executed in one clock cycle instead of several clock cycles. The proposed parallel modular multiplication algorithm divides the multiplier into two parts. It utilizes (VLIM) and variable length Montgomery modular multiplication (VLM3) methods to compute the modular multiplication for the upper and lower portions in parallel, according to the proximity of their multiplication time. The implementation results on a Xilinx Virtex-7 FPGA show that the parallel modular multiplication computes a 2048-bit modular multiplication in 0.903 µs, with a maximum clock frequency of 387 MHz and area × time per bit value equal to 9.14.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"143-154"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ClusPar: A Game-Theoretic Approach for Efficient and Scalable Streaming Edge Partitioning","authors":"Zezhong Ding;Deyu Kong;Zhuoxu Zhang;Xike Xie;Jianliang Xu","doi":"10.1109/TC.2024.3475568","DOIUrl":"https://doi.org/10.1109/TC.2024.3475568","url":null,"abstract":"Streaming edge partitioning plays a crucial role in the distributed processing of large-scale web graphs, such as pagerank. The quality of partitioning is of utmost importance and directly affects the runtime cost of distributed graph processing. However, streaming graph clustering, a key component of mainstream streaming edge partitioning, is vertex-centric. This incurs a mismatch with the edge-centric partitioning strategy, necessitating additional post-processing and several graph traversals to transition from vertex-centric clusters to edge-centric partitions. This transition not only adds extra runtime overhead but also risks a decline in partitioning quality. In this paper, we propose a novel algorithm, called ClusPar, to address the problem of streaming edge partitioning. The ClusPar framework consists of two steps, streaming edge clustering and edge cluster partitioning. Different from prior studies, the first step traverses the input graph in a single pass to generate edge-centric clusters, while the second step applies game theory over these edge-centric clusters to produce partitions. Extensive experiments show that ClusPar outperforms the state-of-the-art streaming edge partitioning methods in terms of the partitioning quality, efficiency, and scalability.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"116-130"},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated Learning Based DDoS Attacks Detection in Large Scale Software-Defined Network","authors":"Yannis Steve Nsuloun Fotse;Vianney Kengne Tchendji;Mthulisi Velempini","doi":"10.1109/TC.2024.3474180","DOIUrl":"https://doi.org/10.1109/TC.2024.3474180","url":null,"abstract":"Software-Defined Networking (SDN) is an innovative concept that segments the network into three planes: a control plane comprising of one or multiple controllers; a data plane responsible for data transmission; and an application plane which enables the reconfiguration of network functionalities. Nevertheless, this approach has exposed the controller as a prime target for malicious elements to attack it, such as Distributed Denial of Service (DDoS) attacks. Current DDoS defense schemes often increased the controller load and resource consumption. These schemes are typically tailored for single-controller architectures, a significant limitation when considering the scalability requirements of large-scale SDN. To address these limitations, we introduce an efficient Federated Learning approach, named “FedLAD,” designed to counter DDoS attacks in SDN-based large-scale networks, particularly in multi-controller architectures. Federated learning is a decentralized approach to machine learning where models are trained across multiple devices as controllers store local data samples, without exchanging them. The evaluation of the proposed scheme's performance, using InSDN, CICDDoS2019, and CICDoS2017 datasets, shows an accuracy exceeding 98%, a significant improvement compared to related works. Furthermore, the evaluation of the FedLAD protocol with real-time traffic in an SDN context demonstrates its ability to detect DDoS attacks with high accuracy and minimal resource consumption. To the best of our knowledge, this work introduces a new technique in applying FL for DDoS attack detection in large-scale SDN.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 1","pages":"101-115"},"PeriodicalIF":3.6,"publicationDate":"2024-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10705345","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}