Wei Zhang;Yunlong Yu;Xiao Jiang;Nan Guan;Naijun Zhan;Lei Ju
{"title":"WCET Estimation for CNN Inference on FPGA SoC With Multi-DPU Engines","authors":"Wei Zhang;Yunlong Yu;Xiao Jiang;Nan Guan;Naijun Zhan;Lei Ju","doi":"10.1109/TPDS.2025.3555968","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3555968","url":null,"abstract":"The Deep Learning Processor Unit (DPU) released in the official Xilinx Vitis AI toolchain stands as a commercial off-the-shelf solution tailored for accelerating convolutional neural network (CNN) inference on Xilinx FPGA devices. While most FPGA accelerators focus on high performance and energy efficiency, analyzing the worst-case execution time (WCET) bound is essential for using CNN accelerations in real-time embedded systems design. In this work, we show that in a multi-DPU environment, the observed worst-case inference time for a CNN inference task could become 3X larger than the best-case inference time, which underscores the importance of a static timing analysis for FPGA-based CNN inference. We propose, to the best of the authors’ knowledge, the first static timing analysis framework for CNN inference in a multi-DPU environment. The proposed framework introduces a generalized timing behavior model for shared bus arbitration and memory access contention between parallel running DPU engines. Additionally, it incorporates a fine-grained memory access contention analysis that takes into account the characteristics of deep learning applications. For a single-DPU environment, the analysis result is 27% tighter on average compared with the state-of-the-art results. 
Furthermore, our proposed method produces relatively tight estimated results in the multi-DPU environment.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1146-1160"},"PeriodicalIF":5.6,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143848782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraFetch: Accelerating Graph Applications Through Domain Specific Hierarchical Hybrid Prefetching","authors":"Pengmiao Zhang;Rajgopal Kannan;Viktor K. Prasanna","doi":"10.1109/TPDS.2025.3575106","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3575106","url":null,"abstract":"Memory performance bottlenecks the execution of graph applications, from traditional graph analytics (GA) to rapidly evolving graph neural networks (GNNs), due to the large size and complexity of graphs. While machine learning (ML) algorithms have shown potential in data prefetching to hide memory access latency, existing approaches face challenges with phase transitions and irregular memory access patterns in graph applications. To address these challenges, we introduce GraFetch, a specialized prefetching system for accelerating graph applications. GraFetch comprises 1) a novel Hierarchical Hybrid Prefetching (HHP) framework that supports the cooperation of phase-specific ML predictors for high-complexity pattern prefetching and rule-based prefetchers for low-complexity pattern prefetching; and 2) Domain Specific Machine Learning (DSML) models integrated in the framework, which incorporate domain knowledge of graph applications to detect phases, recognize patterns, and predict memory accesses. We evaluate our approach using popular GA frameworks GPOP and X-Stream, and state-of-the-art GNN frameworks PyG and DGL. Our domain specific attention-based memory access predictors achieve 7.4% higher F1-score for delta (consecutive address jump) prediction and 15.35% higher accuracy@10 for page prediction compared with basic attention models. GraFetch achieves an average IPC improvement of 12.47% for GA and 4.18% for GNNs over a system with no prefetcher. 
This outperforms state-of-the-art rule-based prefetchers BO (7.12% for GA, 1.10% for GNNs), ISB (3.82% for GA, 1.60% for GNNs), and IMP (8.47% for GA, 2.20% for GNNs), as well as ML-based prefetchers Voyager (9.61% for GA, 3.14% for GNNs) and TransFetch (10.98% for GA, 2.48% for GNNs).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1542-1559"},"PeriodicalIF":5.6,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFU-E: A Dataflow Architecture for Edge DSP and AI Applications","authors":"Wenming Li;Zhihua Fan;Tianyu Liu;Zhen Wang;Haibin Wu;Meng Wu;Kunming Zhang;Yanhuan Liu;Ninghui Sun;Xiaochun Ye;Dongrui Fan","doi":"10.1109/TPDS.2025.3555329","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3555329","url":null,"abstract":"Edge computing aims to enable swift, real-time data processing, analysis, and storage close to the data source. However, edge computing platforms are often constrained by limited processing power and efficiency. This paper presents DFU-E, a dataflow-based accelerator specifically designed to meet the demands of edge digital signal processing (DSP) and artificial intelligence (AI) applications. Our design addresses real-world requirements with three main innovations. First, to accommodate the diverse algorithms utilized at the edge, we propose a multi-layer dataflow mechanism capable of exploiting task-level, instruction block-level, instruction-level, and data-level parallelism. Second, we develop an edge dataflow architecture that includes a customized processing element (PE) array, memory, and on-chip network microarchitecture optimized for the multi-layer dataflow mechanism. Third, we design an edge dataflow software stack that enables automatic optimizations through operator fusion, dataflow graph mapping, and task scheduling. We utilize representative real-world DSP and AI applications for evaluation. 
Compared with Nvidia's state-of-the-art edge computing processor, DFU-E achieves up to 1.42× geometric mean performance improvement and 1.27× energy efficiency improvement.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1100-1114"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Workload-Aware Performance Model Based Soft Preemptive Real-Time Scheduling for Neural Processing Units","authors":"Yuan Yao;Yujiao Hu;Yi Dang;Wei Tao;Kai Hu;Qiming Huang;Zhe Peng;Gang Yang;Xingshe Zhou","doi":"10.1109/TPDS.2025.3553922","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553922","url":null,"abstract":"A neural processing unit (NPU) is a microprocessor which is specially designed for various types of neural network applications. Because of its high acceleration efficiency and lower power consumption, the airborne embedded system has widely deployed NPU to replace GPU as the new accelerator. Unfortunately, the inherent scheduler of NPU does not consider real-time scheduling. Therefore, it cannot meet real-time requirements of airborne embedded systems. At present, there is less research on the multi-task real-time scheduling of the NPU device. In this article, we first design an NPU resource management framework based on Kubernetes. Then, we propose WAMSPRES, a workload-aware NPU performance model based soft preemptive real-time scheduling method. The proposed workload-aware NPU performance model can accurately predict the remaining execution time of the task when it runs with other tasks concurrently. The soft preemptive real-time scheduling algorithm can provide approximate preemption capability by dynamically adjusting the NPU computing resources of tasks. Finally, we implement a prototype NPU scheduler of the airborne embedded system for the fixed-wing UAV. The proposed models and algorithms are validated on both the simulated and realistic task sets. 
Experimental results illustrate that WAMSPRES can achieve low prediction error and high scheduling success rate.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1058-1070"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GIF-FHE: A Comprehensive Implementation and Evaluation of GPU-Accelerated FHE With Integer and Floating-Point Computing Power","authors":"Fangyu Zheng;Guang Fan;Wenxu Tang;Yixuan Song;Tian Zhou;Yuan Zhao;Jiankuo Dong;Jingqiang Lin;Shoumeng Yan;Jiwu Jing","doi":"10.1109/TPDS.2025.3574481","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3574481","url":null,"abstract":"Fully Homomorphic Encryption (FHE) allows computations on encrypted data without revealing the plaintext, garnering significant interest from both academic and industrial communities. However, its broader adoption has been hindered by performance limitations. Consequently, researchers have turned to GPUs for efficient FHE implementation. Nevertheless, most have predominantly favored integer units due to their ease of use, overlooking the considerable computational potential of floating-point units in GPUs. Recognizing this untapped floating-point computational power, our article introduces <b>GIF-FHE</b>, an extensive exploration and implementation of FHE, leveraging GPUs’ integer and floating-point instructions for FHE acceleration. We develop a comprehensive suite of low-level and middle-level FHE primitives, offering multiple implementation variants with support for three word size configurations (<inline-formula><tex-math>$64/52/32$</tex-math></inline-formula>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1524-1541"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization","authors":"Luca Colagrande;Luca Benini","doi":"10.1109/TPDS.2025.3555718","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3555718","url":null,"abstract":"Heterogeneous multi-core architectures combine on a single chip a few large, general-purpose <italic>host</i> cores, optimized for single-thread performance, with (many) clusters of small, specialized, energy-efficient <italic>accelerator</i> cores for data-parallel processing. Offloading a computation to the many-core acceleration fabric implies synchronization and communication overheads which can hamper overall performance and efficiency, particularly for small and fine-grained parallel tasks. In this work, we present a detailed, cycle-accurate quantitative analysis of the offload overheads on Occamy, an open-source massively parallel RISC-V based heterogeneous MPSoC. We study how the overheads scale with the number of accelerator cores. We explore an approach to drastically reduce these overheads by co-designing the hardware and the offload routines. Notably, we demonstrate that by incorporating multicast capabilities into the Network-on-Chip of a large (200+ cores) accelerator fabric we can improve offloaded application runtimes by as much as 2.3x, restoring more than 70% of the ideally attainable speedups. 
Finally, we propose a quantitative model to estimate the runtime of selected applications accounting for the offload overheads, with an error consistently below 15%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1193-1205"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coordinating Computational Capacity for Adaptive Federated Learning in Heterogeneous Edge Computing Systems","authors":"Kechang Yang;Biao Hu;Mingguo Zhao","doi":"10.1109/TPDS.2025.3574718","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3574718","url":null,"abstract":"With the rapid growth of IoT technology and the rise of smart devices, edge computing, particularly federated learning (FL), has gained importance for preserving user data privacy. However, FL faces challenges like non-independent identically distributed data and device heterogeneity, leading to model disparities and reduced precision. Our research proposes a novel adaptive FL framework specifically engineered to synchronize computational capacities within heterogeneous edge computing landscapes. Building upon the proof of convergence boundaries for the local aggregation model, this algorithm adapts the number of iterations for local updates by considering the resource consumption relationship between the local aggregation model and the local updated model by various clients. This method exhibits adaptability within an environment where disparities in edge device computational capacities exist, effectively balancing computational prowess among diverse devices and enhancing the output performance of federated learning. 
Experiments on MNIST and PlantVillage datasets show that in heterogeneous environments, our algorithm outperforms existing methods, improving the loss function by at least 16.87% and the convergence speed by at least 2 times, in various environments (MobileNet, AlexNet).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1509-1523"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mimi Qian;Lin Cui;Xiaoquan Zhang;Fung Po Tso;Yuhui Deng;Zhetao Li;Weijia Jia
{"title":"DisPLOY: Target-Constrained Distributed Deployment for Network Measurement Tasks on Data Plane","authors":"Mimi Qian;Lin Cui;Xiaoquan Zhang;Fung Po Tso;Yuhui Deng;Zhetao Li;Weijia Jia","doi":"10.1109/TPDS.2025.3572246","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3572246","url":null,"abstract":"In programmable networks, measurement tasks are placed on programmable switches to monitor network traffic at line rate. These tasks typically require substantial resources (e.g., significant SRAM), while programmable switches are constrained by limited resources due to their hardware design (e.g., Tofino ASIC), making distributed deployment essential. Measurement tasks must monitor specific network locations or traffic flows, introducing significant complexity in deployment optimization. This target-constrained nature makes task optimization on switches (e.g., task merging) device-dependent and order-dependent, which can lead to deployment failures or performance degradation if ignored. In this paper, we introduce <italic>DisPLOY</i>, a novel target-constrained distributed deployment framework specifically designed for network measurement tasks on the data plane. <italic>DisPLOY</i> enables operators to specify monitoring targets—network traffic or device/link—across multiple switches. Given the monitoring targets, <italic>DisPLOY</i> effectively minimizes redundant operations and optimizes deployment to achieve both resource efficiency (e.g., minimizing stage consumption) and high-performance monitoring (e.g., high accuracy). We implement and evaluate <italic>DisPLOY</i> through deployment on both P4 hardware switches (Intel Tofino ASIC) and BMv2. 
Experimental results show that <italic>DisPLOY</i> significantly reduces stage consumption by up to 66% and improves ARE by up to 78.4% in flow size estimation while maintaining end-to-end performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1608-1619"},"PeriodicalIF":5.6,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jan Laukemann;Ahmed E. Helal;S. Isaac Geronimo Anderson;Fabio Checconi;Yongseok Soh;Jesmin Jahan Tithi;Teresa Ranadive;Brian J. Gravelle;Fabrizio Petrini;Jee Choi
{"title":"Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation","authors":"Jan Laukemann;Ahmed E. Helal;S. Isaac Geronimo Anderson;Fabio Checconi;Yongseok Soh;Jesmin Jahan Tithi;Teresa Ranadive;Brian J. Gravelle;Fabrizio Petrini;Jee Choi","doi":"10.1109/TPDS.2025.3553092","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553092","url":null,"abstract":"High-dimensional sparse data emerge in many critical application domains such as healthcare and cybersecurity. To extract meaningful insights from massive volumes of these multi-dimensional data, scientists employ unsupervised analysis tools based on tensor decomposition (TD) methods. However, real-world sparse tensors exhibit highly irregular shapes and data distributions, which pose significant challenges for making efficient use of modern parallel processors. This study breaks the prevailing assumption that compressing sparse tensors into coarse-grained structures (i.e., tensor slices or blocks) or along a particular dimension/mode (i.e., mode-specific) is more efficient than keeping them in a fine-grained, mode-agnostic form. Our novel sparse tensor representation, Adaptive Linearized Tensor Order (<inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula>), encodes tensors in a compact format that can be easily streamed from memory and is amenable to both caching and parallel execution. In contrast to existing compressed tensor formats, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula> constructs one tensor copy that is agnostic to both the mode orientation and the irregular distribution of nonzero elements. To demonstrate the efficacy of <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula>, we accelerate popular TD methods that compute the Canonical Polyadic Decomposition (CPD) model across different types of sparse tensors. 
We propose a set of parallel TD algorithms that exploit the inherent data reuse of tensor computations to substantially reduce synchronization overhead, decrease memory footprint, and improve parallel performance. Additionally, we characterize the major execution bottlenecks of TD methods on multiple generations of the latest Intel Xeon Scalable processors, including Sapphire Rapids CPUs, and introduce dynamic adaptation heuristics to automatically select the best algorithm based on the sparse tensor characteristics. Across a diverse set of real-world data sets, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula> outperforms the state-of-the-art approaches, achieving more than an order-of-magnitude speedup over the best mode-agnostic formats. Compared to the best mode-specific formats, which require multiple tensor copies, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula> achieves <inline-formula><tex-math>$5.1\\times$</tex-math></inline-formula> geometric mean speedup at a fraction (25%) of their storage costs. Moreover, <inline-formula><tex-math>${\\sf ALTO}$</tex-math></inline-formula> obtains <inline-formula><tex-math>$8.4\\times$</tex-math></inline-formula> geometric mean speedup over the state-of-the-art memoization approach, which reduces computations by using extra memory, while requiring 14% of its memory consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1025-1041"},"PeriodicalIF":5.6,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters","authors":"Wei Gao;Zhuoyuan Ouyang;Peng Sun;Tianwei Zhang;Yonggang Wen","doi":"10.1109/TPDS.2025.3553137","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553137","url":null,"abstract":"The high resource demand of deep learning training (DLT) workloads necessitates the design of efficient schedulers. While most existing schedulers expedite DLT workloads by considering GPU sharing and elastic training, they neglect <italic>layer elasticity</i>, which dynamically freezes certain layers of a network. This technique has been shown to significantly speed up individual workloads. In this paper, we explore how to incorporate <italic>layer elasticity</i> into DLT scheduler designs to achieve higher cluster-wide efficiency. A key factor that hinders the application of layer elasticity in GPU clusters is the potential loss in model accuracy, making users reluctant to enable layer elasticity for their workloads. It is necessary to have an efficient layer-elastic system, which can well balance training accuracy and speed for layer elasticity. We introduce <sc>IceFrog</small>, the first scheduling system that utilizes layer elasticity to improve the efficiency of DLT workloads in GPU clusters. It achieves this goal with superior algorithmic designs and intelligent resource management. In particular, (1) we model the frozen penalty and layer-aware throughput to measure the effective progress metric of layer-elastic workloads. (2) We design a novel scheduler to further improve the efficiency of layer elasticity. We implement and deploy <sc>IceFrog</small> in a physical cluster of 48 GPUs. 
Extensive evaluations and large-scale simulations show that <sc>IceFrog</small> reduces average job completion times by 36-48% relative to state-of-the-art DL schedulers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1071-1086"},"PeriodicalIF":5.6,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}