IEEE Transactions on Parallel and Distributed Systems最新文献_第5页

Workload-Aware Performance Model Based Soft Preemptive Real-Time Scheduling for Neural Processing Units 基于负载感知性能模型的神经处理单元软抢占实时调度

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-28 DOI: 10.1109/TPDS.2025.3553922

Yuan Yao;Yujiao Hu;Yi Dang;Wei Tao;Kai Hu;Qiming Huang;Zhe Peng;Gang Yang;Xingshe Zhou

{"title":"Workload-Aware Performance Model Based Soft Preemptive Real-Time Scheduling for Neural Processing Units","authors":"Yuan Yao;Yujiao Hu;Yi Dang;Wei Tao;Kai Hu;Qiming Huang;Zhe Peng;Gang Yang;Xingshe Zhou","doi":"10.1109/TPDS.2025.3553922","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553922","url":null,"abstract":"A neural processing unit (NPU) is a microprocessor which is specially designed for various types of neural network applications. Because of its high acceleration efficiency and lower power consumption, the airborne embedded system has widely deployed NPU to replace GPU as the new accelerator. Unfortunately, the inherent scheduler of NPU does not consider real-time scheduling. Therefore, it cannot meet real-time requirements of airborne embedded systems. At present, there is less research on the multi-task real-time scheduling of the NPU device. In this article, we first design an NPU resource management framework based on Kubernetes. Then, we propose WAMSPRES, a workload-aware NPU performance model based soft preemptive real-time scheduling method. The proposed workload-aware NPU performance model can accurately predict the remaining execution time of the task when it runs with other tasks concurrently. The soft preemptive real-time scheduling algorithm can provide approximate preemption capability by dynamically adjusting the NPU computing resources of tasks. Finally, we implement a prototype NPU scheduler of the airborne embedded system for the fixed-wing UAV. The proposed models and algorithms are validated on both the simulated and realistic task sets. Experimental results illustrate that WAMSPRES can achieve low prediction error and high scheduling success rate.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1058-1070"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

GIF-FHE: A Comprehensive Implementation and Evaluation of GPU-Accelerated FHE With Integer and Floating-Point Computing Power GIF-FHE：具有整数和浮点计算能力的gpu加速FHE的综合实现和评估

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-28 DOI: 10.1109/TPDS.2025.3574481

Fangyu Zheng;Guang Fan;Wenxu Tang;Yixuan Song;Tian Zhou;Yuan Zhao;Jiankuo Dong;Jingqiang Lin;Shoumeng Yan;Jiwu Jing

引用次数: 0

Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization 在大规模并行开源RISC-V MPSoC中驯服卸载开销：分析和优化

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-28 DOI: 10.1109/TPDS.2025.3555718

Luca Colagrande;Luca Benini

{"title":"Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization","authors":"Luca Colagrande;Luca Benini","doi":"10.1109/TPDS.2025.3555718","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3555718","url":null,"abstract":"Heterogeneous multi-core architectures combine on a single chip a few large, general-purpose <italic>host cores, optimized for single-thread performance, with (many) clusters of small, specialized, energy-efficient <italic>accelerator cores for data-parallel processing. Offloading a computation to the many-core acceleration fabric implies synchronization and communication overheads which can hamper overall performance and efficiency, particularly for small and fine-grained parallel tasks. In this work, we present a detailed, cycle-accurate quantitative analysis of the offload overheads on Occamy, an open-source massively parallel RISC-V based heterogeneous MPSoC. We study how the overheads scale with the number of accelerator cores. We explore an approach to drastically reduce these overheads by co-designing the hardware and the offload routines. Notably, we demonstrate that by incorporating multicast capabilities into the Network-on-Chip of a large (200+ cores) accelerator fabric we can improve offloaded application runtimes by as much as 2.3x, restoring more than 70% of the ideally attainable speedups. Finally, we propose a quantitative model to estimate the runtime of selected applications accounting for the offload overheads, with an error consistently below 15%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1193-1205"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Coordinating Computational Capacity for Adaptive Federated Learning in Heterogeneous Edge Computing Systems 异构边缘计算系统中自适应联邦学习的协调计算能力

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-28 DOI: 10.1109/TPDS.2025.3574718

Kechang Yang;Biao Hu;Mingguo Zhao

{"title":"Coordinating Computational Capacity for Adaptive Federated Learning in Heterogeneous Edge Computing Systems","authors":"Kechang Yang;Biao Hu;Mingguo Zhao","doi":"10.1109/TPDS.2025.3574718","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3574718","url":null,"abstract":"With the rapid growth of IoT technology and the rise of smart devices, edge computing, particularly federated learning (FL), has gained importance for preserving user data privacy. However, FL faces challenges like non-independent identically distributed data and device heterogeneity, leading to model disparities and reduced precision. Our research proposes a novel adaptive FL framework specifically engineered to synchronize computational capacities within heterogeneous edge computing landscapes. Building upon the proof of convergence boundaries for local aggregation model, this algorithm adapts the number of iterations for local updates by considering the resource consumption relationship between local aggregation model and the local updated model by various clients. This method exhibit adaptability within an environment where disparities in edge device computational capacities exist, effectively balancing computational prowess among diverse devices and enhancing the output performance of federated learning. Experiments on MNIST and PlantVillage datasets show that in heterogeneous environments, our algorithm outperforms existing methods, improving the loss function by at least 16.87% and the convergence speed by at least 2 times, in various environments (MobileNet, AlexNet).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1509-1523"},"PeriodicalIF":5.6,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

ABSE: Adaptive Baseline Score-Based Election for Leader-Based BFT Systems ABSE：基于领导的BFT系统的自适应基线得分选举

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-22 DOI: 10.1109/TPDS.2025.3572553

Xuyang Liu;Zijian Zhang;Zhen Li;Hao Yin;Meng Li;Jiamou Liu;Mauro Conti;Liehuang Zhu

{"title":"ABSE: Adaptive Baseline Score-Based Election for Leader-Based BFT Systems","authors":"Xuyang Liu;Zijian Zhang;Zhen Li;Hao Yin;Meng Li;Jiamou Liu;Mauro Conti;Liehuang Zhu","doi":"10.1109/TPDS.2025.3572553","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3572553","url":null,"abstract":"Leader-based BFT systems face potential disruption and performance degradation from malicious leaders, with current solutions often lacking scalability or greatly increasing complexity. In this paper, we introduce ABSE, an Adaptive Baseline Score-based Election approach to mitigate the negative impact of malicious leaders on leader-based BFT systems. ABSE is fully localized and proposes to accumulate scores for processes based on their contribution to consensus advancement, aiming to bypass less reliable participants when electing leaders. We present a formal treatment of ABSE, addressing the primary design and implementation challenges, defining its generic components and rules for adherence to ensure global consistency. We also apply ABSE to two different BFT protocols, demonstrating its scalability and negligible impact on protocol complexity. Finally, by building a system prototype and conducting experiments on it, we demonstrate that ABSE-enhanced protocols can effectively minimize the disruptions caused by malicious leaders, whilst incurring minimal additional resource overhead and maintaining base performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1634-1650"},"PeriodicalIF":5.6,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144472548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DisPLOY: Target-Constrained Distributed Deployment for Network Measurement Tasks on Data Plane 数据平面网络测量任务的目标约束分布式部署

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-21 DOI: 10.1109/TPDS.2025.3572246

Mimi Qian;Lin Cui;Xiaoquan Zhang;Fung Po Tso;Yuhui Deng;Zhetao Li;Weijia Jia

{"title":"DisPLOY: Target-Constrained Distributed Deployment for Network Measurement Tasks on Data Plane","authors":"Mimi Qian;Lin Cui;Xiaoquan Zhang;Fung Po Tso;Yuhui Deng;Zhetao Li;Weijia Jia","doi":"10.1109/TPDS.2025.3572246","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3572246","url":null,"abstract":"In programmable networks, measurement tasks are placed on programmable switches to monitor network traffic at line rate. These tasks typically require substantial resources (e.g., significant SRAM), while programmable switches are constrained by limited resources due to their hardware design (e.g., Tofino ASIC), making distributed deployment essentially. Measurement tasks must monitor specific network locations or traffic flows, introducing significant complexity in deployment optimization. This target-constrained nature makes task optimization on switches (e.g., task merging) become device-dependent and order-dependent, which can lead to deployment failures or performance degradation if ignored. In this paper, we introduce <italic>DisPLOY, a novel target-constrained distributed deployment framework specifically designed for network measurement tasks on the data plane. <italic>DisPLOY enables operators to specify monitoring targets—network traffic or device/link—across multiple switches. Given the monitoring targets, <italic>DisPLOY effectively minimizes redundant operations and optimizes deployment to achieve both resource efficiency (e.g., minimizing stage consumption) and high-performance monitoring (e.g., high accuracy). We implement and evaluate <italic>DisPLOY through deployment on both P4 hardware switches (Intel Tofino ASIC) and BMv2. Experimental results show that <italic>DisPLOY significantly reduces stage consumption by up to 66% and improves ARE by up to 78.4% in flow size estimation while maintaining end-to-end performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1608-1619"},"PeriodicalIF":5.6,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation 自适应线性化表示加速稀疏张量分解

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-20 DOI: 10.1109/TPDS.2025.3553092

Jan Laukemann;Ahmed E. Helal;S. Isaac Geronimo Anderson;Fabio Checconi;Yongseok Soh;Jesmin Jahan Tithi;Teresa Ranadive;Brian J. Gravelle;Fabrizio Petrini;Jee Choi

{"title":"Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation","authors":"Jan Laukemann;Ahmed E. Helal;S. Isaac Geronimo Anderson;Fabio Checconi;Yongseok Soh;Jesmin Jahan Tithi;Teresa Ranadive;Brian J. Gravelle;Fabrizio Petrini;Jee Choi","doi":"10.1109/TPDS.2025.3553092","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553092","url":null,"abstract":"High-dimensional sparse data emerge in many critical application domains such as healthcare and cybersecurity. To extract meaningful insights from massive volumes of these multi-dimensional data, scientists employ unsupervised analysis tools based on tensor decomposition (TD) methods. However, real-world sparse tensors exhibit highly irregular shapes and data distributions, which pose significant challenges for making efficient use of modern parallel processors. This study breaks the prevailing assumption that compressing sparse tensors into coarse-grained structures (i.e., tensor slices or blocks) or along a particular dimension/mode (i.e., mode-specific) is more efficient than keeping them in a fine-grained, mode-agnostic form. Our novel sparse tensor representation, Adaptive Linearized Tensor Order (<inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula>), encodes tensors in a compact format that can be easily streamed from memory and is amenable to both caching and parallel execution. In contrast to existing compressed tensor formats, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula> constructs one tensor copy that is agnostic to both the mode orientation and the irregular distribution of nonzero elements. To demonstrate the efficacy of <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula>, we accelerate popular TD methods that compute the Canonical Polyadic Decomposition (CPD) model across different types of sparse tensors. We propose a set of parallel TD algorithms that exploit the inherent data reuse of tensor computations to substantially reduce synchronization overhead, decrease memory footprint, and improve parallel performance. Additionally, we characterize the major execution bottlenecks of TD methods on multiple generations of the latest Intel Xeon Scalable processors, including Sapphire Rapids CPUs, and introduce dynamic adaptation heuristics to automatically select the best algorithm based on the sparse tensor characteristics. Across a diverse set of real-world data sets, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula> outperforms the state-of-the-art approaches, achieving more than an order-of-magnitude speedup over the best mode-agnostic formats. Compared to the best mode-specific formats, which require multiple tensor copies, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula>achieves <inline-formula><tex-math>$5.1times$</tex-math></inline-formula> geometric mean speedup at a fraction (25% ) of their storage costs. Moreover, <inline-formula><tex-math>${sf ALTO}$</tex-math></inline-formula> obtains <inline-formula><tex-math>$8.4times$</tex-math></inline-formula> geometric mean speedup over the state-of-the-art memoization approach, which reduces computations by using extra memory, while requiring 14% of its memory consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1025-1041"},"PeriodicalIF":5.6,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters 基于GPU集群的深度学习训练分层弹性调度系统

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-20 DOI: 10.1109/TPDS.2025.3553137

Wei Gao;Zhuoyuan Ouyang;Peng Sun;Tianwei Zhang;Yonggang Wen

{"title":"IceFrog: A Layer-Elastic Scheduling System for Deep Learning Training in GPU Clusters","authors":"Wei Gao;Zhuoyuan Ouyang;Peng Sun;Tianwei Zhang;Yonggang Wen","doi":"10.1109/TPDS.2025.3553137","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3553137","url":null,"abstract":"The high resource demand of deep learning training (DLT) workloads necessitates the design of efficient schedulers. While most existing schedulers expedite DLT workloads by considering GPU sharing and elastic training, they neglect <italic>layer elasticity, which dynamically freezes certain layers of a network. This technique has been shown to significantly speed up individual workloads. In this paper, we explore how to incorporate <italic>layer elasticity into DLT scheduler designs to achieve higher cluster-wide efficiency. A key factor that hinders the application of layer elasticity in GPU clusters is the potential loss in model accuracy, making users reluctant to enable layer elasticity for their workloads. It is necessary to have an efficient layer-elastic system, which can well balance training accuracy and speed for layer elasticity. We introduce <sc>IceFrog, the first scheduling system that utilizes layer elasticity to improve the efficiency of DLT workloads in GPU clusters. It achieves this goal with superior algorithmic designs and intelligent resource management. In particular, (1) we model the frozen penalty and layer-aware throughput to measure the effective progress metric of layer-elastic workloads. (2) We design a novel scheduler to further improve the efficiency of layer elasticity. We implement and deploy <sc>IceFrog in a physical cluster of 48 GPUs. Extensive evaluations and large-scale simulations show that <sc>IceFrog reduces average job completion times by 36-48% relative to state-of-the-art DL schedulers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 6","pages":"1071-1086"},"PeriodicalIF":5.6,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143845436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

ZeroTracer: In-Band eBPF-Based Trace Generator With Zero Instrumentation for Microservice Systems 零跟踪器：带内基于ebpf的跟踪发生器与零仪表微服务系统

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-19 DOI: 10.1109/TPDS.2025.3571934

Wanqi Yang;Pengfei Chen;Kai Liu;Huxing Zhang

{"title":"ZeroTracer: In-Band eBPF-Based Trace Generator With Zero Instrumentation for Microservice Systems","authors":"Wanqi Yang;Pengfei Chen;Kai Liu;Huxing Zhang","doi":"10.1109/TPDS.2025.3571934","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3571934","url":null,"abstract":"Microservice enables agility in modern cloud-native applications but introduces challenges in fault troubleshooting due to its complex service coordination and cooperation. To tackle these challenges, distributed tracing has emerged for end-to-end request tracing and system understanding. However, existing tracing solutions often suffer from code instrumentation, trace loss and inaccuracy. To overcome these limitations, we introduce ZeroTracer, an in-kernel online distributed tracing system equipped with an eBPF-based (extended Berkeley Packet Filter) trace generator. ZeroTracer tailors for tracking HTTP requests due to its popularity in microservice systems. In our evaluations, ZeroTracer achieves remarkable trace accuracy (i.e., over 91% ) and maintains stable performance under different workload concurrency. Moreover, ZeroTracer outperforms other non-invasive approaches which fail to reconcile accurate request causality. Notably, ZeroTracer effectively tracks end-to-end requests in multi-threaded microservice applications, which is absent in existing invasive distributed tracing systems with third-party library instrumentation. Moreover, ZeroTracer introduces a negligible overhead, with latency increasing by only 0.5% –1.2% and a modest 3% –5.8% increase in CPU and memory consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1478-1494"},"PeriodicalIF":5.6,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

GEREM: Fast and Precise Error Resilience Assessment for GPU Microarchitectures GPU微架构的快速和精确的错误弹性评估

IF 5.6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-03-18 DOI: 10.1109/TPDS.2025.3552679

Jingweijia Tan;Xurui Li;An Zhong;Kaige Yan;Xiaohui Wei;Guanpeng Li

{"title":"GEREM: Fast and Precise Error Resilience Assessment for GPU Microarchitectures","authors":"Jingweijia Tan;Xurui Li;An Zhong;Kaige Yan;Xiaohui Wei;Guanpeng Li","doi":"10.1109/TPDS.2025.3552679","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3552679","url":null,"abstract":"GPUs are widely used hardware acceleration platforms in many areas due to their great computational throughput. In the meanwhile, GPUs are vulnerable to transient hardware faults in the post-Moore era. Analyzing the error resilience of GPUs are critical for both hardware and software. Statistical fault injection approaches are commonly used for error resilience analysis, which are highly accurate but very time consuming. In this work, we propose GEREM, a first framework to speed up fault injection process so as to estimate the error resilience of GPU microarchitectures swiftly and precisely. We find early fault behaviors can be used to accurately predict the final outcomes of program execution. Based on this observation, we categorize the early behaviors of hardware faults into GPU Early Fault Manifestation models (EFMs). For data structures, EFMs are early propagation characteristics of faults, while for pipeline instructions, EFMs are heuristic properties of several instruction contexts. We further observe that EFMs are determined by static microarchitecture states, so we can capture them without actually simulating the program execution process under fault injections. Leveraging these observations, our GEREM framework first profiles the microarchitectural states related for EFMs at one time. It then injects faults into the profiled traces to immediately generate EFMs. For data storage structures, EFMs are directly used to predict final fault outcomes, while for pipeline instructions, machine learning is used for prediction. Evaluation results show GEREM precisely assesses the error resilience of GPU microarchitecture structures with <inline-formula><tex-math>$237times$</tex-math></inline-formula> speedup on average comparing with traditional fault injections.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1011-1024"},"PeriodicalIF":5.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0