Noah Perryman;Sebastian Sabogal;Christopher Wilson;Alan George
IEEE Transactions on Aerospace and Electronic Systems, vol. 61, no. 3, pp. 6629-6646. Published 2025-01-10.
DOI: 10.1109/TAES.2025.3527938
URL: https://ieeexplore.ieee.org/document/10836953/
Impact factor: 5.7; JCR: Q1 (Engineering, Aerospace)
Dependable DPU Architectures on AMD-Xilinx Versal Adaptive SoCs for Space Applications
Abstract
Space-computing platforms have considerable performance restrictions that are imposed by the limited onboard-processing capabilities provided by heritage flight computers. Conversely, there is a growing need for increased system autonomy enabled by deep learning (DL) to maximize performance and minimize the burden of ground-based processing. To address these limitations, domain-specific architectures with specialized acceleration hardware, such as the AMD-Xilinx Versal adaptive System-on-Chip (SoC), have been developed. This heterogeneous platform contains significant energy-efficient compute capabilities, but it is susceptible to radiation-induced effects. Therefore, the dependability of the device must be characterized prior to inclusion on future space-computing platforms. In addition, several popular DL models exist, but each model provides unique accuracy, performance, energy-efficiency, and dependability characteristics that must be thoroughly understood. In this research, we propose a methodology for evaluating and analyzing dependable computing on AMD-Xilinx deep learning processing unit (DPU) architectures on Versal SoCs using simulated radiation-induced single-event effects through memory-mapped data fault injection. Using our proposed methodology, we perform this fault injection on three Versal AI Core and two Versal AI Edge DPU architectures and evaluate system performance, power consumption, energy efficiency, resource utilization, and dependability on three deployed DL models. Due to innate DPU configurability, our analysis also explores adding varying degrees of triple modular redundancy (TMR) through different DPU architectural features for increased dependability. We leveraged our fault-injection methodology to demonstrate a 24.65× average reduction in critical bits of our TMR DPU architectures compared to the unmitigated baseline, showcasing a significant increase in system dependability.
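The two ideas the abstract combines, bit-flip fault injection into memory and triple modular redundancy, can be illustrated with a toy sketch. This is not the authors' tooling and does not touch a DPU or Versal device: the helper names (`flip_bit`, `tmr_vote`, `count_critical_bits`) are hypothetical, the "memory" is an ordinary byte buffer, and a bit is counted as critical simply when flipping it changes the observed output. The sketch does not attempt to reproduce the reported 24.65× figure.

```python
def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Invert one bit in place, emulating a single-event upset."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)


def tmr_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise majority vote across three equal-length replicas."""
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


def count_critical_bits(golden: bytes, protected: bool) -> int:
    """Flip every bit once (one fault per run) and count the flips
    that leave the observed output different from the golden copy."""
    critical = 0
    for bit in range(len(golden) * 8):
        # Hypothetical setup: one copy when unmitigated, three under TMR.
        replicas = [bytearray(golden) for _ in range(3 if protected else 1)]
        flip_bit(replicas[0], bit)  # inject the fault into one replica only
        out = tmr_vote(*replicas) if protected else bytes(replicas[0])
        critical += out != golden
    return critical


golden = bytes(range(16))  # stand-in for memory-mapped data
unmitigated = count_critical_bits(golden, protected=False)
mitigated = count_critical_bits(golden, protected=True)
print(unmitigated, mitigated)
```

In this toy model every bit of the unprotected buffer is critical, while a single fault in one of three voted replicas is always masked. A real design falls between those extremes, since the voter and any unreplicated logic remain vulnerable, which is why the measured reduction is a finite factor.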
About the journal:
IEEE Transactions on Aerospace and Electronic Systems focuses on the organization, design, development, integration, and operation of complex systems for space, air, ocean, or ground environments. These systems include, but are not limited to, navigation, avionics, spacecraft, aerospace power, radar, sonar, telemetry, defense, transportation, automated testing, and command and control.