A. E. Wilson, Nathan Baker, Ethan Campbell, Michael Wirthlin
{"title":"Improving Fault Tolerance for FPGA SoCs Through Post Radiation Design Analysis","authors":"A. E. Wilson, Nathan Baker, Ethan Campbell, Michael Wirthlin","doi":"10.1145/3674841","DOIUrl":null,"url":null,"abstract":"\n FPGAs have been shown to operate reliably within harsh radiation environments by employing single-event upset (SEU) mitigation techniques such as configuration scrubbing, triple-modular redundancy, error correction coding, and radiation aware implementation techniques. The effectiveness of these techniques, however, is limited when using complex system-level designs that employ complex I/O interfaces with single-point failures. In previous work, a complex SoC system running Linux applied several of these techniques only to obtain an improvement of 14\n \n \\(\\times\\)\n \n in Mean Time to Failure (MTTF). A detailed post-radiation fault analysis found that the limitations in reliability were due to the DDR interface, the global clock network, and interconnect. This paper applied a number of design-specific SEU mitigation techniques to address the limitations in reliability of this design. These changes include triplicating the global clock, optimizing the placement of the reduction output voters and input flip-flops, and employing a mapping technique called “striping”. The application of these techniques improved MTTF of the mitigated design by a factor of 1.54\n \n \\(\\times\\)\n \n and thus provides a 22.8X\n \n \\(\\times\\)\n \n MTTF improvement over the unmitigated design. A post-radiation fault analysis using BFAT was also performed to find the remaining design vulnerabilities.\n","PeriodicalId":505501,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"103 22","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3674841","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
FPGAs have been shown to operate reliably within harsh radiation environments by employing single-event upset (SEU) mitigation techniques such as configuration scrubbing, triple-modular redundancy, error correction coding, and radiation aware implementation techniques. The effectiveness of these techniques, however, is limited when using complex system-level designs that employ complex I/O interfaces with single-point failures. In previous work, a complex SoC system running Linux applied several of these techniques only to obtain an improvement of 14
\(\times\)
in Mean Time to Failure (MTTF). A detailed post-radiation fault analysis found that the limitations in reliability were due to the DDR interface, the global clock network, and interconnect. This paper applied a number of design-specific SEU mitigation techniques to address the limitations in reliability of this design. These changes include triplicating the global clock, optimizing the placement of the reduction output voters and input flip-flops, and employing a mapping technique called “striping”. The application of these techniques improved MTTF of the mitigated design by a factor of 1.54
\(\times\)
and thus provides a 22.8X
\(\times\)
MTTF improvement over the unmitigated design. A post-radiation fault analysis using BFAT was also performed to find the remaining design vulnerabilities.