Identifying Patterns in Fault Recovery Techniques and Hardware Status of Radiation Tolerant Computers Using Principal Components Analysis
F. Ramezani, Christopher M. Major, Colter Barney, Justin Williams, B. Lameres, Bradley M. Whitaker
2022 Intermountain Engineering, Technology and Computing (IETC), published 2022-05-01
DOI: 10.1109/ietc54973.2022.9796883 (https://doi.org/10.1109/ietc54973.2022.9796883)
Abstract
Fault-tolerant computers have been developed in recent years to operate in the harsh radiation environment of outer space. These computers employ multiple copies of soft processors in a reconfigurable hardware environment and can automatically repair faults caused by radiation strikes. However, certain recovery procedures halt data collection and processing, so valuable scientific data can be lost. In addition, current fault recovery procedures may inadvertently make the computer more susceptible to faults or errors, for example by introducing voltage and temperature changes. Machine learning feature extraction algorithms have the potential to reduce data loss by identifying patterns related to computational fault mitigation and recovery techniques. In this work, we will gather telemetry data from RadPC, a reconfigurable, radiation-tolerant computer that has been developed over the past 12 years by Montana State University to advance high-performance space computing under varying environmental conditions. RadPC has recently been configured to provide regular telemetry data that measure and communicate the performance of the radiation-tolerant computing platform. Specifically, the telemetry data include information about data memory integrity, faults experienced, and successful repairs, as well as measurements of voltage, current, and temperature. Although RadPC has been under development for some time, the developers have never searched the telemetry data for associations between fault recovery procedures and the physical state of the hardware itself (e.g., voltage and current levels of power supplies or internal temperature). In this work, the computer will be subjected to synthetic faults that emulate radiation strikes that may occur in space, and it will perform its standard recovery procedures.
The tests will be performed with RadPC on a high-altitude balloon flight as well as inside a temperature-controlled vacuum chamber, allowing for a range of controlled external environmental conditions. The collected telemetry data will be analyzed using principal components analysis (PCA) to detect patterns in the hardware status associated with fault recovery techniques. Identifying these patterns may lead to improved fault mitigation strategies that reduce the risk of subsequent faults by considering how recovery techniques affect the physical state of the hardware.
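As a rough illustration of the kind of analysis described above, the following is a minimal sketch of PCA applied to telemetry-like data. It does not reproduce the authors' pipeline or RadPC's actual telemetry format: the column choice (voltage, current, temperature), the numeric values, the `recovering` flag, and the function names `make_synthetic_telemetry` and `pca` are all assumptions made for illustration, and the data are randomly generated rather than measured.

```python
import numpy as np


def make_synthetic_telemetry(n=500, seed=0):
    """Generate hypothetical telemetry; NOT real RadPC data.

    Columns are voltage, current, temperature. Samples flagged as
    'recovering' get an assumed shift in all three measurements.
    """
    rng = np.random.default_rng(seed)
    recovering = rng.random(n) < 0.2
    voltage = 1.0 + 0.01 * rng.standard_normal(n) - 0.03 * recovering
    current = 0.5 + 0.02 * rng.standard_normal(n) + 0.10 * recovering
    temperature = 30.0 + 0.5 * rng.standard_normal(n) + 2.0 * recovering
    X = np.column_stack([voltage, current, temperature])
    return X, recovering


def pca(X, n_components=2):
    """PCA via SVD on standardized columns of X (samples x features).

    Returns (components, scores, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    components = Vt[:n_components]
    scores = Z @ components.T
    return components, scores, explained[:n_components]


if __name__ == "__main__":
    X, recovering = make_synthetic_telemetry()
    components, scores, explained = pca(X)
    print("explained variance ratio:", explained)
    print("PC1 loadings (voltage, current, temperature):", components[0])
    print("mean PC1 score, recovering:", scores[recovering, 0].mean())
    print("mean PC1 score, nominal:   ", scores[~recovering, 0].mean())
```

In this sketch the columns are standardized before the decomposition because voltage, current, and temperature have different units and scales. If recovery procedures shift the hardware measurements together, as assumed here, the first principal component tends to separate recovery samples from nominal ones, and its loadings indicate which measurements move jointly.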