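Identifying Patterns in Fault Recovery Techniques and Hardware Status of Radiation Tolerant Computers Using Principal Components Analysis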

F. Ramezani, Christopher M. Major, Colter Barney, Justin Williams, B. Lameres, Bradley M. Whitaker
Published in: 2022 Intermountain Engineering, Technology and Computing (IETC)
Publication date: 2022-05-01
DOI: https://doi.org/10.1109/ietc54973.2022.9796883
Citations: 0

Abstract

Fault-tolerant computers have been developed in recent years to operate in the harsh radiation environment of outer space. These computers employ multiple copies of soft processors in a reconfigurable hardware environment and can automatically repair faults caused by radiation strikes. However, during certain recovery procedures, data collection and processing can be halted, and valuable scientific data can be lost. In addition, current fault recovery procedures may inadvertently make the computer more susceptible to faults or errors, for example by introducing voltage and temperature changes. Machine learning feature extraction algorithms have the potential to reduce data loss by identifying patterns related to computational fault mitigation and recovery techniques.

In this work, we will gather telemetry data from RadPC, a reconfigurable, radiation-tolerant computer that has been developed over the past 12 years by Montana State University to advance high-performance space computing under varying environmental conditions. RadPC has recently been configured to provide regular telemetry data to measure and communicate the performance of the radiation-tolerant computing platform. Specifically, the telemetry data includes information about data memory integrity, faults experienced, and successful repairs, as well as measurements of voltage, current, and temperature. While RadPC has been under development for some time, the developers have never searched the telemetry data for associations between fault recovery procedures and the physical state of the hardware itself (e.g., voltage and current levels of power supplies or internal temperature).

In this work, the computer will be subjected to synthetic faults, emulating radiation strikes that may occur in space, and will perform standard recovery procedures. The tests will be performed with the RadPC on a high-altitude balloon flight as well as inside a temperature-controlled vacuum chamber, allowing for a range of controlled external environmental conditions. The collected telemetry data will be analyzed using principal components analysis (PCA) to detect patterns in the hardware status associated with fault recovery techniques. Identifying these patterns may lead to improved fault mitigation strategies that reduce the risk of subsequent faults by considering how recovery techniques affect the physical state of the hardware.
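As a rough illustration of the kind of analysis the abstract describes, the sketch below runs PCA on a small synthetic telemetry table with voltage, current, and temperature columns. Everything in it is an assumption made for illustration: the numeric values are invented placeholders rather than RadPC measurements, the function names are hypothetical, and power iteration is used only as a simple, dependency-free way to obtain the first principal component. The abstract does not say which PCA implementation the authors use.

```python
import math
import random


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=200, seed=0):
    """Power iteration: largest eigenvalue of cov and its unit eigenvector."""
    d = len(cov)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


# Synthetic telemetry (ASSUMED values): 140 nominal samples followed by 60
# samples taken during a hypothetical recovery procedure that lowers the
# supply voltage slightly and raises current draw and temperature.
rng = random.Random(42)
labels = [0] * 140 + [1] * 60
telemetry = [
    [
        3.30 - 0.05 * flag + rng.gauss(0.0, 0.01),  # voltage (V)
        0.50 + 0.08 * flag + rng.gauss(0.0, 0.01),  # current (A)
        30.0 + 3.0 * flag + rng.gauss(0.0, 0.5),    # temperature (deg C)
    ]
    for flag in labels
]

z = standardize(telemetry)
cov = covariance(z)
eigenvalue, pc1 = leading_component(cov)

# Share of total variance captured by the first principal component.
ratio = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Project every sample onto PC1 and compare the two operating states.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in z]
mean_nominal = sum(s for s, f in zip(scores, labels) if f == 0) / labels.count(0)
mean_recovery = sum(s for s, f in zip(scores, labels) if f == 1) / labels.count(1)
separation = abs(mean_recovery - mean_nominal)

print("PC1 loadings (voltage, current, temperature):", pc1)
print("explained variance ratio:", ratio)
print("PC1 score separation between states:", separation)
```

In this toy data, the first component loads on voltage with the opposite sign to current and temperature, so PC1 scores separate recovery samples from nominal ones. Whether real RadPC telemetry shows a comparable structure is the open question the study sets out to answer.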