The upset-fault-observer: A concept for self-healing adaptive fault tolerance

2014 NASA/ESA Conference on Adaptive Hardware and Systems (AHS). Pub Date: 2014-07-14. DOI: 10.1109/AHS.2014.6880163

Byron Navas, Johnny Öberg, I. Sander

Citations: 8

Abstract

As integration advances toward atomic scales, components become highly defect-prone and unstable over their lifetime, which demands paradigm shifts in electronic system design. FPGAs are particularly sensitive to cosmic and other kinds of radiation, which produce single-event upsets (SEUs) in configuration and internal memories. Typical fault-tolerance (FT) techniques combine triple-modular-redundancy (TMR) schemes with run-time reconfiguration (RTR). However, even the most successful approaches disregard the low suitability of fine-grain redundancy in nano-scale design, the poor scalability and programmability of application-specific architectures, the low performance-to-consumption ratio of board-level designs, and the limited optimization capability of rigid redundancy structures. In that context, we introduce an innovative solution that exploits the flexibility, reusability, and scalability of a modular RTR SoC approach and reuses existing RTR IP cores to assemble different TMR schemes at run time. Thus, the system can adaptively trigger the appropriate self-healing strategy according to execution-environment metrics and user-defined goals. Specifically, the paper presents: (a) the upset-fault-observer (UFO), an innovative run-time self-test and recovery strategy that delivers FT on request over several function cores but avoids the scalability cost of redundancy by running periodic reconfigurable TMR scan cycles; (b) run-time reconfigurable TMR schemes and self-repair mechanisms; and (c) an adaptive software organization model to manage the proposed FT strategies.
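The scan-cycle idea in item (a) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not describe the hardware mechanism, so `majority_vote`, `scan_cycle`, the `golden` list of known-good copies, and the toy cores are all hypothetical names introduced here. Each core is visited in turn, compared against two redundant copies in a temporary TMR arrangement, and replaced if it is outvoted, with the replacement standing in for reconfiguring the core.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Hypothetical model: a function core is just a function from int to int.
Core = Callable[[int], int]


def majority_vote(outputs: Sequence[int]) -> Optional[int]:
    """Return the value produced by at least two replicas, or None if no majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


def scan_cycle(cores: List[Core], golden: List[Core], stimulus: int) -> List[int]:
    """Visit every core once, placing it temporarily under TMR observation.

    golden[i] models a freshly configured, known-good copy of cores[i]
    (an assumption of this sketch). Returns the indices of repaired cores.
    """
    repaired = []
    for i, core in enumerate(cores):
        # The observed core plus two redundant copies form a temporary TMR group.
        outputs = [core(stimulus), golden[i](stimulus), golden[i](stimulus)]
        voted = majority_vote(outputs)
        if voted is not None and outputs[0] != voted:
            cores[i] = golden[i]  # stands in for reconfiguring the faulty core
            repaired.append(i)
    return repaired


def double(x: int) -> int:
    return 2 * x


def stuck_at_zero(x: int) -> int:
    # Models a core whose output has been corrupted by an upset.
    return 0


cores = [double, stuck_at_zero]
golden = [double, double]
repaired = scan_cycle(cores, golden, stimulus=21)
print(repaired, cores[1](21))
```

In this model only one core at a time is triplicated, so the redundancy overhead stays constant as the number of observed cores grows, which is the scalability saving the abstract attributes to periodic scan cycles.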

Source publication

2014 NASA/ESA Conference on Adaptive Hardware and Systems (AHS)

Self-citation rate

0.00%

Publication count