An Approach to Cost-Efficient Fault Tolerance in Inherently Redundant Fail-Operational Systems

2020 23rd Euromicro Conference on Digital System Design (DSD) · Pub Date: 2020-08-01 · DOI: 10.1109/DSD51259.2020.00103

Tobias Dörr, T. Sandmann, Patrick Friederich, Arnd Leitner, J. Becker

Citations: 1

Abstract

Embedded systems in safety-critical environments are often subject to strict reliability requirements. This holds particularly true for modern fail-operational systems. In order to deliver a guaranteed minimum functionality at all times, these systems are often based on expensive fault tolerance mechanisms. In this work, we consider fail-operational systems with inherent redundancy. This property describes the presence of multiple hardware components, each of which is underutilized to a certain degree and thus able to serve as a fallback for one of the other components. We propose an off-chip fault tolerance mechanism for a pair of inherently redundant execution units that requires no further replication of these expensive resources. The key component of this concept is a lightweight proxy unit that handles faults of one execution unit by dynamically migrating the safety-critical portion of its functionality to its redundant counterpart. We present a prototypical implementation of this concept and evaluate the fault handling time of the resulting system experimentally. The results show that for an exemplary, processor-based control system with 256 bits of internal state, a cycle time of four milliseconds, and 64 bits of payload data that are read from or written to attached devices per cycle, the presented implementation is able to detect the failure of a unit, activate a fallback functionality on the complementary unit, and restore the internal state variables within five milliseconds.
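The migration idea described in the abstract can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how the proxy detects a failure or transfers state, so the missed-cycle timeout, the per-cycle checkpoint, the toy XOR "control law", and all names (`ExecutionUnit`, `ProxyUnit`, `brake_ctrl`, `comfort`) are hypothetical. Only the sizes (256 bits of state, 64 bits of payload, 4 ms cycle) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CYCLE_MS = 4        # cycle time stated in the abstract
STATE_BYTES = 32    # 256 bits of internal state
PAYLOAD_BYTES = 8   # 64 bits of payload data per cycle


@dataclass
class ExecutionUnit:
    """Hypothetical execution unit hosting one or more named tasks."""
    name: str
    alive: bool = True
    tasks: Dict[str, bytes] = field(default_factory=dict)

    def step(self, task: str, payload: bytes) -> Optional[bytes]:
        """Run one cycle of `task`; return None if the unit failed or lacks the task."""
        if not self.alive or task not in self.tasks:
            return None
        state = self.tasks[task]
        # Toy control law (assumption): XOR the payload into the head of the state.
        head = bytes(s ^ p for s, p in zip(state, payload))
        self.tasks[task] = head + state[len(payload):]
        return self.tasks[task]


class ProxyUnit:
    """Hypothetical proxy between the attached devices and a pair of units."""

    def __init__(self, task: str, primary: ExecutionUnit,
                 partner: ExecutionUnit, timeout_cycles: int = 1):
        self.task = task
        self.active = primary
        self.standby: Optional[ExecutionUnit] = partner
        self.timeout_cycles = timeout_cycles
        self.missed = 0
        # Last known good state, refreshed after every successful cycle.
        self.checkpoint = primary.tasks[task]

    def cycle(self, payload: bytes) -> Optional[bytes]:
        result = self.active.step(self.task, payload)
        if result is not None:
            self.missed = 0
            self.checkpoint = result
            return result
        # No response this cycle: count it and migrate once the timeout is hit.
        self.missed += 1
        if self.missed >= self.timeout_cycles and self.standby is not None:
            self.standby.tasks[self.task] = self.checkpoint  # restore state
            self.active, self.standby = self.standby, None   # activate fallback
            self.missed = 0
        return None


if __name__ == "__main__":
    a = ExecutionUnit("A", tasks={"brake_ctrl": bytes(STATE_BYTES)})
    b = ExecutionUnit("B", tasks={"comfort": bytes(STATE_BYTES)})
    proxy = ProxyUnit("brake_ctrl", a, b)
    proxy.cycle(bytes([3] * PAYLOAD_BYTES))
    a.alive = False                              # inject a fault into unit A
    proxy.cycle(bytes([1] * PAYLOAD_BYTES))      # missed cycle triggers migration
    print(proxy.active.name, proxy.cycle(bytes([1] * PAYLOAD_BYTES)).hex())
```

In this sketch the payload of the cycle in which the fault occurs is lost, and the task resumes on the partner unit from the last checkpoint. Whether the real system behaves this way is not stated in the abstract; the reported five-millisecond figure is a measurement of the authors' prototype and is not reproduced by this simulation.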
