A fault tolerance infrastructure for dependable computing with high-performance COTS components
Author: A. Avizienis
Published in: Proceeding International Conference on Dependable Systems and Networks. DSN 2000
Publication date: 2000-06-25
DOI: 10.1109/ICDSN.2000.857581 (https://doi.org/10.1109/ICDSN.2000.857581)
Citation count: 22
Abstract
The failure rates of current COTS processors have dropped to 100 FITs (failures per 10^9 hours), indicating a potential MTTF of over 1100 years. However, our recent study of Intel P6 family processors has shown that they have very limited error detection and recovery capabilities and contain numerous design faults ("errata"). Other limitations are susceptibility to transient faults and uncertainty about "wearout" that could increase the failure rate over time. Because of these limitations, an external fault tolerance infrastructure is needed to assure the dependability of a system with such COTS components. The paper describes a fault-tolerant "infrastructure" system of fault tolerance functions that makes possible the use of low-coverage COTS processors in a fault-tolerant, self-repairing system. The custom hardware supports transient recovery, design fault tolerance, and self-repair by sparing and replacement. Fault tolerance functions are implemented by four types of hardware processors of low complexity that are themselves fault-tolerant. High error detection coverage, including coverage of design faults, is attained by diversity and replication.
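
The MTTF figure quoted in the abstract follows from the definition of a FIT. The short sketch below reproduces that arithmetic; it assumes a constant failure rate (so MTTF is the reciprocal of the rate) and a year of 8766 hours. The function name `mttf_years` and the constant `HOURS_PER_YEAR` are illustrative and do not come from the paper.

```python
# Assumption: one year = 365.25 days = 8766 hours.
HOURS_PER_YEAR = 24 * 365.25


def mttf_years(fits: float) -> float:
    """Convert a failure rate in FITs (failures per 1e9 hours) to MTTF in years.

    Assumes a constant failure rate, so MTTF = 1 / rate.
    """
    if fits <= 0:
        raise ValueError("FIT rate must be positive")
    mttf_hours = 1e9 / fits  # 100 FITs -> 1e7 hours
    return mttf_hours / HOURS_PER_YEAR


# 100 FITs gives 1e7 hours, which is roughly 1141 years.
print(round(mttf_years(100)))
```

Under these assumptions, 100 FITs corresponds to about 1141 years, consistent with the "over 1100 years" stated above. The estimate covers only the permanent failure rate; it does not account for the transient faults, design faults, or wearout that the abstract identifies as the reasons an external infrastructure is needed.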