{"title":"A multiple copy approach for delivering messages under deadline constraints","authors":"P. Ramathan, K. Shin","doi":"10.1109/FTCS.1991.146677","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146677","url":null,"abstract":"A scheme to minimize the expected recovery cost incurred by a distributed real-time system as a result of messages failing to meet their deadline is proposed. The scheme is intended for distributed systems with point-to-point interconnection topology. The goal of minimizing the expected cost is achieved by sending multiple copies of a message through disjoint routes, thus increasing the probability of successful message delivery within the deadline. The number of copies of each message to be sent is determined by optimizing the tradeoff between the increase in the message traffic due to additional copies and the decrease in the probability of a message missing its deadline. The objective used to determine the optimal number of copies is formalized, and a numerical example is presented, showing that reductions of more than 70% can be achieved at low to moderate loads. At high loads the reductions are in the range of 10-40%.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115281012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the reconfiguration of memory arrays containing clustered faults","authors":"D. Blough","doi":"10.1109/FTCS.1991.146699","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146699","url":null,"abstract":"Reconfiguration of memory arrays using spare rows and spare columns, a useful technique for yield enhancement of memories, is considered under a compound probabilistic model that shows clustering of faults. It is shown that the total number of faulty cells that can be tolerated when clustering occurs is larger than when faults are independent. It is shown that an optimal solution to the reconfiguration problem can be found in polynomial time for a special case of the clustering model. Efficient approximation algorithms are given for the situation in which faults appear in clusters only and the situation in which faults occur both in clusters and singly. It is shown through simulation that the computation time required by this algorithm to repair large arrays containing a significant number of clustered faults is small.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127183537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The role of formal methods in the requirements analysis of safety-critical systems: a train set example","authors":"A. Saeed, R. Lemos, T. Anderson","doi":"10.1109/FTCS.1991.146704","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146704","url":null,"abstract":"A general framework for the formal specification and verification of the critical requirements in the development of safety-critical systems is presented. The framework is based on a clear separation of the mission and critical issues during requirements analysis. Analysis of the critical issues is performed in two phases. The first phase identifies those real world properties relevant to the critical requirements: the physical laws or rules of operation, and the system hazards. In the second phase, the interface between the system and its environment is identified, and the behavior required at this interface is specified. The utilization of different formal models, namely, a logical formalism (timed history logic) and a net formalism (predicate-transition nets), respectively, is proposed for the two phases. To illustrate the framework, an example based on a train set crossing is presented.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115087699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-tolerance experiments of the 'Hiten' onboard space computer","authors":"T. Takano, T. Yamada, K. Shutoh, N. Kanekawa","doi":"10.1109/FTCS.1991.146628","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146628","url":null,"abstract":"An interim report on the experimental fault-tolerance verification of the onboard space computer loaded on the artificial satellite Hiten is presented. The Hiten mission and the fault-tolerance technique, stepwise negotiating voting (SNV), on which the computer is based, are described. The intentional fault injection study, field data collection method, and observed faults and resulting computer behavior are also described. During the roughly seven-month study period, fault tolerance masked all the faults that occurred.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121836051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recovery concepts for data sharing systems","authors":"E. Rahm","doi":"10.1109/FTCS.1991.146687","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146687","url":null,"abstract":"The crash and media recovery problems that arise in data sharing systems are addressed. Recovery is complicated by dependencies on other functions such as buffer management and concurrency control. Furthermore, a global log file is to be constructed where the modifications of committed transactions are reflected in chronological order. Logging and recovery protocols that employ the primary copy approach for concurrency/coherence control are proposed for loosely coupled data sharing systems. A comparison with existing data sharing systems shows that the protocols support high performance during normal processing as well as efficient recovery that provides high availability.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"284 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115859677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-tolerant gamma interconnection networks","authors":"N. Tzeng, Po-Jen Chuang","doi":"10.1109/FTCS.1991.146674","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146674","url":null,"abstract":"Modifications to the gamma interconnection network (GIN), which is composed of 3*3 basic building blocks, with interconnecting patterns between stages following the plus-minus-2^i functions, are discussed. The gamma network is modified by altering the interconnecting patterns between stages so as to create totally disjoint paths from any source (S) to any destination (D), ensuring high terminal reliability between every (S, D) pair. The network is called fault-tolerant GIN (FTGIN) since it can tolerate an arbitrary single fault. If several building blocks (i.e. 3*3 switches) are fabricated in one VLSI chip, the layout area and pin count are smaller for the FTGIN than for its GIN counterpart as a result of the change in the interconnection patterns, offering potential cost reduction. A lower bound on the terminal reliability of the FTGIN is derived, showing significant terminal reliability improvement over the conventional gamma network.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133556548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A distributed fault tolerant architecture for nuclear reactor and other critical process control applications","authors":"M. Hecht, J. Agron, H. Hecht, K. Kim","doi":"10.1109/FTCS.1991.146702","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146702","url":null,"abstract":"A distributed fault tolerant system for process control that is based on an enhancement of the distributed recovery block (DRB) is described. Fault tolerance provisions in the system cover software faults by use of the DRB; hardware faults by means of replication and the DRB; system software faults by means of replication, loose coupling, periodic status messages, and a restart capability; and network faults by means of replication and diverse interconnection paths. Maintainability is enhanced through an automated restart capability and logging function resident on a system supervisor node. The system, called the extended distributed recovery block, or EDRB, has been implemented and integrated into a chemical processing system.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130601897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new approach to control flow checking without program modification","authors":"T. Michel, R. Leveugle, G. Saucier","doi":"10.1109/FTCS.1991.146682","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146682","url":null,"abstract":"An approach to concurrent control flow checking that avoids performance and software compatibility problems while preserving a high error coverage and a low detection latency is proposed. The approach is called watchdog direct processing. Extensions of the basic method, taking into account the characteristics of complex processors, are also considered. The architecture of a watchdog processor based on the proposed method is described. Implementation results are reported for a watchdog designed for the Intel 80386sx microprocessor.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115231999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent error detection and fault-tolerance in linear digital state variable systems","authors":"A. Chatterjee, M. d'Abreu","doi":"10.1109/FTCS.1991.146652","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146652","url":null,"abstract":"The problem of error detection and correction (both transient and permanent) in linear digital state variable systems, a very large class of circuits used in digital signal processing and control, is considered. The case of single faulty modules (adders, multipliers, shifters, etc.) is studied, and general circuit data flow graphs (with and without fanout) that realize linear digital state variable systems are analyzed to determine how additional system states might be added to the data flow graph to achieve error detection and correction. It is seen that error detection and correction can be achieved by the addition of a relatively small amount of additional hardware which functions as the checking circuitry. Next, error detection under multiple faulty modules with and without fanout of the module outputs is studied. An analysis tool called the gain matrix is introduced. The problem of fault location and correction of single faults is discussed. Recursive as well as nonrecursive systems can be handled.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126152118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Stratus architecture","authors":"S. Webber, J. Beirne","doi":"10.1109/FTCS.1991.146637","DOIUrl":"https://doi.org/10.1109/FTCS.1991.146637","url":null,"abstract":"An overview is given of the architecture of the Stratus fault-tolerant computer systems, which were the first to use hardware alone to provide fault tolerance in the commercial marketplace. The power subsystem, system boards, and off-board I/O interface buses are examined in some detail. Recovery scenarios and the Stratus service approach are described.","PeriodicalId":300397,"journal":{"name":"[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123102717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}