{"title":"Design fault tolerance in operating systems based on a standardization project","authors":"Akio Watanabe, K. Sakamura","doi":"10.1109/FTCS.1995.466962","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466962","url":null,"abstract":"We are exploring an MLDD (Multi-Layered Design Diversity) architecture that applies natural design diversity to an application program layer, an operating system layer, and a hardware layer based on the TRON standardization project. We have devised a backward error recovery mechanism for the operating system layer, and to implement it, we have developed a mechanism that automatically exchanges diverse operating system implementations. The paper presents an error-check generation method for the operating system layer. In this method, which is called SBACCG (Specification-Based Adaptive Consistency Checks Generation), one set of consistency checks is derived from a formal specification, and the checks are adapted to each implementation. We experimentally evaluated the effectiveness of our backward error recovery mechanism that uses the error checks generated through SBACCG.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115706006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Process allocation for load distribution in fault-tolerant multicomputers","authors":"Jong Kim, Heejo Lee, Sunggu Lee","doi":"10.1109/FTCS.1995.466985","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466985","url":null,"abstract":"In this paper, we consider a load-balancing process allocation method for fault-tolerant multicomputer systems that balances the load before as well as after faults start to degrade the performance of the system. In order to be able to tolerate a single fault, each process (primary process) is duplicated (i.e. has a backup process). The backup process executes on a different processor from the primary, checkpointing the primary process and recovering the process if the primary process fails due to the occurrence of a fault. In this paper, we first formalize the problem of load-balancing process allocation and show that it is an NP-hard problem. Next, we propose a new heuristic process allocation method and analyze the performance of the proposed allocation method. Simulations are used to compare the proposed method with a process allocation method that does not take into account the different load characteristics of the primary and backup processes. While both methods perform well before the occurrence of a fault in a primary process, only the proposed method maintains a balanced load after the occurrence of such a fault.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121389380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design verification of a super-scalar RISC processor","authors":"Babu Turumella, Aiman Kabakibo, Manjunath Bogadi, Karakunakara Menon, Shaleah Thusoo, Long Nguyen, N. Saxena, Michael Chow","doi":"10.1109/FTCS.1995.466951","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466951","url":null,"abstract":"The paper provides an overview of the design verification methodology for HaL's Sparc64 processor development. This activity covered approximately two and a half years of design development time. Objectives and challenges are discussed and the verification methodology is described. Monitoring mechanisms that give high observability to internal design states, novel features that increase the simulation speed, and tools for automatic result checking are described. Also presented for the first time, is an analysis of the design defects discovered during the verification process. Such an analysis is useful in augmenting verification programs to target common design defects.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"24 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113964849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Completely asynchronous optimistic recovery with minimal rollbacks","authors":"Sean W. Smith, David B. Johnson, J. D. Tygar","doi":"10.1109/FTCS.1995.466963","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466963","url":null,"abstract":"Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However previous optimistic rollback recovery protocols either have required synchronization during recovery, or have permitted a failure at one process to potentially trigger an exponential number of process rollbacks. We present an optimistic rollback recovery protocol that provides completely asynchronous recovery, while also reducing the number of times a process must roll back in response to a failure to at most one. This protocol is based on comparing timestamp vectors across multiple levels of partial order time.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130212682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of software dependability based on stability test data","authors":"D. Tang, M. Hecht","doi":"10.1109/FTCS.1995.466956","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466956","url":null,"abstract":"The paper discusses a measurement-based approach to dependability evaluation of fault-tolerant, real-time software systems based on failure data collected from stability tests of an air traffic control system under development. Several dependability analysis techniques are illustrated with the data: parameter estimation, availability modeling of software from the task level, applications of the parameter estimation and model evaluation in assessing availability, identifying key problem areas, and predicting required test duration for achieving desired levels of availability and quantification of relationships between software size, the number of faults, and failure rate for a software unit. Although most discussion is focused on a typical subsystem, Sector Suite, the discussed methodology is applicable to other subsystems and the system. The study demonstrates a promising approach to measuring and assessing software availability during the development phase, which has been increasingly demanded by the project management of developing large, critical systems.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114303555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implicit signature checking","authors":"J. Ohlsson, M. Rimén","doi":"10.1109/FTCS.1995.466976","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466976","url":null,"abstract":"Proposes a control flow checking method that assigns unique initial signatures to each basic block in a program by using the block's start address. Using this strategy, implicit signature checking points are obtained at the beginning of each basic block, which results in a short error detection latency (2-5 instructions). Justifying signatures are embedded at each branch instruction, and a watchdog timer is used to detect the absence of a signature checking point. The method does not require the building of a program flow graph and it handles jumps to destinations that are not fixed at compile/link-time, e.g. subroutine calls using function pointers in the C language. This paper includes a generalized description of the control flow checking method, as well as a description and evaluation of an implementation of the method.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116004082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why optimistic message logging has not been used in telecommunications systems","authors":"Yennun Huang, Yi-Min Wang","doi":"10.1109/FTCS.1995.466953","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466953","url":null,"abstract":"Much of the literature on message logging and checkpointing in the past decade has been based on a so-called optimistic approach that places more emphasis on failure-free overhead than recovery efficiency. Our experience has shown that most telecommunications systems use a pessimistic approach because the main purpose of using message logging and checkpointing is to achieve fast and localized recovery, and the failure-free overhead of a pessimistic approach can often be made reasonably low by exploiting application-specific information.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116812608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARMOR: analyzer for reducing module operational risk","authors":"Michael R. Lyu, Jinsong S. Yu, E. Keramidas, S. Dalal","doi":"10.1109/FTCS.1995.466989","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466989","url":null,"abstract":"ARMOR (Analyzer for Reducing Module Operational Risk) is a software risk analysis tool which automatically identifies the operational risks of software program modules. ARMOR takes data directly from project database, failure database, and program development database, establishes risk models according to several risk analysis schemes, determines the risks of software programs, and displays various statistical quantities for project management and engineering decisions. Its enhanced user interface greatly simplifies the risk modeling procedures and the usage learning time. The tool can perform the following tasks during project development, testing, and operation: establish promising risk models for the project under evaluation; measure the risks of software programs within the project; identify the source of risks and indicate how to improve software programs to reduce their risk levels; and determine the validity of risk models from field data.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114292121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity","authors":"H. Kantz, C. Koza","doi":"10.1109/FTCS.1995.466954","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466954","url":null,"abstract":"Since the beginning of the century, Alcatel Austria has been the main supplier of railway signalling products in Austria. In 1985, Alcatel Austria began developing the electronic interlocking system ELEKTRA. In order to meet the stringent safety requirements for railway interlocking applications, a two channel system based on design diversity has been developed. High availability and reliability are achieved by using actively triplicated redundancy with on-line recovery. In 1989, the first system was put into operation. About 15 railway interlocking systems are in operation and further installations are ongoing. The paper presents the fault tolerance mechanisms used for design faults as well as physical faults. The experience gained with these concepts is also discussed.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127552659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gracefully degrading systems using the bulk-synchronous parallel model with randomised shared memory","authors":"Andreas G. Savva, T. Nanya","doi":"10.1109/FTCS.1995.466969","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466969","url":null,"abstract":"The bulk-synchronous parallel model (BSPM) was proposed as a bridging model for parallel computation by Valiant (1990). By using randomised shared memory (RSM), this model offers an asymptotically optimal emulation of the PRAM. By using the BSPM with RSM, we show how a gracefully degrading massively parallel system can be obtained through: memory duplication to ensure global memory integrity, and to speed up the reconfiguration; a global reconfiguration method that restores the logical properties of the system, after a fault occurs. We assume fail-stop processors, single faults, no spare processors, and no significant loss of network throughput as a result of faults. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. The overhead of the scheme and the graceful degradation achieved depend on the program being executed. We evaluate the reconfiguration, overhead, and graceful degradation of the system experimentally.<<ETX>>","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132711519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}