{"title":"A fault-tolerant protocol for location directory maintenance in mobile networks","authors":"S. Rangarajan, K. Ratnam, A. Dahbura","doi":"10.1109/FTCS.1995.466986","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466986","url":null,"abstract":"In this paper, we present a fault-tolerant protocol for maintaining location directories in mobile networks. The protocol tolerates base station failures and also allows for consistent location information to be maintained about mobile hosts that switch off and arbitrarily reappear in some other part of the network. Further, the protocol tolerates the corruption of a logical time stamp that is part of any protocol where new location information has to be distinguished from old location information when a location directory is updated. We formally show that the protocol maintains consistent location information and does not overwrite new location information with old location information. The protocol can be hierarchically organized to reduce the message overhead incurred by location directory updates.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131159968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring robustness of a fault tolerant aerospace system","authors":"Christopher P. Dingman, Joe Marshall, D. Siewiorek","doi":"10.1109/FTCS.1995.466945","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466945","url":null,"abstract":"In commercial literature, the meaning of the term fault tolerant has become vague. We describe a system used to measure the robustness of a fault tolerant aerospace system developed at IBM, present the data collected during the project, and report conclusions and areas for future work.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115330395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault tolerance in safety critical automotive applications: cost of agreement as a limiting factor","authors":"S. Poledna","doi":"10.1109/FTCS.1995.466996","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466996","url":null,"abstract":"The high availability and safety requirements for automotive electronics are currently almost exclusively addressed by application specific engineering solutions to fault tolerance rather than by systematic approaches. Currently, systematic approaches are ruled out because of cost. The reason for this is that a systematic approach to fault tolerance requires: replication of components; and communication between replicated components to achieve agreement despite nondeterminism. While replicated components become more and more available with the connection of different control units by means of a multiplex bus, it is shown that the cost of agreement on sensor inputs will become the limiting factor for systematic approaches to fault tolerance. For that reason a new agreement algorithm is introduced which considers the problem of agreement and sensor inputs in an integrated fashion. This algorithm takes advantage of the a priori knowledge on the maximum deviation of replicated sensor inputs. Optimality of this algorithm is shown with respect to the minimum number of bits for agreement. This algorithm allows broader application of systematic fault tolerance to automotive applications. The result of this work will be used for a prototype implementation of a safety critical automotive application.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123379617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining software-implemented and simulation-based fault injection into a single fault injection method","authors":"Jens Güthoff, V. Sieh","doi":"10.1109/FTCS.1995.466978","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466978","url":null,"abstract":"Fault/error injection has emerged as a valuable means for evaluating the dependability of a system. In particular, software-based techniques (which can be described as software-implemented and simulation-based techniques) have become very popular because of the relative simplicity of injecting faults. After discussing the advantages and drawbacks of these techniques, two approaches are introduced which try to overcome crucial problems when using software-based fault injection techniques. The first one improves the accuracy of software-implemented fault injection experiments. The second one offers detailed insights into the system dynamics in the presence of faults. With this knowledge, the number of fault injections (a major concern in simulation-based fault injection) can be significantly reduced. These approaches can be joined together, offering accuracy of fault injection results as well as transparency of the system dynamics in the presence of faults. A case study is shown in which the de facto dependability properties of a standard component, a Motorola MC88100 RISC processor, are evaluated.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129603888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive consistency algorithms based on voting and error-correcting codes","authors":"T. Krol","doi":"10.1109/FTCS.1995.466994","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466994","url":null,"abstract":"This paper presents a new class of synchronous deterministic non authenticated algorithms for reaching interactive consistency (Byzantine agreement). The algorithms are based on voting and error correcting codes and require considerably less data communication than the original algorithm, whereas the number of rounds and the number of modules meet the minimum bounds. These algorithms based on voting and coding are defined and proved on the basis of a class of algorithms, called the dispersed joined communication algorithms.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129076185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Error detection and handling in a superscalar, speculative out-of-order execution processor system","authors":"N. Saxena, Chien Chen, R. Swami, H. Osone, Shalesh Thusoo, D. Lyon, D. Chang, Anand Dharmaraj, N. Patkar, Yizhi Lu, Ben-Hau Chia","doi":"10.1109/FTCS.1995.466952","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466952","url":null,"abstract":"The HaL SPARC64 Processor, the first 64-bit SPARC-V9 architecture implementation, uses several techniques to ensure a high degree of system reliability, error detection, and error recovery. The CPU of the multi-chip module processor has a superscalar, speculative issue unit, and an out-of-order execution datapath. These two processor components complicate the maintenance of precise state in the event of errors. By exploiting the SPARC-V9 architectural features, and the micro-architecture for speculative execution, SPARC64 maintains precise state in the event of exceptions and errors, logs and reports errors, and facilitates error detection during full system bring-up. The paper presents details of error detection and handling in the CPU, the cache system, and the Memory Management Unit (MMU). The HaL R1 system also implements a fault-secure memory system design. The memory system corrects all single-bit errors, detects double-bit errors, detects single address line failures, and detects all single dynamic RAM (DRAM) chip failures. Certain debug features have been added to the system that are useful during system bring-up.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130307156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards totally self-checking delay-insensitive systems","authors":"S. Piestrak, T. Nanya","doi":"10.1109/FTCS.1995.466975","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466975","url":null,"abstract":"Considers designing quasi-delay-insensitive (QDI) combinational circuits (CCs), a class of self-timed (asynchronous) circuits. The necessity of coding both inputs and outputs of any QDI CC by using unordered codes naturally leads to inverter-free realization. The analysis of behavior of a QDI CC with input errors leads to the observation that it is impossible to avoid the so-called late detection problem. The new set of correct definitions of the code-disjoint QDI CC and of the totally self-checking (TSC) QDI CC is introduced. The detailed analysis of the behavior of a faulty QDI system with internal permanent faults shows that: (1) late detection, (2) the possibility of occurrence of invalid transitions, and (3) premature completion, seem to be the inherent properties of any QDI CC, which preclude its fault-secure (hence TSC) implementation for some single stuck-at faults. The first ever self-testing code-disjoint completion checker is proposed. Finally, an extensive study of designing self-testing code-disjoint QDI CCs is presented.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"12 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120848332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal recovery point insertion for high-level synthesis of recoverable microarchitectures","authors":"D. Blough, F. Kurdahi, S. Ohm","doi":"10.1109/FTCS.1995.466979","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466979","url":null,"abstract":"The paper considers the problem of automatic insertion of recovery points in recoverable microarchitectures. Previous work on this problem provided heuristic algorithms that attempted either to minimize computation time with a bounded hardware overhead or to minimize hardware overhead with a bounded computation time. We present efficient algorithms that provide provably optimal solutions for both of these formulations of the problem. These algorithms take as their input a scheduled control-data flow graph describing the behavior of the system and they output either a minimum-time or a minimum-cost set of recovery point locations. We demonstrate the performance of our algorithms using some well-known benchmark control-data flow graphs. Over all parameter values for each of these benchmarks, our optimal algorithms are shown to perform as well as, and in many cases better than, the previously proposed heuristics.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134157141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-tolerance for off-the-shelf applications and hardware","authors":"M. Russinovich, Z. Segall","doi":"10.1109/FTCS.1995.466997","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466997","url":null,"abstract":"The concept of middleware provides a transparent way to augment and change the characteristics of a service provider as seen from a client. Fault tolerant policies are ideal candidates for middleware implementation. We have defined and implemented operating system based middleware support that provides the power and flexibility needed by diverse fault tolerant policies. This mechanism, called the sentry, has been built into the UNIX 4.3 BSD operating system server running on a Mach 3.0 kernel. To demonstrate the effectiveness of the mechanism several policies have been implemented using sentries including checkpointing and journaling. The implementation shows that complex fault tolerant policies can be efficiently and transparently implemented as middleware. Performance overhead of input journaling is less than 5% and application suspension during the checkpoint is typically under 10 seconds in length. A standard hard disk is used to store journal and checkpoint information with dedicated storage requirements of less than 20 MB.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"232 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128628626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A flexible ServerNet-based fault-tolerant architecture","authors":"W. Baker, R. Horst, D. Sonnier, W. Watson","doi":"10.1109/FTCS.1995.466982","DOIUrl":"https://doi.org/10.1109/FTCS.1995.466982","url":null,"abstract":"The paper introduces a new fault-tolerant architecture that combines the best attributes of the software fault-tolerant Tandem NonStop systems with the hardware fault-tolerant Integrity systems. This architecture is based on the ServerNet System Area Network (SAN). ServerNet, formerly called TNet, is a packetized byte-serial multistage network that supports both I/O and interprocessor traffic in fault-tolerant systems. Dual-ported CPUs and I/O controllers connect to independent subnetworks in a variety of different network topologies. Systems can expand either through shared or distributed memory multiprocessing. A separate maintenance system controls system initialization, online configuration changes, and error reporting. The architecture's flexibility makes it suitable for a wide range of environments with varying requirements for performance, fault tolerance, and software compatibility.","PeriodicalId":309075,"journal":{"name":"Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121805290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}