{"title":"Behavioral synthesis of fault secure controller/datapaths using aliasing probability analysis","authors":"G. Lakshminarayana, A. Raghunathan, N. Jha","doi":"10.1109/FTCS.1996.534618","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534618","url":null,"abstract":"We address the problem of synthesizing fault-secure controller/data path circuits from behavioral specifications. We use an iterative improvement based behavioral synthesis framework that performs module selection, clock selection, scheduling, and resource sharing with the aim of minimizing the area of the synthesized circuit, while allowing multicycling, chaining, and module pipelining. We present a dynamic comparison selection algorithm that can be used during behavioral synthesis to determine which intermediate results in the computation need to be secured in order to enable maximal resource sharing. Previous work on synthesizing fault-secure data paths has focused on ensuring that aliasing cannot occur in any part of the design. We demonstrate that such an approach can lead to unnecessarily large overheads. In order to alleviate the overheads incurred for fault security, our behavioral synthesis framework uses aliasing probability analysis (ALPS) in order to identify resource sharing configurations that reduce area, while introducing a very low probability of aliasing (of the order of 10^-10 for a bitwidth of 32) in the resultant data path. 
We report experimental results for several behavioral descriptions that demonstrate the efficacy of our techniques in synthesizing fault-secure controller/datapaths with low overheads.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"342 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115674450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Testing of fault-tolerant and real-time distributed systems via protocol fault injection","authors":"S. Dawson, F. Jahanian, T. Mitton, T. Tung","doi":"10.1109/FTCS.1996.534626","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534626","url":null,"abstract":"As software for distributed systems becomes more complex, ensuring that a system meets its prescribed specification is a growing challenge that confronts software developers. This is particularly important for distributed applications with strict dependability and timeliness constraints. This paper reports on ORCHESTRA, a portable fault injection environment for testing implementations of distributed protocols. This tool is based on a simple yet powerful framework called script-driven probing and fault injection, for the evaluation and validation of the fault-tolerance and timing characteristics of distributed protocols. The tool, which was initially developed on the Real-Time Mach operating system and later ported to other platforms including Solaris and SunOS, has been used to conduct extensive experiments on several protocol implementations. This paper describes the design and implementation of the fault injection tool focusing on architectural features to support portability, minimizing intrusiveness on target protocols, and explicit support for testing real-time systems. 
The paper also describes the experimental evaluation of two protocol implementations: a real-time audio-conferencing application on Real-Time Mach, and a distributed group membership service on the Sun Solaris operating system.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"4 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116938000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Executable assertions and timed traces for on-line software error detection","authors":"C. Rabéjac, J. Blanquart, J. Queille","doi":"10.1109/FTCS.1996.534602","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534602","url":null,"abstract":"The topic of this paper is the detection of errors due to residual faults in software, particularly those with temporary effects. After positioning our approach amongst existing fault tolerance and detection techniques, we propose detection mechanisms for such errors. These mechanisms are designed to detect both data and control flow errors. They can be validated by both formal and fault-injection techniques. In particular, we propose a timed trace technique allowing one to specify the expected software behavior and to instantiate from this specification a generic control-flow checking automaton. The critical algorithms of this automaton are formally proved. To develop these mechanisms, we also propose a design and validation method based on a monitoring specification. Finally, we apply these techniques on two cases of embedded real-time software in order not only to validate them but also to estimate their efficiency and applicability.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121151321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Highly available directory services in DCE","authors":"B. Acevedo, L. Bahler, E. Elnozahy, V. Ratan, M. Segal","doi":"10.1109/FTCS.1996.534624","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534624","url":null,"abstract":"The DCE standard includes specifications for the Directory Service, a component that performs typical naming services in distributed computing environments. We list some deficiencies in these specifications that affect the naming service availability and correctness, and suggest possible solutions. We then describe an enhancement of an implementation of the Directory Service that adds support for partial replication of the name space, continuous operation of the service, and automatic failover. Our extensions ensure the consistency of the name space data, and are transparent to application developers and end users, all without a significant performance penalty.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125879320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault diagnosis using state information","authors":"V. Boppana, I. Hartanto, W. Fuchs","doi":"10.1109/FTCS.1996.534598","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534598","url":null,"abstract":"Repeated fault diagnosis on large integrated circuits may often be computationally prohibitive due to expensive fault simulation requirements. Fault dictionaries can help alleviate this problem, but they may be infeasible to store because of their large sizes, and more importantly, they typically provide only a black box view of the circuit and hence almost no diagnostic flexibility. The problem occurs because dictionaries usually only store primary output information. A new approach to fault diagnosis based on state information is presented. The selective storage of state information is shown to significantly improve the time for diagnostic fault simulation. We also describe a method to reduce the amount of information stored by choosing only a subset of the state space. This approach is shown to be ideally suited for partial scan circuits whose simple structure is exploited to reduce storage requirements. Experiments on the ISCAS 89 benchmark circuits are performed to demonstrate the efficiency of the state information based diagnosis technique.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122206483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling the dependability of CAUTRA, a subset of the French air traffic control system","authors":"K. Kanoun, Marie Ortalo-Borrel, Thierry Morteveille, A. Peytavin","doi":"10.1109/FTCS.1996.534599","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534599","url":null,"abstract":"The aim of our work is to provide a quantified means helping in the definition of a new architecture for CAUTRA, a subset of the French Air Traffic Control system. To do this, we define alternative architectures for the CAUTRA whose availability is compared in order to select the architecture with the highest level of availability. Modeling is carried out following a modular and systematic approach, based on the derivation of block models at a high level of abstraction. In a second step, the blocks are replaced by their equivalent Generalized Stochastic Petri Nets to build up the detailed model of the architecture. Emphasis is placed on modeling interactions between hardware and software components.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129725847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigating operator-induced unavailability by matching imprecise queries","authors":"R. Maxion, Philip A. Syme","doi":"10.1109/FTCS.1996.535879","DOIUrl":"https://doi.org/10.1109/FTCS.1996.535879","url":null,"abstract":"In addition to equipment faults, human error is now recognized as a major cause of computer system unavailability. This paper considers one aspect of human error in critical situations: the inability of operators to retrieve and understand documentation needed for system diagnosis and repair. When technical information vital to recovery is missing, difficult to locate or inaccessible, downtime is lengthened, costs rise, and productivity falls. Finding the right information at the right time is complicated by the ambiguities of natural-language queries when seeking documentation or maintenance information. While the human information processor has the means for resolving ambiguities in language, computers do not. Hence, a key issue in downtime problem resolution is imprecision in human vocabulary. The vocabulary problem can be addressed through statistical mapping of user queries into databases of frequently-asked questions. This technique has been validated empirically, and shown to be effective in achieving correct mappings in 99% of cases tested; it is substantially better than keyword mapping, especially as syntactic and lexical differences grow. 
When information seeking is accelerated by this technique, downtime can be reduced.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123867798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating quorum systems over the Internet","authors":"Y. Amir, A. Wool","doi":"10.1109/FTCS.1996.534591","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534591","url":null,"abstract":"Quorum systems serve as a basic tool providing a uniform and reliable way to achieve coordination in a distributed system. They are useful for distributed and replicated databases, name servers, mutual exclusion, and distributed access control and signatures. Traditionally, two basic methods have been used to evaluate quorum systems: the analytical approach, and simulation. We propose a third, empirical approach. We collected 6 months' worth of connectivity and operability data of a system consisting of 14 real computers using a wide area group communication protocol. The system spanned two geographic sites and three different Internet segments. We developed a mechanism that merges the local views into a unified history of the events that took place, ordered according to an imaginary global clock. We then developed a tool called the Generic Quorum-system Evaluator (GQE), which evaluates the behavior of any given quorum system over the unified, real-life history. We compared fourteen dynamic and static quorum systems. We discovered that as predicted, dynamic quorum systems behave better than static systems. 
However, we found that many assumptions taken by the traditional approaches are unjustified: crashes are strongly correlated, network partitions do occur even within a single Internet segment, and we even detected a brief simultaneous crash of all the participating computers.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123284102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Random pattern testing for sequential circuits revisited","authors":"L. Nachman, K. Saluja, S. Upadhyaya, R. Reuse","doi":"10.1109/FTCS.1996.534593","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534593","url":null,"abstract":"Random pattern testing methods are known to result in poor fault coverage for most sequential circuits unless costly circuit modification methods are employed. We propose a novel approach to improve the random pattern testability of sequential circuits. We introduce the concept of holding signals at primary inputs and scan flip-flops for a certain length of time instead of applying a new random vector at each clock cycle. When a random vector is held at the primary inputs of the circuit under test or at the scan flip-flops, the system clock is applied and the primary outputs of the circuit are observed. The number of clock cycles, k, for which each random input is held at a fixed value before applying the next random vector, is determined by using testability analysis or a test pattern generator for a very small number of lines or faults in the circuit. The lines or faults that are analyzed are the primary inputs to flip-flops. The information obtained from the testability analysis or test generator is used to determine the number k of clock cycles for which each random vector is to be held constant without changing the signal values. The algorithm consists of simulating a sequential circuit systematically, possibly with partial scan, in conjunction with the hold method. 
The method is low cost and the results of our experiment on the benchmark circuits show that it is very effective in providing fault coverage close to the maximum obtainable fault coverage using random patterns with full scan.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121095126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of checkpoint mechanisms for massively parallel machines","authors":"T. Chiueh, Peitao Deng","doi":"10.1109/FTCS.1996.534622","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534622","url":null,"abstract":"Massively parallel machines typically contain thousands of processor units and therefore are more likely to suffer system breakdown because of component failures. This paper studies efficient diskless checkpointing mechanisms for SIMD massively parallel machines. Three checkpointing schemes: mirror checkpointing, parity checkpointing, and partial parity checkpointing are compared in terms of their checkpoint performance and storage overheads, based on empirical measurements. Mirror checkpointing and parity checkpointing schemes have been successfully implemented and tested on a DECmpp 12000 machine, without hardware or OS modifications. It has been shown that mirror checkpointing is an order of magnitude faster than parity checkpointing, but takes twice as much storage overhead. Partial parity checkpointing, although it significantly reduces the storage overhead, could lead to unpredictable execution performance. This paper also examines the detailed storage/performance tradeoffs for partial parity checkpointing through manual instrumentation, and describes the implementation experience from these experiments.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123857349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}