{"title":"Design and evaluation of fault-tolerant shared file system for cluster systems","authors":"S. Sumimoto","doi":"10.1109/FTCS.1996.534596","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534596","url":null,"abstract":"The paper describes the design and evaluation of a Fault Tolerant Shared File System (FTSFS) architecture for cluster systems with shared disks. The FTSFS architecture: guarantees no file system (FS) structure crashes on processor/program failure; can be applied to any existing nonshared FS without changing the structure of the FS; and does not degrade performance on the shared FS compared with a standard nonshared FS. Using the FTSFS architecture, we implemented a fault tolerant shared FS on Fujitsu's SVR4 duplex system, and evaluated the system performance. The evaluation showed that the shared FS is competitive in performance with the standard SVR4-UFS (Unix File System).","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122606025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recoverable mobile environment: design and trade-off analysis","authors":"D. Pradhan, P. Krishna, N. Vaidya","doi":"10.1109/FTCS.1996.534590","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534590","url":null,"abstract":"The mobile wireless environment poses challenging problems in designing fault-tolerant systems because of the dynamics of mobility and the limited bandwidth available on wireless links. Traditional fault-tolerance schemes, therefore, cannot be directly applied to these systems. Mobile systems are often subject to environmental conditions which can cause loss of communications or data. Because of the consumer orientation of most mobile systems, run-time faults must be corrected with minimal (if any) intervention from the user. The fault-tolerance capability must, therefore, be transparent to the user. The paper presents recovery schemes for the failure of a mobile host. It portrays the limitations of the mobile wireless environment, and their impact on recovery protocols. Adaptations of well-known recovery schemes that suit the mobile environment are presented. The performance of these schemes has been analyzed to determine those environments where a particular recovery scheme is best suited. The performance of the recovery schemes primarily depends on: the wireless bandwidth; the communication-mobility ratio of the user; and the failure rate of the mobile host.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124054119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliable broadcasting in product networks with Byzantine faults","authors":"Feng Bao, Y. Igarashi","doi":"10.1109/FTCS.1996.534612","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534612","url":null,"abstract":"The reliability of broadcasting in product networks is discussed. We assume that a network may contain faulty nodes and/or links of Byzantine type and that no nodes know any information about faults in advance. If there are n independent spanning trees rooted at the same node of a network, the network is called an n-channel graph. We first show a construction of n independent spanning trees rooted at the same node of a product network consisting of n component graphs. Then we design a broadcasting scheme in the product network so that messages are sent along the n independent spanning trees. This broadcasting scheme can tolerate up to [(n-1)/2] faults of Byzantine type even in the worst case. Broadcasting by the scheme is successful with a probability higher than 1-k^{-[n/2]} in any product network of order N consisting of n component graphs of order b or less if at most N/4b^{3}nk faulty nodes are randomly distributed in the network. Furthermore we show how to construct n_1+n_2 independent spanning trees in a product network of two graphs such that the one component graph is an n_1-channel graph and the other component graph is an n_2-channel graph. These independent spanning trees can also be used as efficient and reliable message channels for broadcasting in the product network.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"233 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123102673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The redundancy mechanisms of the Ariane 5 Operational Control Center","authors":"J. Dega","doi":"10.1109/FTCS.1996.534623","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534623","url":null,"abstract":"The Operational Control Center represents the largest component in the Ariane 5 ground segment. It handles all interface management between the Ariane launcher and ground facilities during launch preparation phases. It ensures information exchange between on-board equipment and the ground, and controls the launch count-down. The control center is a real-time system distributed on four sites and linked with an optical fiber network. For safety and availability reasons, redundancy has been applied to most of the control center's subsystems: front-end equipment, processing units, networks. The design and development of the Ariane 5 control center was a challenge for several reasons: the safety and operational constraints; the compatibility with the test benches without redundancy used at earlier stages of the development cycle; the fully distributed architecture; the need for online repair and re-insertion of failed redundant units without interrupting the launch countdown.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128689869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple fault diagnosis in sequential circuits using sensitizing sequence pairs","authors":"N. Yanagida, Hiroshi Takahashi, Y. Takamatsu","doi":"10.1109/FTCS.1996.534597","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534597","url":null,"abstract":"The paper presents an approach to multiple fault diagnosis in sequential circuits by using input sequence pairs having sensitizing input pairs. This represents an extension of our previous work dealing with combinational circuits (N. Yanagida et al., 1995). After reviewing our previous method, we introduce an input sequence pair having sensitizing input pairs to diagnose multiple faults in a sequential circuit partitioned into subcircuits. We call such an input sequence pair, the sensitizing sequence pair. Next, we extend the use of the previous method for combinational circuits to sequential circuits. From a relation between a sensitizing path generated by a sensitizing sequence pair and a subcircuit, the proposed method deduces the suspected faults for the subcircuits, one by one, based on the responses observed at primary outputs without probing any internal line. The paper provides the first experimental reports on diagnostic results of the ISCAS circuits by using our diagnostic method for sequential circuits, without probing any internal line, any fault simulation, or fault enumeration.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125465810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experimental evaluation of the fail-silent behaviour in programs with consistency checks","authors":"M. Z. Rela, H. Madeira, J. G. Silva","doi":"10.1109/FTCS.1996.534625","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534625","url":null,"abstract":"An important research topic deals with the investigation of whether a non-duplicated computer can be made fail-silent, since that behaviour is a priori assumed in many algorithms. However, previous research has shown that in systems using a simple behaviour-based error detection mechanism invisible to the programmer (e.g. memory protection), the percentage of fail-silent violations could be higher than 10%. Since the study of these errors has shown that they were mostly caused by pure data errors, we evaluate the effectiveness of software techniques capable of checking the semantics of the data, such as assertions, to detect these remaining errors. The results of injecting physical pin-level faults show that these tests can prevent about 40% of the fail-silent model violations that escape the simple hardware-based error detection techniques. In order to decouple the intrinsic limitations of the tests used from other factors that might affect their error detection capabilities, we evaluated a special class of software checks known for its high theoretical coverage: algorithm based fault tolerance (ABFT). The analysis of the remaining errors showed that most of them remained undetected due to short range control flow errors. When very simple software-based control flow checking was combined with the semantic tests, the target system, without any dedicated error detection hardware, behaved according to the fail-silent model for about 98% of all the faults injected.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127210175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fault simulation method for crosstalk faults in synchronous sequential circuits","authors":"N. Itazaki, Yasutaka Idomoto, K. Kinoshita","doi":"10.1109/FTCS.1996.534592","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534592","url":null,"abstract":"With the scaling down of VLSI size and the reducing switching time of logic gates, crosstalk faults become an important problem for testing. If a crosstalk pulse is excited by internal noise sources, the crosstalk pulse tends to be considered as harmless for synchronous sequential circuits, because generated crosstalk pulses on data lines can be eliminated by a clocking. However the crosstalk pulse generated on clock lines or reset lines can lead the circuit to erroneous operations. We analyze the crosstalk fault scheme, and contrive a fault simulator based on the scheme, in order to estimate the effect for the crosstalk fault. We consider the crosstalk fault as unexpected strong capacitive coupling between one data line and clock lines. Since we have to consider timing in addition to a logic value, a unit delay model is used in our fault simulation. Our experiments on some benchmark circuits show that fault activation rates and fault detection rates are widely varied corresponding to circuit characteristics. Up to 80% fault detection rates are obtained from our simulation with test vectors generated at random.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131361053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experimental assessment of parallel systems","authors":"J. G. Silva, J. Carreira, H. Madeira, D. Costa, F. Moreira","doi":"10.1109/FTCS.1996.534627","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534627","url":null,"abstract":"In the research reported in this paper, transient faults were injected in the nodes and in the communication subsystem (by using software fault injection) of a commercial parallel machine running several real applications. The results showed that a significant percentage of faults caused the system to produce wrong results while the application seemed to terminate normally, thus demonstrating that fault tolerance techniques are required in parallel systems, not only to assure that long-running applications can terminate but also (and more important) that the results produced are correct. Of the techniques tested to reduce the percentage of undetected wrong results only ABFT proved to be effective. For other simple error detection methods to be effective, they have to be designed in, and not added as an afterthought. Faults injected in the communication subsystem proved the effectiveness of end-to-end CRCs on the data movements between processors.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122200012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new methodology for calculating distributions of reward accumulated during a finite interval","authors":"M. Qureshi, W. Sanders","doi":"10.1109/FTCS.1996.534600","DOIUrl":"https://doi.org/10.1109/FTCS.1996.534600","url":null,"abstract":"Markov reward models are an important formalism by which to obtain dependability and performability measures of computer systems and networks. In this context, it is particularly important to determine the probability distribution function of the reward accumulated during a finite interval. The interval may correspond to the mission period in a mission-critical system, the time between scheduled maintenances, or a warranty period. In such models, changes in state correspond to changes in system structure (due to faults and repairs), and the reward structure depends on the measure of interest. For example, the reward rates may represent a productivity rate while in that state, if performability is considered, or the binary values zero and one, if interval availability is of interest. We present a new methodology to calculate the distribution of reward accumulated over a finite interval. In particular, we derive recursive expressions for the distribution of reward accumulated given that a particular sequence of state changes occurs during the interval, and we explore paths one at a time. The expressions for conditional accumulated reward are new and are numerically stable. In addition, by exploring paths individually, we avoid the memory growth problems experienced when applying previous approaches to large models. The utility of the methodology is illustrated via application to a realistic fault-tolerant multiprocessor model with over half a million states.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128847265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-efficient and highly-reconfigurable 4- and 2-track fault-tolerant designs for mesh-connected multicomputers","authors":"N. Mahapatra, S. Dutt","doi":"10.1109/FTCS.1996.535880","DOIUrl":"https://doi.org/10.1109/FTCS.1996.535880","url":null,"abstract":"We consider m-track models for constructing fault-tolerant (FT) mesh systems which have one primary and m spare tracks per row and column, switches at the intersection of these tracks, and spare processors at the boundaries. A faulty system is reconfigured by finding for each fault u a reconfiguration path from the fault to a spare in which, starting from the fault u, a processor is replaced or \"covered\" by the nearest \"available\" succeeding processor on the path; a processor on the path is not available if it is faulty or is used as a \"cover\" on some other reconfiguration path. In previous work, a 1-track design that can support any set of node-disjoint straight reconfiguration paths, and a more reliable 3-track design that can support any set of node-disjoint rectilinear reconfiguration paths have been proposed. In this paper, we present: (1) A fundamental result regarding the universality of simple \"one-to-one switches\" in m-track 2-D mesh designs in terms of their reconfigurabilities. (2) A 4-track mesh design that can support any set of edge-disjoint (a much less restrictive criterion than node-disjointness) rectilinear reconfiguration paths, and that has 34% less switching overhead and significantly higher, actually close-to-optimal, reconfigurability compared to the previously proposed 3-track design. (3) A new 2-track design derived from the above 4-track design that we show can support the same set of reconfiguration paths as the previous 3-track design but with 33% less wiring overhead. (4) Results on the deterministic fault tolerance capabilities (the number of faults guaranteed reconfigurable) of our 4- and 2-track designs, and the previously proposed 1- and 3-track designs.","PeriodicalId":191163,"journal":{"name":"Proceedings of Annual Symposium on Fault Tolerant Computing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117121807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}