{"title":"Recoverable distributed shared virtual memory: memory coherence and storage structures","authors":"Kun-Lung Wu, W. Fuchs","doi":"10.1109/FTCS.1989.105629","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105629","url":null,"abstract":"An examination is made of the problem of implementing rollback recovery in multicomputer distributed shared virtual memory environments, in which the shared memory is implemented in software and exists only virtually. A user-transparent checkpointing recovery scheme and a twin-page disk storage management are presented to implement a recoverable distributed shared virtual memory. The checkpointing scheme is integrated with the shared virtual memory management. The twin-page disk approach allows incremental checkpointing without an explicit 'undo' at the time of recovery. A single consistent checkpoint state is maintained on stable disk storage. The recoverable distributed shared virtual memory allows the system to restart computation from a previous checkpoint after a processor failure without a global restart.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129333970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of fault-tolerant systems with nonhomogeneous workloads","authors":"B. Aupperle, J. F. Meyer, Lu Wei","doi":"10.1109/FTCS.1989.105560","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105560","url":null,"abstract":"A methodology is presented for evaluating fault-tolerant systems when workloads and fault arrivals are not time-homogeneous. Of particular interest are systems whose environments vary considerably between different utilization phases of random duration. In such cases, evaluations of overall system performability must account for the corresponding differences in workload effects, especially with regard to fault recovery. The proposed methodology uses analytic techniques based on Markov processes and stochastic activity networks. Examples of evaluation studies, using this approach, are presented. These include evaluation of a system wherein self-exercising is varied between phases of passive and active use.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114332309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detailed modeling of fault-tolerant processor arrays","authors":"N. Lopez-Benitez, J. Fortes","doi":"10.1109/FTCS.1989.105633","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105633","url":null,"abstract":"Detailed modeling of fault-tolerant processor arrays entails not only an explosive growth in the model state space but also a difficult model construction process. The latter problem is addressed, and a systematic method to construct Markov models for evaluating the reliability of processor arrays is proposed. This method is based on the premise that the fault behavior of a processor array can be modeled by a stochastic Petri net. However, in order to obtain a more compact representation, a set of attributes is associated with each transition in the Petri net model. This set of attributes allows the construction of the corresponding Markov model as the generation of the reachability graph takes place. Included in these attributes is a discrete probability distribution such that the effect of faulty spares in the reconfiguration algorithm is captured each time a configuration change occurs. This distribution includes the probabilities of survival given that a number of components required by the reconfiguration process are faulty. Depending on the type of component and the reconfiguration scheme, probabilities of survival are determined using simulation or closed-form expressions.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124738360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A theoretical investigation of generalized voters for redundant systems","authors":"Paul R. Lorczak, A. Caglayan, D. Eckhardt","doi":"10.1109/FTCS.1989.105617","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105617","url":null,"abstract":"The authors generalize several commonly used voting techniques to arbitrary N-version systems with arbitrary output types using a metric space framework. In particular, they introduce the generalized median voter, which extends the thresholdless midvalue selection technique to arbitrary metric spaces and obviates most of the problems associated with inexact voting. They also introduce the formalized majority voter, which allows an inexact notion of equality between version outputs using a threshold. The authors then show that the median output determined by the generalized median voter will always be contained in the set of consensus outputs produced by the formalized majority voter. In addition, the authors introduce the formalized plurality voter which generalizes two-out-of-N type voters and the weighted averaging voter which generalizes dynamic voting. The performance of these voters under different postulated failure scenarios is compared.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"237 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122119504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The fault tolerance approach of the Advanced Architecture Onboard Processor","authors":"M. Iacoponi, D. Vail","doi":"10.1109/FTCS.1989.105535","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105535","url":null,"abstract":"The Advanced Architecture Onboard Processor is a fault-tolerant multiprocessor for space applications that is based on a fault-tolerant chordal skip-link ring interconnect network. Low-power self-checking circuits in each processor node are combined with distributed reconfiguration control and local rollback recovery to achieve robust fault tolerance within spacecraft weight and power constraints. A ten-processor-node breadboard has been completed. The approach to fault tolerance and the tradeoff analysis leading to the selected implementation are covered. Analytical trade study results such as redundancy overhead as a function of system partitioning for the chordal skip-link ring are discussed.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130441647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dependable onboard computer systems with a new method-stepwise negotiating voting","authors":"N. Kanekawa, H. Maejima, H. Kato, H. Ihara","doi":"10.1109/FTCS.1989.105536","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105536","url":null,"abstract":"An algorithm for software voting, called stepwise negotiating voting, which can tolerate the faults in up to N-1 subsystems is introduced. The voter behaves as if it were a majority voter if the number of remaining subsystems is sufficient for majority voting, and standby redundancy is realized if the number of remaining subsystems becomes insufficient. With this voting method, the system can survive if more than one subsystem remains. The authors introduce a method for evaluating the dependability of systems. It is based on the viewpoint that not only the hardware reliability but also the reliability of data processing is important. It is assumed that only transient faults take place in the software behavior. The authors' concept can be applied to computers in critical application fields, such as space development or engine control.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130560460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replication within atomic actions and conversations: a case study in fault-tolerance duality","authors":"L. Mancini, S. Shrivastava","doi":"10.1109/FTCS.1989.105619","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105619","url":null,"abstract":"Recently a duality mapping for fault-tolerant system structures was proposed by the authors (1985). Two canonical models of distributed fault-tolerant systems have been constructed and shown to be duals of each other. One model incorporates objects and atomic actions as the entities for program construction, whereas the second model uses communicating processes with conversations. As a consequence of the duality, techniques and mechanisms which have been developed within the domain of just one of the models can be mapped and applied to the other model. This point is illustrated by mapping some well-known object replication techniques developed within the context of an object and actions model to the communicating process model, thereby revealing some interesting process replication techniques.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"79 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114016786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An economical scan design for sequential logic test generation","authors":"K. Cheng, V. Agrawal","doi":"10.1109/FTCS.1989.105539","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105539","url":null,"abstract":"A method of partial scan design in which the selection of scan flip-flops is aimed at breaking up the cyclic structure of the circuit is presented. Experimental data are given to show that the test generation complexity may grow exponentially with the length of the cycles in the circuit. This complexity grows only linearly with sequential depth. Graph-theoretic algorithms are presented to select a minimal set of flip-flops for eliminating cycles to reduce sequential depth. Tests for the resulting circuit can be efficiently generated by a sequential logic test generator. An independent control of the scan clock allows the insertion of scan sequences within the vector sequence produced by the test generator. Experimental results on a 5000 gate circuit show that a test coverage above 98% could be obtained by scanning just 5% of the flip-flops. In addition, the authors give the design of a scan flip-flop to reduce the input pin and signal routing overheads in a single-clock design.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133144747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed syndrome decoding for regular interconnected structures","authors":"Arun Kumar Somani, V. Agarwal","doi":"10.1109/FTCS.1989.105545","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105545","url":null,"abstract":"Distributed syndrome decoding algorithms to locate faulty PEs (processing elements) in large-scale regular interconnected structures based on the concepts of system-level diagnosis are developed. These algorithms operate in a systolic manner to locate the faulty processors. The computational complexities of these algorithms are either linear or sublinear, depending on the architecture of the system. Their implementation complexities and diagnosis capabilities differ substantially. The conditions that a fault pattern should satisfy for correct and complete diagnosis and the maximum global size of fault sets which can be diagnosed successfully using these algorithms are also identified.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125637511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clock synchronization in MAFT","authors":"Philip M. Thambidurai, A. Finn, R. Kieckhafer, C. Walter","doi":"10.1109/FTCS.1989.105557","DOIUrl":"https://doi.org/10.1109/FTCS.1989.105557","url":null,"abstract":"The steady-state clock synchronization algorithm of MAFT (multicomputer architecture for fault tolerance), an extremely reliable system for real-time applications, is discussed. The synchronization algorithm has been implemented in hardware and a system prototype constructed. The algorithm uses an interactive convergence approach, based on synchronized rounds of message transmission. The authors derive the maximum skew between nonfaulty clocks in terms of basic system parameters. The problem of detecting clock faults is also addressed, with attention to the minimum amount of synchronization error guaranteed to be unambiguously detected. The authors discuss the various practicalities which arise in the implementation of the algorithm as an integrated part of the whole system. Relationships between the synchronization subsystem and the total system are discussed.<<ETX>>","PeriodicalId":230363,"journal":{"name":"[1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129190168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}