{"title":"Self diagnosis of processor arrays using a comparison model","authors":"P. Maestrini, P. Santi","doi":"10.1109/RELDIS.1995.526229","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526229","url":null,"abstract":"This paper introduces a diagnosing algorithm for bidimensional processor arrays, where processors are interconnected in horizontal and vertical meshes. For the purpose of diagnosis, the array is considered to be partitioned into square clusters of processors. The algorithm is based on interprocessor tests, using a comparison model. The algorithm, which is divided into four steps, called intracluster diagnosis, intercluster diagnosis, fault-free core identification and augmentation, identifies a set of non-faulty and a set of faulty units. The diagnosis is proved to be correct in the worst case, assuming that the actual number of faulty processors is no more than T(N), an increasing function of the number N of processors. It is shown that T(N) is O(N^{2/3}). Although correct, the diagnosis is generally incomplete. However, using probabilistic techniques, it is shown that the diagnosis is very likely to be complete under the same limitations which ensure correctness in the worst case.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"42 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132728735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Membership and system diagnosis","authors":"M. Hiltunen","doi":"10.1109/RELDIS.1995.526228","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526228","url":null,"abstract":"A membership service is a service in a distributed system that maintains and provides information about which sites are functioning and which have failed at any given time. System diagnosis, on the other hand, is a method for detecting faulty processing elements and distributing this information to non-faulty elements. In spite of the apparent similarity of goals, these two fields have been considered separately from their beginnings. In this paper, we attempt to compare these fields and show the fundamental differences and the similarities. We demonstrate that the problems are closely related with the major differences being the assumptions made about the failure model, the testing methods, and the type of service guarantees provided to the application. Furthermore, we demonstrate that the fields are closely enough related that some algorithms utilized in one field can easily be transformed into algorithms in the other. As examples, we derive new membership algorithms from a distributed system diagnosis algorithm and new system diagnosis algorithms from membership algorithms.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132899118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"System support for robust collaborative applications","authors":"M. Chelliah, M. Ahamad","doi":"10.1109/RELDIS.1995.526214","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526214","url":null,"abstract":"Traditional transaction models ensure robustness for distributed applications through the properties of view and failure atomicity. It has generally been felt that such atomicity properties are restrictive for a wide range of application domains; this is particularly true for robust, collaborative applications because such applications have concurrent components that are inherently long-lived and that cooperate. Recent advances in extended transaction models can be exploited to structure long-lived and cooperative computations. Applications can use a combination of such models to achieve the desired degree of robustness; hence, we develop a system which can support a number of flexible transaction models, with correctness criteria that extend or relax serializability. We analyze two concrete CSCW applications: a collaborative editor and a meeting scheduler. We show how a combination of two extended transaction models, which promote split and cooperating actions, facilitates robust implementations of these collaborative applications. Thus, we conclude that a system that implements multiple transaction models provides flexible support for building robust collaborative applications.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130461469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TMR processing without explicit clock synchronisation","authors":"F. Brasileiro, P. Ezhilchelvan, N. Speirs","doi":"10.1109/RELDIS.1995.526226","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526226","url":null,"abstract":"Replicated processing with majority voting is a well known method for achieving fault tolerance. Triple Modular Redundant (TMR) processing is the most commonly used version of that method. Replicated processing requires that the replicas reach agreement on the order in which messages are to be processed. Synchronous and deterministic ordering protocols published in the literature require that the replicas maintain an abstraction of clocks that are kept in known and bounded synchronism. We present a protocol for TMR systems that does not require this abstraction of synchronised clocks. We analyse the protocol performance and show that this protocol in practice can be at least as fast as any synchronised clock based ordering protocol. We also derive a faster protocol that has an improved performance in the absence of processor failures. We then build a TMR node and measure its performance to illustrate that the protocols developed here provide faster ordering and are easier to implement.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131337564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non blocking atomic commitment with an unreliable failure detector","authors":"R. Guerraoui, M. Larrea, A. Schiper","doi":"10.1109/RELDIS.1995.518722","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.518722","url":null,"abstract":"In a transactional system, an atomic commitment protocol ensures that for any transaction, all data manager processes agree on the same outcome (commit or abort). A non-blocking atomic commitment protocol enables an outcome to be decided at every correct process despite the failure of others. In this paper we apply, for the first time, the fundamental result of T. Chandra and S. Toueg (1991) on solving the abstract consensus problem, to non-blocking atomic commitment. More precisely, we present a non-blocking atomic commitment protocol in an asynchronous system augmented with an unreliable failure detector that can make an infinity of false failure suspicions. If no process is suspected to have failed, then our protocol is similar to a three phase commit protocol. In the case where processes are suspected, our protocol does not require any additional termination protocol: failure scenarios are handled within our regular protocol and are thus much simpler to manage.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121816814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A method for the construction and interpretation of high level models for distributed fault-tolerant systems","authors":"K. Tilly, István Kiss, G. Román, T. Dobrowiecki, A. Várkonyi-Kóczy","doi":"10.1109/RELDIS.1995.526215","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526215","url":null,"abstract":"Traditional solutions for achieving fault-tolerance are intended for use at design time and they generally capture system information at a very low (hardware or machine instruction) level. Increasing reliability of complex information systems containing many (perhaps many thousands) of autonomous components requires different solutions. This article presents a new methodology for the implementation of large scale, distributed fault-tolerant systems. System models are formed of objects describing requirements, services and resources organized into high level top-down hierarchical decomposition structures. Since redundancy is a natural property of any large scale system, by using such models it is possible to achieve fault tolerant behaviour by finding multiple appropriate mappings between requirements and available services, and to support the required services by available resources. The distributed system is extended with dedicated components, called diagnostic centres, which manage distinct parts of the system model, continuously observe the operation of the distributed system, and find alternative requirement-service mappings, if some services fail to fulfil their associated requirements. The elements and the structure of the proposed system modelling method are presented, an appropriate fault model is defined, and the algorithms for model interpretation are described.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127160731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experimental evaluation of the impact of processor faults on parallel applications","authors":"D. Costa, F. Moreira, H. Madeira, M. Z. Rela, J. G. Silva","doi":"10.1109/RELDIS.1995.518719","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.518719","url":null,"abstract":"This paper addresses the problem of processor faults in distributed memory parallel systems. It shows that transient faults injected at the processor pins of one node of a commercial parallel computer, without any particular fault-tolerant techniques, can cause erroneous application results for up to 43% of the injected faults (depending on the application). In addition to these very subtle faults, up to 19% of the injected faults (almost independent of the application) caused the system to hang up. These results show that fault-tolerant techniques are absolutely required in parallel systems, not only to ensure the completion of long-run applications but, more importantly, to achieve confidence in the application results. The benefits of including some fairly simple behaviour-based error detection mechanisms in the system were evaluated together with Algorithm Based Fault Tolerance (ABFT) techniques. The inclusion of such mechanisms in parallel systems seems to be very important for detecting most of those subtle errors without greatly affecting the performance and the cost of these systems.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133725197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A synchronization strategy for a time-triggered multicluster real-time system","authors":"H. Kopetz, A. Krüger, D. Millinger, A. Schedl","doi":"10.1109/RELDIS.1995.526223","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526223","url":null,"abstract":"The provision of a system-wide global time base with a good precision and sufficient accuracy is a fundamental prerequisite for the design of a multicluster distributed real-time system. We investigate the issues of clock synchronization in a multicluster system, where every node can have a different oscillator. Based on the parameters of a typical automotive distributed system we show that a precision and accuracy in the microsecond range is achievable without undue effort.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122789521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Configurable highly available distributed services","authors":"C. Karamanolis, J. Magee","doi":"10.1109/RELDIS.1995.526219","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526219","url":null,"abstract":"The paper addresses the problem of providing highly available services in distributed systems. In particular, we examine the situation where a service may be used by a large continuously changing set of clients. The requirements for providing services in this environment are analysed and an architecture and partial implementation for a replicated server group meeting a range of client requirements is presented. The architecture facilitates the dynamic configuration management of the replicated server group, while maintaining the service. Dynamic configuration management is required in order to replace failed replicas, upgrade the server implementation, or change the availability characteristics of the service. The paper reports on initial implementation results.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"601 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123193195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing masking fault-tolerance via nonmasking fault-tolerance","authors":"A. Arora, S. Kulkarni","doi":"10.1109/RELDIS.1995.526225","DOIUrl":"https://doi.org/10.1109/RELDIS.1995.526225","url":null,"abstract":"Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults. By way of contrast, nonmasking fault-tolerance does not guarantee as much: it merely guarantees that when faults stop occurring, program executions converge to states from where programs continually (re)satisfy their specification. In this paper, we show that a practical method to design masking fault-tolerance is to first design nonmasking fault-tolerance and to then transform the nonmasking fault-tolerant program minimally so as to achieve masking fault-tolerance. We demonstrate this method by designing novel fully distributed programs for termination detection, mutual exclusion, and leader election, that are masking tolerant of any finite number of process fail-stops and/or repairs.","PeriodicalId":275219,"journal":{"name":"Proceedings. 14th Symposium on Reliable Distributed Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129303136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}