Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Latest articles, page 3

Progress in real-time fault tolerance

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353010

P. Melliar-Smith, L. Moser

Citations: 12

Skewed checkpointing for tolerating multi-node failures

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353012

Hiroshi Nakamura, T. Hayashida, Masaaki Kondo, Yuya Tajima, Masashi Imai, T. Nanya

Abstract: Large cluster systems have become widely utilized because they achieve a good performance/cost ratio especially in high performance computing. Although these cluster systems are distributed memory systems, coordinated checkpointing is a promising way to maintain high availability because the computing nodes are tightly connected to one another. However, as the number of computing nodes gets larger, the probability of multi-node failures increases. To tolerate multi-node failures, a large degree of redundancy is required in checkpointing, but this leads to performance degradation. Thus, we propose a new coordinated checkpointing called skewed checkpointing. In this method, checkpointing is skewed every time. Although each checkpointing itself contains only one degree of redundancy, this skewed checkpointing ensures $\lfloor \log_2 N \rfloor$ degrees of redundancy when the number of nodes is N. In this paper, we present the proposed method and an analysis of the performance overhead. Then, this method is applied to a cluster system and compared with other conventional checkpointing schemes. The results reveal the superiority of our method, especially for large cluster systems.
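
The redundancy bound quoted in the abstract above, floor(log2 N) for N nodes, can be tabulated for a few cluster sizes. The sketch below only evaluates that formula. The function name `redundancy_degrees` and the sample cluster sizes are illustrative assumptions, and the code does not attempt to implement the checkpointing scheme itself.

```python
def redundancy_degrees(num_nodes: int) -> int:
    """Return floor(log2(N)) for a cluster of N nodes (N >= 1).

    Hypothetical helper: it evaluates the bound stated in the abstract,
    nothing more.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # For a positive integer, bit_length() - 1 equals floor(log2(N)) exactly,
    # which avoids floating-point rounding near powers of two.
    return num_nodes.bit_length() - 1


if __name__ == "__main__":
    # Sample cluster sizes chosen for illustration only.
    for n in (2, 8, 100, 1024):
        print(f"N = {n:5d} -> {redundancy_degrees(n)} degrees of redundancy")
```

Because the bound grows logarithmically, doubling the number of nodes adds only one degree of redundancy under this formula.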

Citations: 5

The design and evaluation of a defense system for Internet worms

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353017

R. Scandariato, J. Knight

Citations: 8

An integrated architecture for dependable embedded systems

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353016

H. Kopetz

Abstract: Summary form only given. A federated architecture is characterized in that every major function of an embedded system is allocated to a dedicated hardware unit. In a distributed control system this implies that adding a new function is tantamount to adding a new node. This has led to a order to achieve some functional coordination. Adding fault-tolerance to a federated architecture, e.g., by the provision of triple modular redundancy (TMR) leads to a further significant increase in the number of nodes and networks. The major advantages of a dedicated architecture are the physical encapsulation of the nearly autonomous subsystems, their outstanding fault containment and their clear-cut complexity management (independent development) in case the subsystems are nearly autonomous. An integrated distributed architecture for mixed-criticality applications must be based on a core design that supports the safety requirements of the highest considered criticality class. This is of particular importance in safety-critical applications, where the physical structure of the integrated system is determined to a significant extent by the independence requirement of fault-containment regions. The central part of an integrated distributed architecture for time-critical systems must provide the following core services: deterministic and timely transport of messages; fault tolerant clock synchronization; strong fault isolation with respect to arbitrary node failures; and consistent diagnosis of failing nodes. Any architecture that provides these core services can be used as a base architecture for an integrated distributed embedded system architecture. An example of such an integrated architecture is the time-triggered architecture (TTA). In this contribution we describe the structure and the services of the TTA that has been developed during the last twenty years and is deployed in a number of safety-critical applications in the transport sector.

Citations: 16

An efficient checkpointing protocol for the minimal characterization of operational rollback-dependency trackability

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353013

Islene C. Garcia, L. E. Buzato

Citations: 15

Using program analysis to identify and compensate for nondeterminism in fault-tolerant, replicated systems

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353026

Joseph G. Slember, P. Narasimhan

Citations: 11

Low latency probabilistic broadcast in wide area networks

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353030

J. Pereira, L. Rodrigues, A. Pinto, R. Oliveira

Citations: 18

Model-based validation of an intrusion-tolerant information system

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353019

F. Stevens, T. Courtney, Sankalp Singh, A. Agbaria, J. F. Meyer, W. Sanders, P. Pal

Citations: 79

Nested objects in a Byzantine quorum-replicated system

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-10-18 DOI: 10.1109/RELDIS.2004.1353005

Charles P. Fry, M. Reiter

Citations: 14

Performance comparison of a rotating coordinator and a leader based consensus algorithm

Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004. Pub Date : 2004-03-30 DOI: 10.1109/RELDIS.2004.1352999

P. Urbán, Naohiro Hayashibara, A. Schiper, T. Katayama

Abstract: Protocols that solve agreement problems are essential building blocks for fault tolerant distributed systems. While many protocols have been published, little has been done to analyze their performance, especially the performance of their fault tolerance mechanisms. In this paper, we compare two well-known asynchronous consensus algorithms. In both algorithms, a leader process tries to impose a decision, and another leader retries if the leader fails doing so. The algorithms elect leaders differently: the Chandra-Toueg algorithm has a rotating leader, whereas processes in the Paxos algorithm elect leaders directly. We investigate the performance implications of this difference. In the system under study, processes send atomic broadcasts to each other. Consensus is used to decide the delivery order of messages. We evaluate the steady state latency in (1) runs with neither crashes nor suspicions, (2) runs with crashes and (3) runs with no crashes in which correct processes are wrongly suspected to have crashed, as well as the transient latency after (4) one crash and (5) multiple correlated crashes. The results show that the Paxos algorithm tolerates frequent wrong suspicions (3) and correlated crashes (5) better, while the performance is comparable in all other scenarios.
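
The difference between a rotating leader and a directly elected leader, as contrasted in the abstract above, can be illustrated with a small sketch. The rotating rule `(r mod n) + 1` is the one commonly given for the Chandra-Toueg algorithm with processes numbered 1..n. The direct-election rule below (lowest-numbered process not currently suspected) is only an illustrative assumption, since the abstract does not say how leaders are elected in the evaluated Paxos implementation. Both function names are hypothetical.

```python
from typing import Iterable, Set


def rotating_coordinator(round_number: int, num_processes: int) -> int:
    """Coordinator of a round under rotation: (r mod n) + 1, processes 1..n."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return (round_number % num_processes) + 1


def elected_leader(process_ids: Iterable[int], suspected: Set[int]) -> int:
    """Assumed election rule: the lowest-numbered process not suspected."""
    candidates = [p for p in sorted(process_ids) if p not in suspected]
    if not candidates:
        raise ValueError("every process is suspected")
    return candidates[0]


if __name__ == "__main__":
    n = 4
    suspected = {1, 2}  # hypothetical failure-detector output

    # Rotation visits every process in turn, including suspected ones,
    # so some rounds are assigned to processes that may have crashed.
    print([rotating_coordinator(r, n) for r in range(6)])

    # Direct election skips suspected processes in a single step.
    print(elected_leader(range(1, n + 1), suspected))
```

Under these assumptions, rotation may spend several rounds on suspected processes before reaching a live one, while direct election moves past them at once. The abstract reports that Paxos copes better with correlated crashes; this sketch only illustrates the structural difference and does not reproduce the paper's measurements.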

Citations: 31