{"title":"On the progress in fault-tolerant real-time computing","authors":"P. Ezhilchelvan","doi":"10.1109/RELDIS.2004.1353008","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353008","url":null,"abstract":"Measuring progress in terms of industrial take-up, the paper takes a view that the progress in FT RT computing has been significant, and that the new dominant application domains do not allow the 'ingredients' attributable to past success to be re-used at the same level they were once used. Consequently, FT RT computing is acquiring new faces in the form of adaptive and autonomic computing.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121556034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Token-based atomic broadcast using unreliable failure detectors","authors":"Richard Ekwall, A. Schiper, P. Urbán","doi":"10.1109/RELDIS.2004.1353003","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353003","url":null,"abstract":"Many atomic broadcast algorithms have been published in the last twenty years. Token-based algorithms represent a large class of these algorithms. Interestingly, all the token-based atomic broadcast algorithms rely on a group membership service, i.e., none of them uses unreliable failure detectors directly. The paper presents the first token-based atomic broadcast algorithm that uses an unreliable failure detector - the new failure detector denoted by ℛ - instead of a group membership service. The failure detector ℛ is compared with <>V and <>S. In order to make it easier to understand the atomic broadcast algorithm, the paper derives the atomic broadcast algorithm from a token-based consensus algorithm that also uses the failure detector ℛ.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115353950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware support for high performance, intrusion- and fault-tolerant systems","authors":"G. P. Saggese, C. Basile, L. Romano, Z. Kalbarczyk, R. Iyer","doi":"10.1109/RELDIS.2004.1353020","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353020","url":null,"abstract":"The paper proposes a combined hardware/software approach for realizing high performance, intrusion- and fault-tolerant services. The approach is demonstrated for (yet not limited to) an attribute authority server, which provides a compelling application due to its stringent performance and security requirements. The key element of the proposed architecture is an FPGA-based, parallel crypto-engine providing (1) optimally dimensioned RSA Processors for efficient execution of computationally intensive RSA signatures and (2) a KeyStore facility used as tamper-resistant storage for preserving secret keys. To achieve linear speed-up (with the number of RSA Processors) and deadlock-free execution in spite of resource-sharing and scheduling/synchronization issues, we have resorted to a number of performance enhancing techniques (e.g., use of different clock domains, optimal balance between internal and external parallelism) and have formally modeled and mechanically proved our crypto-engine with the Spin model checker. At the software level, the architecture combines active replication and threshold cryptography, but in contrast with previous work, the code of our replicas is multithreaded so it can efficiently use an attached parallel crypto-engine to compute an attribute authority partial signature (as required by threshold cryptography). The resulting replicated systems exhibit nondeterministic behavior, which cannot be handled with conventional replication approaches. Our architecture is based on a preemptive deterministic scheduling algorithm to govern scheduling of replica threads and guarantee strong replica consistency.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115711519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Run-time monitoring for dependable systems: an approach and a case study","authors":"Sérgio Ricardo Rota, J. R. Almeida","doi":"10.1109/RELDIS.2004.1353002","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353002","url":null,"abstract":"This paper describes a run-time monitoring system designed for systems with the same functionality that are installed in different places and use equivalent hardware configurations, but with slightly different implementations. These systems exhibit common characteristics. They are large software systems, they depend on hardware to execute their functions, and they are usually adjusted to meet new user needs. In this scenario it is unreasonable to assume that software testing will uncover all latent errors. Besides gathering information about a target program as it executes, the proposed run-time monitoring system provides information about the target operating system and the target hardware in order to improve availability by reducing the time to diagnose failures and to provide a system with the reactive capability of reconfiguring and reinitializing after the occurrence of a failure. A case study for an automatic teller machine system is discussed as an application of the run-time monitoring system and the results from this application are presented.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116455799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dependable pervasive systems","authors":"B. Randell","doi":"10.1109/RELDIS.2004.1352998","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1352998","url":null,"abstract":"Summary form only given. Present trends indicate that huge networked computer systems are likely to become pervasive, as information technology is embedded into virtually everything, and to be required to function essentially continuously. I believe that even today's (underused) \"best practice\" regarding the achievement of high dependability - reliability, availability, security, safety, etc. - from large networked computer systems will not suffice for future pervasive systems. I will give my perspective on the current state of research into the four basic dependability technologies: (i) fault prevention (to avoid the occurrence or introduction of faults), (ii) fault removal (through validation and verification), (iii) fault tolerance (so that failures do not necessarily occur even if faults remain), and (iv) fault forecasting (the means of assessing progress towards achieving adequate dependability). I will then argue that much further research is required on all four dependability technologies in order to cope with pervasive systems, identify some priorities, and discuss how this research could best be aimed at making system dependability into a \"commodity\" that industry can value and from which it can profit.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127188379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A stability-oriented approach to improving BGP convergence","authors":"Hongwei Zhang, A. Arora, Zhijun Liu","doi":"10.1109/RELDIS.2004.1353006","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353006","url":null,"abstract":"This paper shows that the elimination of fault-agnostic instability, the instability caused by fault-agnostic distributed control, substantially improves BGP convergence speed. To this end, we first classify BGP convergence instability into two categories: fault-agnostic instability and distribution-inherent instability; secondly, we prove the impossibility of eliminating all distribution-inherent instability in distributed routing protocols; thirdly, we design the grapevine border gateway protocol (G-BGP) to show that all fault-agnostic instability can be eliminated. G-BGP eliminates all fault-agnostic instability under different fault and routing policy scenarios by (i) piggybacking onto BGP UPDATE messages fine-grained information about faults to the nodes affected by the faults, (ii) quickly resolving the uncertainty between link and node failure as well as the uncertainty of whether a node has changed route, and (iii) rejecting obsolete fault information. We have evaluated G-BGP by both analysis and simulation. Analytically, we prove that, by eliminating fault-agnostic instability, G-BGP achieves optimal convergence speed in several scenarios where BGP convergence is severely delayed (e.g., when a node or a link fail-stops), and when the shortest-path-first policy is used, G-BGP asymptotically improves BGP convergence speed except in scenarios where BGP convergence speed is already optimal (e.g., when a node or a link joins). By simulating networks with up to 115 autonomous systems, we observe that G-BGP improves BGP convergence stability and speed by an order of magnitude.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134320220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proactive hot spot avoidance for Web server dependability","authors":"P. Felber, T. Kaldewey, S. Weiss","doi":"10.1109/RELDIS.2004.1353031","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353031","url":null,"abstract":"Flash crowds, which result from the sudden increase in popularity of some online content, are among the most important problems that plague today's Internet. Affected servers are overloaded with requests and quickly become \"hot spots.\" They usually suffer from severe performance failures or stop providing service altogether, as there are scarcely any effective techniques to scalably deliver content under hot spot conditions to all requesting clients. In this paper, we propose and evaluate collaborative techniques to detect and proactively avoid the occurrence of hot spots. Using our mechanisms, groups of small- to medium-sized Web servers can team up to withstand unexpected surges of requests in a cost-effective manner. Once a Web server detects a sudden increase in request traffic, it replicates on-the-fly the affected content on other Web servers; subsequent requests are transparently redirected to the copies to offload the primary server. Each server acts both as a primary source for its own content, and as a secondary source for other servers' content in the event of a flash-crowd; scalability and dependability are therefore achieved in a peer-to-peer fashion, with each peer contributing to, and benefiting from, the service. Our proactive hot spot avoidance techniques are implemented as a module for the popular Apache Web server. We have conducted a comprehensive experimental evaluation, which demonstrates that our techniques are effective at dealing with flash crowds and scaling to very high request loads.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122832307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A signal processing approach to global predicate monitoring","authors":"N. Ghafari, R. Seviora","doi":"10.1109/RELDIS.2004.1353014","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353014","url":null,"abstract":"Global predicate evaluation is a fundamental problem in distributed systems. This paper views it from a different perspective, namely that of the signals and systems area of electrical engineering. It adapts a signal processing approach to address this problem in the context of monitoring of 'health' of a software system. The global state of the system is viewed as a 'state' signal which evolves over time. The distributed processes are assumed to possess roughly synchronized clocks. The states of individual processes are periodically sampled and reported to a global monitor. The observed system state constructed by the global monitor is viewed as being composed of two components - the consistent global states and an error signal due to the messages in transit and differences in the local clocks. The global monitor removes the error signal by processing the observed global signal through a low-pass filter. It evaluates the predicates on the filtered signal. The approach presented is applicable to distributed systems which are semi-stationary, i.e. whose internal states of interest remain stable over comparatively long intervals of time. The paper presents the relevant signal processing concepts (p-spectrum and p-filtering), outlines an architecture for global predicate monitoring and describes the signal processing done in the global monitor. The paper then summarizes an evaluation of the approach presented on a small computer aided vehicle dispatch system. The evaluation experiments are described and the results are presented and analyzed.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126570244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing the tradeoffs between data accessibility and query delay in ad hoc networks","authors":"Liangzhong Yin, G. Cao","doi":"10.1109/RELDIS.2004.1353029","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353029","url":null,"abstract":"In mobile ad hoc networks, nodes move freely and link/node failures are common. This leads to frequent network partitions, which may significantly degrade the performance of data access in ad hoc networks. When a network partition occurs, mobile nodes in one partition are not able to access data hosted by nodes in other partitions. In this paper, we deal with this problem by applying data replication techniques. Existing data replication solutions in both wired and wireless networks aim at either reducing the query delay or improving the data accessibility. As both metrics are important for mobile nodes, we propose schemes to balance the tradeoffs between data accessibility and query delay under different system settings and requirements. Simulation results show that the proposed schemes can achieve a balance between these two metrics and provide satisfying system performance.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126588252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and evaluation of a QoS-adaptive system for reliable multicasting","authors":"Antonio Di Ferdinando, P. Ezhilchelvan, I. Mitrani","doi":"10.1109/RELDIS.2004.1353001","DOIUrl":"https://doi.org/10.1109/RELDIS.2004.1353001","url":null,"abstract":"This paper presents and studies a reliable multicast protocol whose objective is to deliver a message to all intended destinations, despite possible crashes of the sender and other processes, and communication failures. The protocol enables QoS metrics such as absolute and relative latencies and the probability of reliable delivery, to be negotiated prior to service provisioning. Moreover, it adapts certain parameters dynamically in order to minimize the message traffic required to achieve the negotiated QoS metrics. The performance of the protocol is analyzed mathematically under simplifying assumptions. The accuracy of the approximations is evaluated by simulations.","PeriodicalId":142327,"journal":{"name":"Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130419266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}