{"title":"Failure handling in a reliable multicast protocol for improving buffer utilization and accommodating heterogeneous receivers","authors":"G. Khanna, S. Bagchi, J. Rogers","doi":"10.1109/PRDC.2004.1276548","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276548","url":null,"abstract":"Reliable multicast protocols are an important class of protocols for reliably disseminating information from a sender to multiple receivers in the face of node and link failures. A tree-based reliable multicast protocol (TRAM) provides scalable reliable multicast by grouping receivers in hierarchical repair groups and using a selective acknowledgment mechanism. We present an improvement to TRAM to minimize the resource utilization at intermediate hosts and to localize the effect of slow or malicious receivers on normal receivers. We present an evaluation of TRAM and TRAM++ on a campus-wide WAN without errors and with message errors. The evaluation brings out that, given a constraint on the buffer availability at intermediate hosts, TRAM++ can tolerate the constraint at the expense of increasing the end-to-end latency for the normal receivers by only 3.2% compared to TRAM in error-free cases. When slow or faulty receivers are present, TRAM++ is able to provide the same uninterrupted quality of service to the normal nodes while localizing the effect of the faulty ones without incurring any additional memory overhead.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. 
Proceedings.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121165994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expected-reliability analysis for wireless CORBA with imperfect components","authors":"Xinyu Chen, Michael R. Lyu","doi":"10.1109/PRDC.2004.1276571","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276571","url":null,"abstract":"Reliability analysis has long been an important area of research for wired networks. However, little reliability analysis has been conducted on wireless networks. Wireless networks, such as wireless CORBA, inherit the unique handoff characteristic which leads to different communication structures with various types and numbers of components and links. Therefore, the traditional definition of two-terminal reliability is not applicable any more. We propose a new term, two-terminal expected-reliability, to integrate those different communication structures into one metric, which includes not only the failure parameters but also the service parameters. Nevertheless, the two-terminal expected-reliability is still a monotonically decreasing function of time t. The expected-reliability and the corresponding MTTF are evaluated quantitatively in different communication schemes. To observe the gains in reliability improvement, the reliability importances of imperfect components are also evaluated. The results show that the failure parameters of different components take different effects on the MTTF and on the reliability importance. With different expected working times of a system, the focus of reliability improvement should be transferred to different components. Although our analysis is conducted on wireless CORBA platforms, it is extensible to generic wireless network systems.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. 
Proceedings.","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116864612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hybrid approach for building eventually accurate failure detectors","authors":"A. Mostéfaoui, D. Powell, M. Raynal","doi":"10.1109/PRDC.2004.1276553","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276553","url":null,"abstract":"Unreliable failure detectors introduced by Chandra and Toueg are abstract mechanisms that provide information about process crashes. On the one hand, failure detectors allow a statement of the minimal requirements on process failures that allow solutions to problems that cannot otherwise be solved in purely asynchronous systems. However, on the other hand, they cannot be implemented in such systems: their implementation requires that the underlying distributed system be enriched with additional assumptions. Classic failure detector implementations rely on additional synchrony assumptions such as partial synchrony. More recently, a new approach for implementing failure detectors has been proposed: it relies on behavioral properties on the flow of messages exchanged. This shows that these approaches are not antagonistic and can be advantageously combined. A hybrid protocol (the first to our knowledge) implementing failure detectors with eventual accuracy properties is presented. Interestingly, this protocol benefits from the best of both worlds in the sense that it converges (i.e., provides the required failure detector) as soon as either the system behaves synchronously or the required message exchange pattern is satisfied. This shows that, to expedite convergence, it can be interesting to consider that the underlying system can satisfy several alternative assumptions.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. 
Proceedings.","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121916899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dependability analysis of a class of probabilistic Petri nets","authors":"H. Yen, Lien-Po Yu","doi":"10.1109/PRDC.2004.1276593","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276593","url":null,"abstract":"Verification of various properties associated with concurrent/distributed systems is critical in the process of designing and analyzing dependable systems. While techniques for the automatic verification of finite-state systems are relatively well studied, one of the main challenges in the domain of verification is concerned with the development of new techniques capable of coping with problems beyond the finite state framework. We investigate a number of problems closely related to dependability analysis in the context of probabilistic infinite-state systems modelled by probabilistic conflict-free Petri nets. Using a valuation method, we are able to demonstrate effective procedures for solving the termination with probability 1, the self-stabilization with probability 1, and the controllability with probability 1 problems in a unified framework.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.","volume":"20 9-10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123584494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An algorithmic approach to identifying link failures","authors":"Mohit Lad, Akash Nanavati, D. Massey, Lixia Zhang","doi":"10.1109/PRDC.2004.1276549","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276549","url":null,"abstract":"Due to the Internet's sheer size, complexity, and various routing policies, it is difficult if not impossible to locate the causes of large volumes of BGP update messages that occur from time to time. To provide dependable global data delivery we need diagnostic tools that can pinpoint the exact connectivity changes. We describe an algorithm, called MVSChange that can pin down the origin of routing changes due to any single link failure or link restoration. Using a simplified model of BGP, called simple path vector protocol (SPVP), and a graph model of the Internet, MVSChange takes as input the SPVP update messages collected from multiple vantage points and accurately locates the link that initiated the routing changes. We provide theoretical proof for the correctness of the design.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.","volume":"293 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132208162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using modulo rulers for optimal recovery schemes in distributed computing","authors":"Kamilla Klonowska, L. Lundberg, H. Lennerstad, Charlie Svahnberg","doi":"10.1109/PRDC.2004.1276564","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276564","url":null,"abstract":"Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers break down the load on these computers must be redistributed to other computers in the cluster. The redistribution is determined by the recovery scheme. The recovery scheme should keep the load as evenly distributed as possible even when the most unfavorable combinations of computers break down, i.e. we want to optimize the worst-case behavior. We define recovery schemes, which are optimal for a larger number of computers down than in previous results. We also show that the problem of finding optimal recovery schemes for a cluster with n computers corresponds to the mathematical problem of finding the longest sequence of positive integers for which the sum of the sequence and the sums of all subsequences modulo n are unique.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130093146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards dependable Web services","authors":"Markus Debusmann, K. Geihs","doi":"10.1109/PRDC.2004.1276547","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276547","url":null,"abstract":"Web services are the key technology for implementing distributed enterprise level applications such as B2B and grid computing. An important goal is to provide dependable quality guarantees for client-server interactions. Therefore, service level management (SLM) is gaining more and more significance for clients and providers of Web services. The first step to control service level agreements is a proper instrumentation of the application code in order to monitor the service performance. However, manual instrumentation of Web services is very costly and error-prone and thus not very efficient. Our goal was to develop a systematic and automated, tool-supported approach for Web services instrumentation. We present a dual approach for efficiently instrumenting Web services. It consists of instrumenting the frontend Web services platform as well as the backend services. Although the instrumentation of the Web services platform necessarily is platform-specific, we have found a general, reusable approach. On the backend-side aspect-oriented programming techniques are successfully applied to instrument backend services. We present experimental studies of performance instrumentation using the application response measurement (ARM) API and evaluate the efficiency of the monitoring enhancements. Our results point the way to systematically gain better insights into the behaviour of Web services and thus how to build more dependable Web-based applications.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. 
Proceedings.","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121920356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache scrubbing in microprocessors: myth or necessity?","authors":"Shubhendu S. Mukherjee, J. Emer, T. Fossum, S. Reinhardt","doi":"10.1109/PRDC.2004.1276550","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276550","url":null,"abstract":"Transient faults from neutron and alpha particle strikes in large SRAM caches have become a major problem for microprocessor designers. To protect these caches, designers often use error correcting codes (ECC), which typically provide single-bit error correction and double-bit error detection (SECDED). Unfortunately, two separate strikes could still flip two different bits in the same ECC-protected word. This we call a temporal double-bit error. SECDED ECC can only detect, not correct such errors. We show how to compute the mean time to failure for temporal double-bit errors. Additionally, we show how fixed-interval scrubbing - in which error checkers periodically access cache blocks and remove single-bit errors - can mitigate such errors in processor caches. Our analysis using current soft error rates shows that only very large caches (e.g., hundreds of megabytes to gigabytes) need scrubbing to reduce the temporal double-bit error rate to a tolerable range.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132545633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-level fault tolerance in the orbital thermal imaging spectrometer","authors":"E. Ciocca, I. Koren, Z. Koren, C. M. Krishna, D. Katz","doi":"10.1109/PRDC.2004.1276551","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276551","url":null,"abstract":"Systems that operate in extremely volatile environments, such as orbiting satellites, must be designed with a strong emphasis on fault tolerance. Rather than rely solely on the system hardware, it may be beneficial to entrust some of the fault handling to software at the application level, which can utilize semantic information and software communication channels to achieve fault tolerance with considerably less power and performance overhead. We show the implementation and evaluation of such a software-level approach, application-level fault tolerance and detection (ALFTD) into the orbital thermal imaging spectrometer (OTIS).","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings.","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116634116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying the variance in application reliability","authors":"S. Gokhale","doi":"10.1109/PRDC.2004.1276562","DOIUrl":"https://doi.org/10.1109/PRDC.2004.1276562","url":null,"abstract":"A notable drawback of the existing architecture-based reliability assessment techniques is that they only obtain a point estimate of application reliability and do not attempt to quantify the variance in the estimate. The variance in the reliability estimate of an application represents the risk associated with the estimate. Ideally, the variance should be zero, but in practice it is inevitable due to the following two factors: (i) variances in the reliability estimates of components comprising the application, and (ii) architectural characteristics of the application. Quantifying the variance in the reliability estimate of an application provides an indication of the degree of risk associated with the estimate, and can also suggest an appropriate variance reduction strategy. We present a technique to quantify the variance in the reliability estimate of an application based on its architecture. Our technique generates analytical functions which express the mean and variance of application reliability in terms of the means and variances of the component reliabilities as well as the architectural characteristics of the application. Through a case study, we illustrate how the analytical functions generated using our technique could be used to evaluate the impact of individual components on the mean and the variance in the application reliability.","PeriodicalId":383639,"journal":{"name":"10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. 
Proceedings.","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129733567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}