{"title":"Scaling and Continuous Availability in Database Server Clusters through Multiversion Replication","authors":"Kaloian Manassiev, C. Amza","doi":"10.1109/DSN.2007.86","DOIUrl":"https://doi.org/10.1109/DSN.2007.86","url":null,"abstract":"In this paper, we study replication techniques for scaling and continuous operation for a dynamic content server. Our focus is on supporting transparent and fast reconfiguration of its database tier in case of overload or failures. We show that the data persistence aspects can be decoupled from reconfiguration of the database CPU. A lightweight in-memory middleware tier captures the typically heavyweight read-only requests to ensure flexible database CPU scaling and fail-over. At the same time, updates are handled by an on-disk database back-end that is in charge of making them persistent. Our measurements show instantaneous, seamless reconfiguration in the case of single node failures within the flexible in-memory tier for a web site running the most common, shopping, workload mix of the industry-standard e-commerce TPC-W benchmark. At the same time, a 9-node in-memory tier improves performance during normal operation over a stand-alone InnoDB on-disk database back-end. Throughput scales by factors of 14.6, 17.6 and 6.5 for the browsing, shopping and ordering mixes of the TPC-W benchmark, respectively.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131688985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault Tolerant Approaches to Nanoelectronic Programmable Logic Arrays","authors":"Wenjing Rao, A. Orailoglu, R. Karri","doi":"10.1109/DSN.2007.49","DOIUrl":"https://doi.org/10.1109/DSN.2007.49","url":null,"abstract":"Programmable logic arrays (PLA), which can implement arbitrary logic functions in a two-level logic form, are promising as platforms for nanoelectronic logic due to their highly regular structure compatible with the nano crossbar architectures. Reliability is an important challenge as far as nanoelectronic devices are concerned. Consequently, it is necessary to focus on the fault tolerance aspects of nanoelectronic PLAs to ensure their viability as a foundation for nanoelectronic systems. In this paper, we investigate two types of fault tolerance techniques for nanoelectronic device based PLAs, focusing on the online faults occurring at the cross-points of nano devices. We develop a scheme to precisely locate the faults online, as this is a crucial step for efficient online reconfiguration based fault tolerance schemes. We also propose a tautology based fault masking scheme. We demonstrate that these two types of fault tolerance schemes developed for nano PLAs significantly improve at low hardware cost the reliability of the high fault occurrence nanoelectronic environment.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"146 3-4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132062595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge Connectivity vs. Synchrony Requirements for Fault-Tolerant Agreement in Unknown Networks","authors":"F. Greve, S. Tixeuil","doi":"10.1109/DSN.2007.61","DOIUrl":"https://doi.org/10.1109/DSN.2007.61","url":null,"abstract":"In self-organizing systems, such as mobile ad-hoc and peer-to-peer networks, consensus is a fundamental building block to solve agreement problems. It contributes to coordinate actions of nodes distributed in an ad-hoc manner in order to take consistent decisions. It is well known that in classical environments, in which entities behave asynchronously and where identities are known, consensus cannot be solved in the presence of even one process crash. It appears that self-organizing systems are even less favorable because the set and identity of participants are not known. We define necessary and sufficient conditions under which fault-tolerant consensus becomes solvable in these environments. Those conditions are related to the synchrony requirements of the environment, as well as the connectivity of the knowledge graph constructed by the nodes in order to communicate with their peers.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126615278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Superscalar Processor Performance Enhancement through Reliable Dynamic Clock Frequency Tuning","authors":"Viswanathan Subramanian, M. Bezdek, N. D. Avirneni, Arun Kumar Somani","doi":"10.1109/DSN.2007.90","DOIUrl":"https://doi.org/10.1109/DSN.2007.90","url":null,"abstract":"Synchronous circuits are typically clocked considering worst case timing paths so that timing errors are avoided under all circumstances. In the case of a pipelined processor, this has special implications since the operating frequency of the entire pipeline is limited by the slowest stage. Our goal, in this paper, is to achieve higher performance in superscalar processors by dynamically varying the operating frequency during run time past worst case limits. The key objective is to see the effect of overclocking on superscalar processors for various benchmark applications, and analyze the associated overhead, in terms of extra hardware and error recovery penalty, when the clock frequency is adjusted dynamically. We tolerate timing errors occurring at speeds higher than what the circuit is designed to operate at by implementing an efficient error detection and recovery mechanism. We also study the limitations imposed by minimum path constraints on our technique. Experimental results show that an average performance gain up to 57% across all benchmark applications is achievable.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127099585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Cost of Modularity in Atomic Broadcast","authors":"Olivier Rütti, Sergio Mena, Richard Ekwall, A. Schiper","doi":"10.1109/DSN.2007.69","DOIUrl":"https://doi.org/10.1109/DSN.2007.69","url":null,"abstract":"Modularity is a desirable property of complex software systems, since it simplifies code reuse, verification, maintenance, etc. However, the use of loosely coupled modules introduces a performance overhead. This overhead is often considered negligible, but this is not always the case. This paper aims at casting some light on the cost, in terms of performance, that is incurred when designing a relevant group communication protocol with modularity in mind: atomic broadcast. We conduct our experiments using two versions of atomic broadcast: a modular version and a monolithic one. We then measure the performance of both implementations under different system loads. Our results show that the overhead introduced by modularity is strongly related to the level of stress to which the system is subjected, and in the worst cases, reaches approximately 50%.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127775867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Framework for Architecture-Level Lifetime Reliability Modeling","authors":"Jeonghee Shin, V. Zyuban, Zhigang Hu, J. Rivers, P. Bose","doi":"10.1109/DSN.2007.8","DOIUrl":"https://doi.org/10.1109/DSN.2007.8","url":null,"abstract":"This paper tackles the issue of modeling chip lifetime reliability at the architecture level. We propose a new and robust structure-aware lifetime reliability model at the architecture-level, where devices only vulnerable to failure mechanisms and the effective stress condition of these devices are taken into account for the failure rate of microarchitecture structures. In addition, we present this reliability analysis framework based on a new concept, called the FIT of reference circuit or FORC, which allows architects to quantify failure rates without having to delve into low-level circuit- and technology-specific details of the implemented architecture. This is done through a one-time characterization of a reference circuit needed to quantify the reference FITs for each class of modeled failure mechanisms for a given technology and implementation style. With this new reliability modeling framework, architects are empowered to proceed with architecture-level reliability analysis independent of technological and environmental parameters.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115774485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Insights into the Sensitivity of the BRAIN (Braided Ring Availability Integrity Network)--On Platform Robustness in Extended Operation","authors":"M. Paulitsch, B. Hall","doi":"10.1109/DSN.2007.60","DOIUrl":"https://doi.org/10.1109/DSN.2007.60","url":null,"abstract":"Low-cost fault-tolerant systems design presents a continual trade-off between improving fault-tolerant properties and accommodating cost constraints. With limited hardware options and to justify the system design rationale, it is necessary to formulate a fault hypothesis to bound failure assumptions. The system must be built on a foundation of real-world relevance and the assumption of coverage of the fault hypothesis. This paper discusses a study that examines the sensitivity of a BRAIN (braided ring availability integrity network) design to different fault types and failure rates in a safety-relevant application. It presents a Markov-based model (using ASSIST, SURE, and STEM analysis tools) and a series of experiments that were run to analyze the overall dependability of the BRAIN approach. The study evaluates the mission reliability and safety in the context of a hypothetical automotive integrated x-by-wire architecture on top of the BRAIN. Drawing from experience in the aerospace domain, the authors investigate the possibility of continued operation for a limited period after a detected critical electronic failure. Continued operation would allow a driver to reach repair facilities rather than stopping the vehicle to call for roadside assistance or \"limping home.\"","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116664351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor","authors":"Christopher LaFrieda, Engin Ipek, José F. Martínez, R. Manohar","doi":"10.1109/DSN.2007.100","DOIUrl":"https://doi.org/10.1109/DSN.2007.100","url":null,"abstract":"Aggressive CMOS scaling will make future chip multiprocessors (CMPs) increasingly susceptible to transient faults, hard errors, manufacturing defects, and process variations. Existing fault-tolerant CMP proposals that implement dual modular redundancy (DMR) do so by statically binding pairs of adjacent cores via dedicated communication channels and buffers. This can result in unnecessary power and performance losses in cases where one core is defective (in which case the entire DMR pair must be disabled), or when cores exhibit different frequency/leakage characteristics due to process variations (in which case the pair runs at the speed of the slowest core). Static DMR also hinders power density/thermal management, as DMR pairs running code with similar power/thermal characteristics are necessarily placed next to each other on the die. We present dynamic core coupling (DCC), an architectural technique that allows arbitrary CMP cores to verify each other's execution while requiring no static core binding at design time or dedicated communication hardware. Our evaluation shows that the performance overhead of DCC over a CMP without fault tolerance is 3% on SPEC2000 benchmarks, and is within 5% for a set of scalable parallel scientific and data mining applications with up to eight threads (16 processors). Our results also show that DCC has the potential to significantly outperform existing static DMR schemes.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128349311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying the Effectiveness of Mobile Phone Virus Response Mechanisms","authors":"E. V. Ruitenbeek, T. Courtney, W. Sanders, F. Stevens","doi":"10.1109/DSN.2007.78","DOIUrl":"https://doi.org/10.1109/DSN.2007.78","url":null,"abstract":"Viruses that infect smartphones are emerging as a new front in the fight against computer viruses. In this paper, we model the propagation of mobile phone viruses in order to study their impact on the dependability of mobile phones. We propose response mechanisms and use the models to obtain insight on the effectiveness of these virus mitigation techniques. In particular, we consider the effects of multimedia messaging system (MMS) viruses that spread by sending infected messages to other phones. The virus model is implemented using the Mobius software tool and is highly parameterized, enabling representation of a wide range of potential MMS virus behavior. Using the model, we present the results of four illustrative MMS virus scenarios simulated with and without response mechanisms. By measuring the propagation rate and the extent of virus penetration in the simulation phone population, we quantitatively compare the effectiveness of mobile phone virus response mechanisms.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121461106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Reliability Modeling of RAID Storage Systems","authors":"J. Elerath, M. Pecht","doi":"10.1109/DSN.2007.41","DOIUrl":"https://doi.org/10.1109/DSN.2007.41","url":null,"abstract":"A flexible model for estimating reliability of RAID storage systems is presented. This model corrects errors associated with the common assumption that system times to failure follow a homogeneous Poisson process. Separate generalized failure distributions are used to model catastrophic failures and usage dependent data corruptions for each hard drive. Catastrophic failure restoration is represented by a three-parameter Weibull, so the model can include a minimum time to restore as a function of data transfer rate and hard drive storage capacity. Data can be scrubbed as a background operation to eliminate corrupted data that, in the event of a simultaneous catastrophic failure, results in double disk failures. Field-based times to failure data and mathematical justification for a new model are presented. Model results have been verified and predict between 2 and 1,500 times as many double disk failures as that estimated using the current mean time to data loss method.","PeriodicalId":405751,"journal":{"name":"37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121664598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}