{"title":"A Dependability Solution for Homogeneous MPSoCs","authors":"Xiao Zhang, H. Kerkhoff","doi":"10.1109/PRDC.2011.16","DOIUrl":"https://doi.org/10.1109/PRDC.2011.16","url":null,"abstract":"Nowadays highly dependable electronic devices are demanded by many safety-critical applications. Dependability attributes such as reliability and availability/maintainability of a many-processor system-on-chip (MPSoC) should already be examined at the design phase. Design for dependability approaches such as using available fault-free processor-cores and introducing a dependability manager infrastructural IP for self-test and evaluation can greatly enhance the dependability of an MPSoC. This is further supported by subsequent software-based repair. Design choices such as test fault coverage, test and repair time are examined to optimize the dependability attributes. Utilizing existing infrastructures like a network-on-chip (NoC) and tile-wrappers are needed to ensure a test can be performed at application run-time. An example design following the proposed design for dependability approach is shown. The MPSoC has been processed and measurement results have validated the proposed dependability approach.","PeriodicalId":254760,"journal":{"name":"2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115237645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Horst Schirmeier, J. Neuhalfen, Ingo Korb, O. Spinczyk, M. Engel
{"title":"RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers","authors":"Horst Schirmeier, J. Neuhalfen, Ingo Korb, O. Spinczyk, M. Engel","doi":"10.1109/PRDC.2011.20","DOIUrl":"https://doi.org/10.1109/PRDC.2011.20","url":null,"abstract":"Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime. To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use. We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.","PeriodicalId":254760,"journal":{"name":"2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127926076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Augmenting Functional Broadside Tests for Transition Fault Coverage with Bounded Switching Activity","authors":"I. Pomeranz","doi":"10.1109/PRDC.2011.14","DOIUrl":"https://doi.org/10.1109/PRDC.2011.14","url":null,"abstract":"For most purposes, it is sufficient for a low-power test set to ensure that the power dissipation during test application will not exceed that possible during functional operation. This is guaranteed for the fast functional capture cycles of functional broadside tests. This paper describes a procedure that generates broadside test sets with bounded switching activity during fast functional capture cycles based on the maximum switching activity of a functional broadside test set, targeting transition faults in full-scan circuits. The procedure first generates a compact functional broadside test set. It then extends the test set in steps in order to increase its fault coverage to that of an arbitrary broadside test set (a test set that includes non-functional broadside tests). During these steps, the maximum switching activity of the functional broadside test set is used for bounding the switching activity.","PeriodicalId":254760,"journal":{"name":"2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing","volume":"08 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122406062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fabian Oboril, M. Tahoori, V. Heuveline, D. Lukarski, Jan-Philipp Weiss
{"title":"Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers","authors":"Fabian Oboril, M. Tahoori, V. Heuveline, D. Lukarski, Jan-Philipp Weiss","doi":"10.1109/PRDC.2011.26","DOIUrl":"https://doi.org/10.1109/PRDC.2011.26","url":null,"abstract":"As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming pronounced. In this paper, we present an Algorithm-based Fault Tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugated Gradient method (CG) by taking the advantage of numerical defect correction. This method is \"pay as you go\", meaning that there is practically only a runtime overhead if errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime benefit of the proposed approach, good fault tolerance and no occurrence of silent data corruption.","PeriodicalId":254760,"journal":{"name":"2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127505033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Self-Stabilizing Synchronization Protocol for Arbitrary Digraphs: A Self-Stabilizing Distributed Clock Synchronization Protocol For Arbitrary Digraphs","authors":"M. Malekpour","doi":"10.1109/PRDC.2011.37","DOIUrl":"https://doi.org/10.1109/PRDC.2011.37","url":null,"abstract":"This paper presents a self-stabilizing distributed clock synchronization protocol in the absence of faults in the system. It is focused on the distributed clock synchronization of an arbitrary, non-partitioned digraph ranging from fully connected to 1-connected networks of nodes while allowing for differences in the network elements. This protocol does not rely on assumptions about the initial state of the system, other than the presence of at least one node, and no central clock or a centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. There is no theoretical limit on the maximum number of participating nodes. The only constraint on the behavior of the node is that the interactions with other nodes are restricted to defined links and interfaces. This protocol deterministically converges within a time bound that is a linear function of the self-stabilization period. We present an outline of a deductive proof of the correctness of the protocol. A bounded model of the protocol was mechanically verified for a variety of topologies. Results of the mechanical proof of the correctness of the protocol are provided. The model checking results have verified the correctness of the protocol as they apply to the networks with unidirectional and bidirectional links. In addition, the results confirm the claims of determinism and linear convergence. As a result, we conjecture that the protocol solves the general case of this problem. We also present several variations of the protocol and discuss that this synchronization protocol is indeed an emergent system.","PeriodicalId":254760,"journal":{"name":"2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127379955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}