{"title":"Hybrid, recursive, nested monitoring of control systems using Petri nets and particle filters","authors":"L. Zouaghi, A. Wagner, E. Badreddin","doi":"10.1109/DSNW.2010.5542616","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542616","url":null,"abstract":"In this paper we propose a recursive and nested hybrid monitoring and diagnosis architecture for systems with Recursive Nested Behaviour-based Control structure. Such systems consist of behavioural levels, which use several models on different levels of abstractions. In this architecture scheme, the monitors of each subsystem use recursively the output of the monitors of the next lower level in order to get an estimate of the global status of the system at each time and having the advantage of low dimensionality for each level. The proposed monitoring structure is based on hybrid Petri Nets and particle filters. The advantages of the approach are illustrated by an example for a Heating Control System.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115383583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pair and swap: An approach to graceful degradation for dependable chip multiprocessors","authors":"Masashi Imai, Tomohide Nagai, T. Nanya","doi":"10.1109/DSNW.2010.5542608","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542608","url":null,"abstract":"In this paper, we propose a processor-level fault tolerance technique called “Pair and Swap (P&S)” for a multi-core chip. In the P&S system, a 2n-cores-CMP (Chip Multiprocessor) which contains 2n processor cores composes n pairs. Two identical copies of a given task are executed on each pair of two processor cores and the results are compared repeatedly. If a fault is detected by a mismatch, partners of the mismatched pair are swapped with another pair and the mismatched task is re-executed from the latest checkpoint. Then, it is decided whether the fault is transient or permanent. If it is permanent, the faulty core is identified and isolated to reconfigure the entire system. P&S enables graceful degradation and tolerates both permanent and transient faults. We evaluate the performance of the proposed P&S and traditional triple module redundancy (TMR) using the Markov chains. The mean computation to failure of the P&S is about 1.4 times larger than that of dynamic TMR scheme.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123411286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Verification of soft error detection mechanism through fault injection on hardware emulation platform","authors":"Oscar Bailan, U. Rossi, Anne Wantens, J. Daveau, Salvatore Nappi, P. Roche","doi":"10.1109/DSNW.2010.5542611","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542611","url":null,"abstract":"In this paper we describe one of the verification activities performed on a dual core 32-bit System-on-Chip designed for Automotive Safety applications and the consequent implementation of a methodology to verify the functionality of one of the safety mechanisms of the device. The Safety standards recommend the usage of fault-injection techniques to give evidence of the failure robustness of the electronic devices designed for Functional Safety. In this case we verified the robustness of the SoC processing subsystem to the Single Event Upset through the usage of some hardware emulation platforms where the device RTL was mapped, properly instrumented to allow the modification of Flip-Flop status during application runtime, thus modeling the SEUs effects. The main novelty of our work is therefore the definition of a methodology to verify the robustness of a SoC to SEUs; additionally we show that the same methodology can be used also to perform thorough measurements of the SER masking effect on a System on Chip.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116040199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"See applications run and throughput jump: The case for redundant computing in HPC","authors":"R. Riesen, Kurt B. Ferreira, Jon Stearley","doi":"10.1109/DSNW.2010.5542625","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542625","url":null,"abstract":"For future parallel-computing systems with as few as twenty-thousand nodes we propose redundant computing to reduce the number of application interrupts. The frequency of faults in exascale systems will be so high that traditional checkpoint/restart methods will break down. Applications will experience interruptions so often that they will spend more time restarting and recovering lost work, than computing the solution. We show that redundant computation at large scale can be cost effective and allows applications to complete their work in significantly less wall-clock time. On truly large systems, redundant computing can increase system throughput by an order of magnitude.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128289500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAVE: Replicated antivirus engine","authors":"Carlos Silva, Paulo Sousa, P. Veríssimo","doi":"10.1109/DSNW.2010.5542598","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542598","url":null,"abstract":"Antivirus is a fundamental presence in every computer infrastructure nowadays. The exponential growth of Internet usage with increasing higher bandwidth led to situations where virus (as well as worms and other type of malicious content) had constant outbreaks with impressive amounts of infected computers across the entire world. Email has been the preferred choice for several of these malicious content outbreaks. This paper describes the design, implementation, and evaluation of RAVE, a Replicated AntiVirus Engine for email infrastructures. Based on fault/intrusion tolerance concepts, this system allows to increase the detection capability of anti-malware solutions for email infrastructures by having different engines working in parallel, allowing at the same time arbitrary faults in a predefined number of replicas.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115387816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A translation of State Machines to temporal fault trees","authors":"Nidhal Mahmud, Y. Papadopoulos, M. Walker","doi":"10.1109/DSNW.2010.5542620","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542620","url":null,"abstract":"State Machines (SMs) are increasingly being used to gain a better understanding of the failure behaviour of safety-critical systems. In dependability analysis, SMs are translated to other models, such as Generalized Stochastic Petri Nets (GSPNs) or combinatorial fault trees. The former does not enable qualitative analysis, whereas the second allows it but can lead to inaccurate or erroneous results, because combinatorial fault trees do not capture the temporal semantics expressed by SMs. In this paper, we discuss the problem and propose a translation of SMs to temporal fault trees using Pandora, a recent technique for introducing temporal logic to fault trees, thus preserving the significance of the temporal sequencing of faults and allowing full qualitative analysis. Since dependability models inform the design of condition monitoring and failure prevention measures, improving the representation and analysis of dynamic effects in such models can have a positive impact on proactive failure avoidance.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124148179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gate input reconfiguration for combating soft errors in combinational circuits","authors":"Warin Sootkaneung, K. Saluja","doi":"10.1109/DSNW.2010.5542610","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542610","url":null,"abstract":"Many techniques to relieve soft error problem, such as making the circuit larger, called upsizing, have been developed under tight limitation in circuit performance but they all call for a tradeoff between performance and soft error resilience. In this paper, we present a soft error reduction technique, called gate input reconfiguration, to combat soft errors in digital circuits without additional overhead. Substantiated by SPICE simulations, our device level experiments disclose that gate inputs and transistor positions in a gate have a profound impact on circuit probability of failure due to soft errors. The detailed study on soft error vulnerabilities of several types of logic gates lead us to develop a gate input reconfiguration technique in order to improve the reliability of large combinational circuits. This overhead-free technique rearranges gate input pins such that soft error rate of that gate is minimized. Experimental results reveal that the proposed technique provides considerable decrease in the probability of failure due to soft errors of benchmark circuits. We observed this decrease to be as much as 45% in some circuits. Next, we combine the use of gate input reconfiguration technique with upsizing technique to reduce the failure due to soft errors even further. The combination of these two techniques achieves very impressive reliability improvements.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125056316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CacheMind: Fast performance recovery using a virtual machine monitor","authors":"Kenichi Kourai","doi":"10.1109/DSNW.2010.5542614","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542614","url":null,"abstract":"The reboot of an operating system is a final but powerful recovery technique. However, the system performance is largely degraded just after the reboot due to losing the file cache. For fast performance recovery, we propose a new reboot mechanism called the warm-cache reboot. The warm-cache reboot preserves the file cache on main memory during the reboot and enables an operating system to restore the file cache after the reboot. A virtual machine monitor (VMM) underlying an operating system guarantees that the reused file cache is consistent with the corresponding files on disks. We have implemented the warm-cache reboot mechanism in the Xen VMM and the Linux operating system. From our experimental results, the warm-cache reboot decreases performance degradation just after the reboot. In addition, we confirmed that the file cache corrupted by faults was not reused.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133732383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive monitoring in microkernel OSs","authors":"Domenico Cotroneo, Domenico Di Leo, R. Natella","doi":"10.1109/DSNW.2010.5542619","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542619","url":null,"abstract":"The microkernel architecture has been investigated by both industries and the academia for the development of dependable Operating Systems (OSs). This work copes with a relevant issue for this architecture, namely unresponsive components because of deadlocks and infinite loops. In particular, a monitor sends heartbeat messages to a component that should reply within a timeout. The timeout choice is tricky, since it should be dynamically adapted to the load conditions of the system. Therefore, our approach is based on an adaptive heartbeat mechanism, in which the timeout is estimated from past response times. We implement and compare three estimation algorithms for the choice of the timeout in the context of the Minix 3 OS. From the analysis we derive useful guidelines for choosing the best algorithm with respect to system requirements.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128033630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurate fault prediction of BlueGene/P RAS logs via geometric reduction","authors":"Joshua Thompson, D. Dreisigmeyer, T. Jones, M. Kirby, Joshua Ladd","doi":"10.1109/DSNW.2010.5542626","DOIUrl":"https://doi.org/10.1109/DSNW.2010.5542626","url":null,"abstract":"This investigation presents two distinct and novel approaches for the prediction of system failures occurring in Oak Ridge National Laboratory's Blue Gene/P supercomputer. Each technique uses raw numeric and textual subsets of large data logs of physical system information such as fan speeds and CPU temperatures. This data is used to develop models of the system capable of sensing anomalies, or deviations from nominal behavior. Each algorithm predicted event log reported anomalies in advance of their occurrence and one algorithm did so without false positives. Both algorithms predicted an anomaly that did not appear in the event log. It was later learned that the fault missing from the log but predicted by both algorithms was confirmed to have occurred by the system administrator.","PeriodicalId":124206,"journal":{"name":"2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128053201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}