Jianbo Dong, Lei Zhang, Yinhe Han, Guihai Yan, Xiaowei Li
{"title":"Variation-Aware Scheduling for Chip Multiprocessors with Thread Level Redundancy","authors":"Jianbo Dong, Lei Zhang, Yinhe Han, Guihai Yan, Xiaowei Li","doi":"10.1109/PRDC.2009.12","DOIUrl":"https://doi.org/10.1109/PRDC.2009.12","url":null,"abstract":"Thread-Level Redundancy in Chip Multiprocessors (TLR-CMP) is efficient for soft error tolerance. Process variation causes core-to-core (C2C) performance asymmetry across a chip, which should be taken into consideration for application scheduling. In this paper, two types of variations beyond C2C are introduced, i.e., inter-pair and intra-pair variation in TLR-CMP. Intra-pair performance asymmetry can affect the performance of applications differently. Based on the above observation, we first formalize variation-aware scheduling in TLR-CMP as a 0-1 programming problem, to maximize the system weighted throughput. An efficient scheduling algorithm, named IntraVarF&AppSen, is then proposed to tackle this problem, which can be proved to be optimal when the number of applications to be scheduled is equal to the number of core pairs. Simulation on a 64-core CMP shows 2.8%-4% improvement in weighted throughput compared to the prior VarF&AppIPC algorithm.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125163295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Test Vector Compression/Decompression Scheme Based on Logic Operation between Adjacent Bits (LOBAB) Coding","authors":"Huaguo Liang, Wenfa Zhan, Q. Luo, Cuiyun Jiang","doi":"10.1109/PRDC.2009.11","DOIUrl":"https://doi.org/10.1109/PRDC.2009.11","url":null,"abstract":"A new test vector compression/decompression scheme, Logic Operation between Adjacent Bits (LOBAB), is presented, which is based on a bitwise logic operation between each bit and its previous bit. It turns all kinds of series including continuous series, such as a series of all 0s and all 1s, and reversal series, such as a series of 01 and 10, into series of all 0s by logic operation between adjacent bits. On one hand, the two kinds of series, continuous series and reversal series, are both taken into account, which decreases the number of divisions of the original test data. On the other hand, all series are turned into series of all 0s, which eases the process of encoding and decoding. Compared with other known schemes, this scheme offers a high compression ratio and easy control and implementation. The performance of the algorithm is mathematically analyzed and its merits are experimentally confirmed on the larger examples of the ISCAS89 benchmark circuits.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"12 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131866319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quiescent Leader Election in Crash-Recovery Systems","authors":"M. Larrea, Cristian Martín","doi":"10.1109/PRDC.2009.58","DOIUrl":"https://doi.org/10.1109/PRDC.2009.58","url":null,"abstract":"This work addresses the leader election problem in distributed systems where processes can crash and recover. More precisely, it focuses on implementing the Omega failure detector class, which provides a leader election functionality, in the crash-recovery failure model. The concepts of quiescence and near-quiescence for an algorithm implementing Omega are defined. Depending on the use or not of stable storage, the property satisfied by unstable processes, i.e., those that crash and recover infinitely often, varies. Two algorithms implementing Omega are presented. In the first algorithm, which is quiescent and uses stable storage, eventually and permanently unstable processes agree on the leader with correct processes. In the second algorithm, which is near-quiescent and does not use stable storage, unstable processes agree on the leader with correct processes after receiving a first message from a correct process. An adaptation of this second algorithm that avoids the disagreement of unstable processes by providing instability awareness is also presented.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133167478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors","authors":"Brian T. Gold, B. Falsafi, J. Hoe","doi":"10.1109/PRDC.2009.39","DOIUrl":"https://doi.org/10.1109/PRDC.2009.39","url":null,"abstract":"Distributed shared-memory (DSM) multiprocessors provide a scalable hardware platform, but lack the necessary redundancy for mainframe-level reliability and availability. Chip-level redundancy in a DSM server faces a key challenge: the increased latency to check results among redundant components. To address performance overheads, we propose a checking filter that reduces the number of checking operations impeding the critical path of execution. Furthermore, we propose to decouple checking operations from the coherence protocol, which simplifies the implementation and permits reuse of existing coherence controller hardware. Our simulation results of commercial workloads indicate average performance overhead is within 4% (9% maximum) of tightly coupled DMR solutions.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134333137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Approach to Automated Redundancy Reduction for Test Sequences","authors":"Huai-kou Miao, Pan Liu, Jia Mei, Hong-wei Zeng","doi":"10.1109/PRDC.2009.23","DOIUrl":"https://doi.org/10.1109/PRDC.2009.23","url":null,"abstract":"The problem of redundancy among test sequences derived from different FSM-based test coverage criteria often emerges in practice, increasing the cost of software testing. To solve this problem, this paper presents a novel approach that uses string matching to eliminate redundancy among test sequences. Four types of redundancies of test sequences are described and the corresponding reduction rules are also designed. To ensure the effectiveness of redundancy reduction, a transformation rule to convert the ineffective test segments into the effective test sequences is proposed. A novel algorithm for redundancy reduction is then designed and implemented in Java. Finally, an example illustrates the approach. Compared with existing research on redundancy reduction, our approach not only eliminates most redundancies among test sequences, but also promotes the application of FSM-based test coverage criteria in practice.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117129942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliability Analysis of Single Bus Communication with Real-Time Requirements","authors":"M. Sebastian, R. Ernst","doi":"10.1109/PRDC.2009.10","DOIUrl":"https://doi.org/10.1109/PRDC.2009.10","url":null,"abstract":"Due to continuous technology downscaling, modern embedded real-time systems are becoming increasingly susceptible to errors. The usage of appropriate countermeasures is necessary to prevent a system failure. In this paper we present a new reliability estimation technique for such systems. As a key novelty, a formal analysis method is introduced that approximates the probability of failure of a priority driven bus during a period of time, enabling fast and accurate reliability calculation. It removes the major drawbacks of existing approaches, e.g. random-based Monte-Carlo simulation that requires long runtimes. Monte-Carlo simulation nevertheless serves as a reference method to demonstrate the accuracy of our approach by comparing analysis and simulation results. Finally, we consider the design of mixed-criticality systems which combine different safety requirements on a single component.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120921099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zapmem: A Framework for Testing the Effect of Memory Corruption Errors on Operating System Kernel Reliability","authors":"Roberto Jung Drebes, T. Nanya","doi":"10.1109/PRDC.2009.53","DOIUrl":"https://doi.org/10.1109/PRDC.2009.53","url":null,"abstract":"While monolithic operating system kernels are composed of many subsystems, during runtime they all share a common address space, making fault propagation a serious issue. The code quality of each subsystem is different, as OS development is a complex task commonly divided by different groups with different degrees of expertise. Since the memory space into which this code runs is shared, the occurrence of bugs or errors in one of the subsystems may propagate to others and affect general OS reliability. It is necessary, then, to test how errors propagate between the different kernel subsystems and how they affect reliability. This work presents a simple new technique to inject memory corruption faults and Zapmem, a fault injection tool which uses such technique to test the effect on reliability from memory corruption of statically allocated kernel data. Zapmem associates the runtime memory addresses to the corresponding high level (source code) memory structure definitions, which indicate which kernel subsystem allocated that memory region, and the tool has minimal intrusiveness, as our technique does not require kernel instrumentation. The efficacy of our approach and preliminary results are also presented.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121285396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Computing and Predicting Architectural Vulnerability Factor of Microprocessor Structures","authors":"Songjun Pan, Yu Hu, Xiaowei Li","doi":"10.1109/PRDC.2009.61","DOIUrl":"https://doi.org/10.1109/PRDC.2009.61","url":null,"abstract":"Soft Errors have emerged as a key challenge to microprocessor design. Traditional soft error tolerance techniques (such as redundant multithreading and instruction duplication) can achieve high fault coverage but at the cost of significant performance degradation. Prior research reports that soft errors can be masked at the architecture level, and the degree of such masking, named as architecture vulnerability factor (AVF), can vary significantly across workloads and individual structures, hence strict redundant execution may not be necessary for soft error tolerance. In this work, we exploit the AVF varying feature to adaptively tune reliability and performance. We present an infrastructure to online compute and predict AVF for three microprocessor structures (IQ, ROB, and LSQ), guiding when the protection scheme should be activated to improve reliability. Experimental results show that our method can efficiently compute the AVF for different structures independent of hardware configurations. The average differences between our method and a prior offline AVF computing method are 0.10, 0.01, and 0.039 for IQ, ROB, and LSQ, respectively.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132047738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Process Migration Method for MPI Applications","authors":"Tiantian Liu, Zhongmin Ma, Zhonghong Ou","doi":"10.1109/PRDC.2009.46","DOIUrl":"https://doi.org/10.1109/PRDC.2009.46","url":null,"abstract":"Though a lot of research has been done on fault tolerance for MPI applications, process migration has not gained widespread use because of the complexity of the requirement that the location of a migrated process be made known to every other process in the MPI application. In this paper, we present a novel and effective process migration method for MPI applications. We implement a prototype called LAM/Migration, which is based on LAM/MPI + BLCR, to provide transparent process migration for MPI applications; the migration mechanism is built into LAM/MPI. With our method, all processes in an MPI application, including mpirun and the MPI processes, can be migrated to any user-specified set of spare nodes in the cluster in case of node failure. Performance evaluation results showed that the checkpoint overhead is similar to plain LAM/MPI + BLCR, and the migration method is feasible and promising for overcoming node failures in large-scale parallel computing. By using LAM/Migration, high availability and reliability of parallel computation can be achieved.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131994541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-Tolerant Event Detection Using Two Thresholds in Wireless Sensor Networks","authors":"S. Yim, Yoon-Hwa Choi","doi":"10.1109/PRDC.2009.59","DOIUrl":"https://doi.org/10.1109/PRDC.2009.59","url":null,"abstract":"This paper presents a fault-tolerant event detection scheme for wireless sensor networks. Unlike others using a single threshold, the proposed scheme employs two thresholds to cope with the trade-off between event detection accuracy and false alarm rate. An extremely low false alarm rate can be achieved by using a high threshold, while high detection accuracy is obtained by using a low threshold. A sensor node is determined to be in an event region if it passes the high threshold. It can also be determined to be in the region, as long as it passes the low threshold and has a neighbor that passes the high threshold. The dissemination of a local decision to neighboring nodes is made only once to minimize the communication overhead. A moving average filter with a threshold is employed to reduce the impact of transient faults in sensor readings. Computer simulation shows that the proposed scheme also achieves acceptable performance in detecting event regions without computational overhead.","PeriodicalId":356141,"journal":{"name":"2009 15th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125407240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}