{"title":"Toward efficient check-pointing and rollback under on-demand SBST in chip multi-processors","authors":"M. Skitsas, C. Nicopoulos, M. Michael","doi":"10.1109/IOLTS.2015.7229842","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229842","url":null,"abstract":"In-field on-line testing techniques have recently been proposed for detecting permanent faults caused by wear-out/aging-related defects that manifest during the lifetime of a system. Selective Software-Based Self-Testing (SBST) is one such paradigm, focusing primarily on the recently stressed functional units of a multicore system at a sub-core granularity, in an attempt to reduce the application performance penalty caused by periodically testing the entire system. In this work, we extend our O/S-enabled framework DeamonGuard for on-demand (selective) SBST to support fault recovery. Towards this goal, we propose an efficient checkpointing and rollback recovery mechanism which, upon fault detection, can restore the system to the most recent valid state and resume normal operation with the faulty core disabled, thereby leading to a healthy (but degraded) system. The work in this paper concentrates on reducing the number of stored checkpoints required when testing at a sub-core granularity, and on minimizing the recovery penalty of such a framework. 
We evaluate and demonstrate the overhead of the proposed recovery mechanism, and our results indicate a practical reduction in the number of stored checkpoints as well as a significant improvement in recovery latency for the cases where the faults are correlated with the stressed units.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127525816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-time on-chip supply voltage sensor and its application to trace-based timing error localization","authors":"Miho Ueno, M. Hashimoto, T. Onoye","doi":"10.1109/IOLTS.2015.7229857","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229857","url":null,"abstract":"This paper presents an all-digital on-chip supply voltage sensor that captures one-shot voltage fluctuations every clock cycle. The proposed sensor was implemented on an ASIC in a 65 nm process and on an FPGA. The obtained voltage resolutions were 3.9 mV and 29 mV, respectively. This sensor is suitable for providing voltage information to a trace-based error localization system. We experimentally show that the proposed sensor facilitates error localization.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126847743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-tolerant system for catastrophic faults in AMR sensors","authors":"Andreina Zambrano, H. Kerkhoff","doi":"10.1109/IOLTS.2015.7229834","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229834","url":null,"abstract":"Anisotropic Magnetoresistance (AMR) angle sensors are widely used in automotive applications, which are considered safety-critical. Therefore, dependability is an important requirement, and fault-tolerant strategies must be used to guarantee the correct operation of the sensors even in case of failures. AMR sensors are configured with two Wheatstone bridges, where catastrophic (hard) as well as parametric faults can occur. Catastrophic faults are mainly related to the conditions of the bridge resistances. If a hard fault occurs at any of the resistances, the sensor must be taken out of operation because the angle can no longer be reliably computed. Previously proposed fault-tolerant systems are based on physical redundancy with hardware duplication, which usually implies higher production cost and can be restricted by the mechanical construction of the sensor. In contrast, the proposed fault-tolerant system is based on analytical redundancy to obtain the voltages required to calculate the angle. Results indicate that the fault-tolerant system can handle a catastrophic fault at any of the bridge resistances, which is especially useful in safety-critical applications. 
Future research will be focused on parametric faults.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123340491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Failure mitigation in linear, sesquilinear and bijective operations on integer data streams via numerical entanglement","authors":"M. A. Anam, Y. Andreopoulos","doi":"10.1109/IOLTS.2015.7229844","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229844","url":null,"abstract":"A new roll-forward technique is proposed that recovers from any single fail-stop failure in M integer data streams (M ≥ 3) when undergoing linear, sesquilinear or bijective (LSB) operations, such as scaling, additions/subtractions, inner or outer vector products and permutations. In the proposed approach, the M input integer data streams are linearly superimposed to form M numerically entangled integer data streams that are stored in place of the original inputs. A series of LSB operations can then be performed directly using these entangled data streams. The output results can be extracted from any M-1 entangled output streams by additions and arithmetic shifts, thereby guaranteeing robustness to a fail-stop failure in any single stream computation. Importantly, unlike other methods, the number of operations required for the entanglement, extraction and recovery of the results is linearly related to the number of the inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal on an Intel processor (Haswell architecture with AVX2 support) via convolution operations. Our analysis and experiments reveal that the proposed approach incurs only a 1.8% to 2.8% reduction in processing throughput in comparison to the failure-intolerant approach. This overhead is 9 to 14 times smaller than that of the equivalent checksum-based method. 
Thus, our proposal can be used in distributed systems and unreliable processor hardware, or safety-critical applications, where robustness against fail-stop failures becomes a necessity.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126419622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Workload characterization and prediction: A pathway to reliable multi-core systems","authors":"Monir Zaman, A. Ahmadi, Y. Makris","doi":"10.1109/IOLTS.2015.7229843","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229843","url":null,"abstract":"As a result of technology scaling, the power density of multi-core chips increases and leads to temperature hot-spots which accelerate device aging and chip failure. Moreover, intense efforts to reduce power consumption by employing low-power techniques decrease the reliability of new design generations. Traditionally, reactive thermal/power management techniques have been used to take appropriate action when the temperature reaches a threshold. However, these approaches do not always balance temperature and, as a result, may degrade system reliability. Therefore, to distribute temperature evenly across all cores, a proactive mechanism is needed to forecast future workload characteristics and the corresponding temperature, in order to make decisions before hot spots occur. Such proactive methods rely on an engine to precisely predict future workload characteristics. In this work, we first discuss the state-of-the-art methods for predicting workload dynamics and compare their performance. We then introduce a prediction method based on Support Vector Regression (SVR), which accurately predicts the workload behavior several steps ahead. To evaluate the effectiveness of our approach, we use several programs from the PARSEC benchmark suite on an UltraSPARC T1 processor running the Sun Solaris operating system and we extract architectural traces. Then, the extracted traces are used to generate power and thermal profiles for each core using the McPAT and Hot-Spot simulators. 
Our results show that the proposed method forecasts workload dynamics and power very accurately and outperforms previous prediction techniques.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129965239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault modeling and testing of through silicon via interconnections","authors":"V. Gerakis, Leonidas Katselas, A. Hatzopoulos","doi":"10.1109/IOLTS.2015.7229824","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229824","url":null,"abstract":"This study analyzes the case of a defective TSV that has cracked at the point where an impurity or a void was originally located. A lumped analytical electrical circuit that models the behavior of this defect is proposed. TSV fault modeling assists in developing new test methods that would improve the reliability of 3D ICs. The structure is simulated using a commercial 3D resistance, capacitance and inductance extraction tool. A test method that determines the possible characteristics of the defect is presented.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"12 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114029171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient multilevel formal analysis and estimation of design vulnerability to Single Event Transients","authors":"Ghaith Bany Hamad, O. Mohamed, Y. Savaria","doi":"10.1109/IOLTS.2015.7229818","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229818","url":null,"abstract":"The progressive shrinking of device size in advanced technologies leads to miniaturization and performance improvements. However, ultra-deep sub-micron technologies are more vulnerable to soft errors. Error analysis of a complex system with a sufficiently large sample of vulnerable nodes takes a large amount of time. In this paper, we propose RASVAS, a hierarchical statistical method to model, analyze, and estimate the behavior of a system in the presence of Single Event Transients (SETs) modeled at different abstraction levels. Gate-level propagation tables are developed to abstract SET propagation conditions and probabilities from gate-level models. At RTL, these tables are utilized to model the underlying probabilistic behavior as Markov Decision Process (MDP) models. Experimental results demonstrate that RASVAS is orders of magnitude faster than contemporary techniques and also handles designs as large as 256-bit adders while maintaining accuracy.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128100688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flip-flop SEU reduction through minimization of the temporal vulnerability factor (TVF)","authors":"A. Evans, Enrico Costenaro, A. Bramnik","doi":"10.1109/IOLTS.2015.7229851","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229851","url":null,"abstract":"The effect of soft errors in flip-flops remains a concern in large designs. Many radiation-hardened flip-flops exist; however, these are custom cells and are not available to all designers. In this paper, we explore a technique for the mitigation of flip-flop soft errors through an optimization of the temporal vulnerability factor (TVF). By selectively inserting delay on the input or output of flip-flops, the probability of propagation of single event upsets (SEUs) can be minimized. The selection of where to insert the added delay is formulated as a linear programming problem. In this way, the flip-flop soft-error rate (SER) can be minimized subject to overhead constraints.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126918010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New byte error correcting codes with simple decoding for reliable cache design","authors":"Lake Bu, M. Karpovsky, Zhen Wang","doi":"10.1109/IOLTS.2015.7229859","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229859","url":null,"abstract":"Most cache designs support single or double bit-level error detection and correction in cache lines. However, a single error may distort a whole byte or even more, resulting in much higher decoding complexity than that of bit-level distortions. Therefore, this paper proposes a new group testing based error correcting code (GTB code) for byte-level error locating and correcting, which provides much stronger protection for memories. This new class of non-binary GTB codes is generated from binary superimposed codes. Since it is encoded and decoded by binary matrices, no complicated Galois Field computations in GF(Q) such as multiplications and inversions are involved. Compared with popular non-binary error correcting codes (ECC) such as Hamming, Reed-Solomon and interleaved codes, the GTB codes achieve up to a 42% reduction of the decoding complexity (hardware cost × latency) for single-byte error correction, and up to a 98% reduction for double-byte error correction. Moreover, given the length of codewords (e.g. 512 bits for cache lines), as the size of each Q-ary digit (byte) increases, the savings increase.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122253869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A call for cross-layer and cross-domain reliability analysis and management","authors":"D. Alexandrescu, A. Evans, Enrico Costenaro, M. Glorieux","doi":"10.1109/IOLTS.2015.7229821","DOIUrl":"https://doi.org/10.1109/IOLTS.2015.7229821","url":null,"abstract":"For many applications, reliability, availability and trustability are key factors, requiring careful design to meet the end users' expectations. The complex ASICs, which are now ubiquitous, often embed tens of millions of flip-flops, hundreds of megabits of embedded SRAM, and hundreds of millions of combinatorial cells. These designs integrate IP from multiple providers and are implemented in advanced process technologies, making it challenging to evaluate their reliability. Initiatives such as RIIF (Reliability Information Interchange Format) allow the formalization, specification and modeling of extra-functional, reliability properties for technology, circuits and systems. Continuing these efforts, we propose RAFT (Reliability Architect Framework and Toolset), a reliability-centric framework including reliability data and models, methodologies and tools allowing system reliability exploration and optimization using mathematical models and high-level tools. 
The proposed approach can be combined with performance management methodologies aiming at reducing the engineering effort devoted to reliability analysis and improvement.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129466199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}