{"title":"A Runtime Fault-Tolerant Routing Scheme for Partially Connected 3D Networks-on-Chip","authors":"A. Coelho, A. Charif, N. Zergainoh, R. Velazco","doi":"10.1109/DFT.2018.8602971","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602971","url":null,"abstract":"Three-dimensional Networks-on-Chip (3D-NoC) have emerged as an effective solution to the scalability and latency issues in modern complex System-On-Chips. Through-Silicon Via (TSV) is usually adopted as a viable technology enabling vertical connection among NoC layers. However, TSV-based architectures typically exhibit high vulnerability to transient and permanent faults, calling for robust routing solutions capable of sustaining operation under unpredictable failure patterns. In this paper, we introduce a complete routing solution that guarantees 100% packet delivery under an unconstrained set of runtime and permanent vertical link failures. This scheme features a baseline fully-connected low-latency deadlock-free routing algorithm, and a runtime mechanism to dynamically and progressively reconfigure the network without any packet loss. Simulation results demonstrate the effectiveness of our approach in terms of performance and reliability when compared with the state-of-the-art. Furthermore, the hardware synthesis performed using a commercial 28nm technology library shows a reasonable area and power overhead with respect to the non-fault-tolerant baseline.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126500985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the Resilience of Parallel Applications","authors":"Mark Wilkening, Fritz G. Previlon, D. Kaeli, S. Gurumurthi, Steven E. Raasch, Vilas Sridharan","doi":"10.1109/DFT.2018.8602987","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602987","url":null,"abstract":"Reliability is a significant design constraint for supercomputers and large-scale data centers. Modeling the effects of faults on applications targeted to such systems allows system architects and software designers to provision resilience features that improve fidelity of results and reduce runtimes. In this paper, we propose mechanisms to improve existing techniques to model the effect of transient faults on realistic applications. First, we extend the existing Program Vulnerability Factor (PVF) metric to model multi-threaded applications. Then we demonstrate how to measure the multi-threaded PVF of an application in simulation and introduce the ability to account for software detection of hardware faults, differentiating faults that cause detected, uncorrected errors (DUE) from faults that cause silent data corruption (SDC).","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131592504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance-Based and Aging-Aware Resource Allocation for Concurrent GPU Applications","authors":"Zois-Gerasimos Tasoulas, Ryan Guss, Iraklis Anagnostopoulos","doi":"10.1109/DFT.2018.8602850","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602850","url":null,"abstract":"GPUs are an important part of the effort to overcome performance thresholds and unlock the true potential of computing, as they offer increased computational capabilities and are cost efficient. Until now, GPUs have been designed to execute one application at a time, so the field of concurrent GPU applications has not been exhaustively explored. When multiple applications that belong to different types, e.g., compute or memory intensive, are executed on the same platform concurrently, significant performance degradation and imbalances in terms of component aging may occur. These imbalances can lead to weak system reliability, further performance degradation and acceleration of failure time. In this paper, we propose a resource allocation algorithm that mitigates the aging imbalances without inserting overhead during the execution, limiting aging imbalance among Streaming Multiprocessors (SMs) to a standard deviation of 0.4%. Additionally, the proposed algorithm improves SM allocation for each application, achieving up to 33% higher throughput.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116688998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Method to Model Statistical Path Delays for Accurate Defect Coverage","authors":"Pavan Kumar Javvaji, S. Tragoudas","doi":"10.1109/DFT.2018.8602962","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602962","url":null,"abstract":"The statistical delay of a path is traditionally modeled as a Gaussian random variable assuming that the path is always sensitized by a test pattern. Its sensitization in various circuit instances varies among its test patterns and the pattern induced delay is non-Gaussian. It is modeled using probability mass functions. The defect coverage is improved by test pattern selection using machine learning. Experimental results demonstrate accuracy in defect coverage when comparing to existing methods.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122227783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"State Recovery for Coarse-Grain TMR Designs in FPGAs Using Partial Reconfiguration","authors":"Markus Schütz, A. Steininger, F. Huemer, J. Lechner","doi":"10.1109/DFT.2018.8602984","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602984","url":null,"abstract":"The operation of field-programmable gate arrays (FPGAs) in harsh environments like space entails the need for suitable fault-tolerance techniques, of which Triple-Modular Redundancy (TMR) is most commonly deployed. While TMR is undoubtedly effective in masking faults, state recovery remains a problematic issue: Fine-grain TMR allows safe recovery, but incurs prohibitive area and performance penalties. In contrast, coarse-grain TMR has little overhead, but cannot safely provide recovery without roll-back or reset. We use the dynamic reconfiguration feature of modern FPGAs to augment an initially coarse-grain TMR with the ability to temporarily load a fine-grain TMR design for forward state recovery. Therefore, we can seamlessly resume correct (fully redundant) operation in case of data as well as configuration faults that occurred in the FPGA. As a proof of concept, the paper presents a showcase design and discusses distinctive properties of this new approach.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124793558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of Voltage and Temperature Variations on the Electrical Masking Capability of Sub-65 nm Combinational Logic Circuits","authors":"Semiu A. Olowogemo, W. H. Robinson, D. Limbrick","doi":"10.1109/DFT.2018.8602975","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602975","url":null,"abstract":"Single Event Transients (SETs) induced from radiation strikes on an integrated circuit (IC) can be masked electrically by logic gates while propagating through the circuit towards a storage element (e.g., flip-flop). With the continuous scaling of CMOS technology, there are simultaneous reductions in voltage, cell size, and internal capacitances that impact the properties of the gates. The combined impact causes a reduction in the electrical masking capability of the gates. The reduction in electrical masking means that transients are more likely to reach the storage elements. In addition, variations in voltage and temperature could enhance the propagation of transients towards the storage elements. This paper describes the effects of temperature and voltage variations on the electrical masking of sub-65 nm combinational logic circuits. The worst-case temperature increases the SET pulsewidth by 57.6%. The worst-case voltage increases the SET pulsewidth by 51.2%. The pulses are therefore less likely to be masked electrically.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"361 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115935128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Dynamic Device Authentication Based on Lorenz Chaotic Systems","authors":"Lake Bu, Hai Cheng, M. Kinsy","doi":"10.1109/DFT.2018.8602986","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602986","url":null,"abstract":"Chaotic systems, such as Lorenz systems or logistic functions, are known for their rapid divergence property. Even the smallest change in the initial condition will lead to vastly different outputs. This property renders the short-term behavior, i.e., output values, of these systems very hard to predict. Because of this divergence feature, Lorenz systems are often used in cryptographic applications, particularly in key agreement protocols and encryption. Yet, these chaotic systems do exhibit long-term deterministic behaviors, i.e., they fit into a known shape over time. In this work, we propose a fast dynamic device authentication scheme that leverages both the divergence and convergence features of the Lorenz systems. In the scheme, a device proves its legitimacy by showing authentication tags belonging to a predetermined trajectory of a given Lorenz chaotic system. The security of the proposed technique resides in the fact that the short-range function output values are hard for an attacker to predict, but easy for a verifier to validate because the function is deterministic. In addition, in a multi-verifier scenario such as a mobile phone switching among base stations, the device does not have to re-initiate a separate authentication procedure each time. Instead, it just needs to prove the consistency of its chaotic behavior in an iterative manner, making the procedure very efficient in terms of execution time and computing resources.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116980197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Postprocessing Procedure for Reducing the Faulty Switching Activity of a Low-Power Test Set","authors":"I. Pomeranz","doi":"10.1109/DFT.2018.8602967","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602967","url":null,"abstract":"Low-power test generation procedures reduce the switching activity during functional capture cycles of scan-based tests in order to avoid overtesting of delay faults. The switching activity that these procedures address is the one in the fault-free circuit. Recently it was shown that excessive switching activity in faulty circuits can potentially cause test escapes. To avoid such situations, this paper describes a postprocessing procedure that reduces the switching activity of a low-power test set in faulty circuits. The main challenge that this procedure needs to address is the large number of faulty circuits for which the switching activity may be excessive. This challenge is addressed in this paper by reducing the fault-free switching activity in order to create a safety margin for an increased faulty switching activity. The safety margin is computed for every test individually. Experimental results for benchmark circuits demonstrate the ability of the procedure to eliminate excessive faulty switching activity for low-power test sets.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117175641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Fault Detection in Nano Programmable Logic Arrays","authors":"P. Junsangsri, F. Lombardi","doi":"10.1109/DFT.2018.8602985","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602985","url":null,"abstract":"This paper presents a new method for testing nano programmable logic arrays on a go-nogo basis; the basic configuration of an array made of passive and active interconnect resources (lines and switches) on two connected planes (AND and OR) is analyzed under a comprehensive multiple fault model. This model is applicable to production testing at nano manufacturing and considers faults (such as stuck-at and bridging faults) in the passive interconnect line structure as well as programming faults in the active resources (switching or crosspoint faults). The proposed method achieves full coverage in fault detection by configuring the array multiple times using a four-step procedure; as the complexity of testing such a chip is largely dependent on the number of configuration rounds (also often referred to as programming phases) that the chip must undergo, at production the proposed method achieves a substantial reduction in test time compared with previous techniques. Different from previous techniques that have a complexity as a function of array size (i.e. quadratic with the dimension of the planes in the array), it is shown that the proposed technique has a complexity linear with the largest dimension of a plane in the nano array. Simulation results are provided to show that 100% detection is achieved and that, for detection, the average number of configuration rounds is significantly less than the upper bound predicted by the presented theory.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126658951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MATS**: An On-Line Testing Approach for Reconfigurable Embedded Memories","authors":"Ludovica Bozzoli, L. Sterpone","doi":"10.1109/DFT.2018.8602934","DOIUrl":"https://doi.org/10.1109/DFT.2018.8602934","url":null,"abstract":"Modern Field Programmable Gate Arrays (FPGAs) embed dedicated blocks for memories (BRAMs), digital signal processing (DSPs) and hardwired microprocessors merged with the reconfigurable logic array. This trend, coupled with Error Correction Code (ECC) mechanisms and Dynamic Partial Reconfiguration (DPR), makes these devices ideal candidates for mission-critical applications where high reliability is a strict requirement. Therefore, efficient in-field testing has become a major concern. Unfortunately, typical on-line memory testing approaches are not fully optimized for the reconfigurable scenario. In fact, a suitable fault model should be considered in order to enhance the fault coverage and reduce the test redundancy. In this work, we propose the MATS** algorithm, which is able to reduce the execution time and optimize the fault coverage with respect to the most popular embedded memory March Tests. Furthermore, MATS** is highly suitable for execution, even partially, in the brief time slots available within the device mission. Experimental results show that our approach is around 30% faster than state-of-the-art solutions while achieving the optimal fault coverage.","PeriodicalId":297244,"journal":{"name":"2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"502 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131479558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}