{"title":"Rebooting Computing: The Challenges for Test and Reliability","authors":"A. Bosio, Ian O’Connor, G. Rodrigues, F. Kastensmidt, E. Vatajelu, G. D. Natale, L. Anghel, S. Nagarajan, M. Fieback, S. Hamdioui","doi":"10.1109/DFT.2019.8875270","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875270","url":null,"abstract":"Today's computer architectures and semiconductor technologies face major challenges that make them incapable of delivering the features (such as computer efficiency) required by emerging applications. Alternative architectures are under investigation in order to continue delivering sustainable benefits to society at affordable cost for the foreseeable future. These architectures not only change the traditional computing paradigm (e.g., in terms of programming models, compilers, and circuit design), but also raise new challenges and directions for how they should be tested to guarantee the required quality and reliability levels. This paper highlights the major open questions regarding test and reliability of three emerging computing paradigms: approximate computing, computation-in-memory, and neuromorphic computing.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133336705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Low Capture Power Oriented X-filling Method Using Partial MaxSAT Iteratively","authors":"Toshinori Hosokawa, Hiroshi Yamazaki, Kenichiro Misawa, Masayoshi Yoshimura, Yuki Hirama, Masayuki Arai","doi":"10.1109/DFT.2019.8875434","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875434","url":null,"abstract":"High launch-induced switching activity can cause high power dissipation when the response to a test vector is captured by flip-flops (FFs) in at-speed scan testing, resulting in excessive IR drop. Since excessive IR drop significantly increases path delay, and thus might result in timing errors, such testing induces unnecessary yield loss in the deep sub-micron era. It is known that test modification methods using X-identification and X-filling are effective in reducing power dissipation in the capture cycle. Conventional low capture power oriented X-filling methods assign logic values to unspecified bits in test cubes to reduce the number of transitions on FFs. However, our goal is to reduce the number of transitions on internal signal lines. In this paper, we propose a low capture power oriented X-filling method that iteratively uses a Partial MaxSAT solver to reduce the number of transitions on as many internal signal lines as possible. Experimental results show that our proposed method reduced the numbers of capture-unsafe test vectors and unsafe faults compared with conventional methods.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"133 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120981780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"State Encoding with Stochastic Numbers for Transient Fault Tolerant Linear Finite State Machines","authors":"H. Ichihara, Y. Maeda, T. Iwagaki, Tomoo Inoue","doi":"10.1109/DFT.2019.8875383","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875383","url":null,"abstract":"Stochastic computing (SC) has attractive characteristics, compared with deterministic (or general binary) computing, such as smaller area of the implemented circuits, higher fault tolerance and so on. This study focuses on the transient fault tolerance of SC circuits with linear finite state machines (linear FSMs). To improve the transient fault tolerability of linear-FSM-based SC circuits, we propose a scheme for encoding the states of the FSM with stochastic numbers (SNs). Moreover, we discuss approximating state transition of the FSM so as to reduce the area overhead. The proposed SC circuits are modeled as Markov processes to clarify their behaviors when any transient fault occurs. Experimental results clarify the improvement in the fault tolerability of the SC circuits based on the proposed state encoding with SNs.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126050204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
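The stochastic-number (SN) representation mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' state-encoding scheme: it only shows the standard unipolar SN encoding, where a value is the fraction of 1s in a bit stream, and why one transient bit flip perturbs the value by only 1/n. All function names here are hypothetical and chosen for illustration.

```python
import random


def encode_sn(p, n, rng):
    """Encode probability p as a unipolar stochastic number of n bits."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def decode_sn(bits):
    """Decode a stochastic number: the fraction of 1s in the stream."""
    return sum(bits) / len(bits)


def flip_bit(bits, index):
    """Model a transient fault as a single bit flip at `index`."""
    faulty = list(bits)
    faulty[index] ^= 1
    return faulty


def binary_flip_error(bit, width):
    """Value error when bit `bit` of a `width`-bit binary fraction flips."""
    return 2 ** bit / 2 ** width


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for repeatability only
    n = 1024
    sn = encode_sn(0.75, n, rng)
    faulty = flip_bit(sn, 10)
    # One flipped bit changes the count of 1s by exactly one.
    print("SN error after one flip:", abs(decode_sn(faulty) - decode_sn(sn)))
    # Flipping the most significant bit of a 10-bit binary fraction.
    print("binary MSB flip error:", binary_flip_error(9, 10))
```

In this sketch every bit of the stream carries equal weight, so the error from one flip is bounded by 1/n, whereas in a weighted binary encoding the error depends on which bit is hit.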
{"title":"On the Criticality of Caches in Fault-Tolerant Processors for Space","authors":"Stefano Di Mascio, A. Menicucci, E. Gill, G. Furano, C. Monteleone","doi":"10.1109/DFT.2019.8875424","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875424","url":null,"abstract":"This paper analyzes the contribution of caches to failures at processor level due to soft errors. In order to do this, approximated methodologies to estimate the percentage of the total Sensitive Area (SA) of a processor for each unit during early design exploration are proposed. Then, to identify the most vulnerable units, a metric called Relative Soft Error Vulnerability (RSEV) is defined. The analysis shows that caches are the most vulnerable units of state-of-the-art processors and that, even when considering higher-frequency and more complex pipelines representative of next-generation processors for space applications, the final in-orbit failure rate is dominated by failures caused by upsets in cache arrays. Even when protecting memory arrays with information redundancy, the large fraction of upsets occurring in caches is potentially the biggest threat to processor availability and reliability, especially if errors are modelled with invalid assumptions and are not properly handled when detected.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114901094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of Heavy Ion and Proton Irradiation on a SLC NAND Flash Memory","authors":"Lucas Matana Luza, Alexandre Besser, V. Gupta, A. Javanainen, A. Mohammadzadeh, L. Dilillo","doi":"10.1109/DFT.2019.8875475","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875475","url":null,"abstract":"Space applications frequently use flash memories for mass data storage. However, the technology applied in the memory array and peripheral circuitry is not inherently radiation tolerant. This work introduces the results of radiation test campaigns with heavy ions and protons on a SLC NAND Flash. Static tests showed different failure types. Single event upsets and raw error cross sections are presented, as well as an evaluation of the occurrences of the events. Characterization of effects on the embedded data registers was also performed.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116996882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting SEUs in Noisy Digital Imagers with small pixels","authors":"G. Chapman, Rohan Thomas, Klinsmann J. Coelho Silva Meneses, Bifei Huang, Hao Yang, I. Koren, Z. Koren","doi":"10.1109/DFT.2019.8875486","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875486","url":null,"abstract":"Camera sensors are susceptible to the same transient (non-permanent) errors that occur in standard digital semiconductors, known as Single Event Upsets (SEUs). These result from the charge deposited by cosmic ray particles on the semiconductor. In a camera sensor, SEUs manifest themselves as one or more brighter pixels in a dark-frame image during long exposure times. Since the value of brighter pixels is related directly to the deposited charge, SEU analysis of digital imagers provides essential information about the nature and amount of charge deposited by particle hits, their occurrence rate, and the charge spread area. In this paper we describe an experimental approach to collect this information from pixels ranging in size from 7 µm (DSLR cameras) down to 1.2 µm (cell phone cameras). High gain (ISO) images allow us to detect lower energy SEUs but at the cost of a noisier background. The smaller pixels (1.2 µm) are more sensitive to lower energy SEUs, but have considerably noisier background levels. It is important to observe the SEU information over a range of gains (ISOs) and pixel sizes, to obtain the energy and spatial distribution of the SEUs, which is valuable for understanding the nature of SEUs in other circuits. The problem is that SEUs, by their transient nature, appear randomly in both time and location in a series of images. It is important to separate those from the noisy imager random excursions above the background level. We implement a new algorithm that is more effective in separating SEUs from random noise by leveraging thousands of images to obtain the noise distribution of each individual pixel.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128336174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
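The idea of using each pixel's own noise distribution, as described in the abstract above, can be sketched as follows. The abstract does not specify the authors' algorithm, so this is an assumed, simplified stand-in: a per-pixel mean-plus-k-sigma threshold computed over a stack of dark frames. The function names, the threshold factor `k`, and the synthetic frames are all hypothetical.

```python
from statistics import mean, pstdev


def detect_seu_candidates(frames, k=5.0):
    """Flag samples that exceed their own pixel's mean + k * std.

    frames: list of equally sized 2-D lists (dark frames).
    Returns a list of (frame_index, row, col) tuples.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    hits = []
    for r in range(rows):
        for c in range(cols):
            # Time series of this one pixel across all frames.
            series = [frame[r][c] for frame in frames]
            threshold = mean(series) + k * pstdev(series)
            for i, value in enumerate(series):
                if value > threshold:
                    hits.append((i, r, c))
    return hits


def make_dark_frames(n=40):
    """Synthetic 2x2 dark frames: one quiet pixel, one noisy pixel, two constant."""
    frames = []
    for i in range(n):
        quiet = 10 + (i % 2)              # alternates 10, 11
        noisy = 10 if i % 2 == 0 else 30  # alternates 10, 30
        frames.append([[quiet, noisy], [10, 10]])
    frames[7][0][0] = 60  # injected bright event on the quiet pixel
    return frames


if __name__ == "__main__":
    print(detect_seu_candidates(make_dark_frames()))
```

In this synthetic data the noisy pixel regularly reaches 30 without being flagged, because its own spread is large, while the injected event on the quiet pixel is flagged; a single global threshold could not separate the two. Note that the event itself inflates the pixel's estimated mean and spread, so with few frames a robust estimator (for example, median-based) may be preferable.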
{"title":"A Comprehensive Evaluation of the Effects of Input Data on the Resilience of GPU Applications","authors":"Fritz G. Previlon, Charu Kalra, D. Kaeli, P. Rech","doi":"10.1109/DFT.2019.8875269","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875269","url":null,"abstract":"While GPUs are being aggressively deployed in a growing number of computing domains, their resilience to transient faults remains a subject of concern. To gain a better understanding of the inherent vulnerability of GPU applications to transient faults, researchers perform extensive fault injection experiments. However, the conclusions reached based on the results of these fault injection experiments tend to be dependent on the specific input used during the experiments. The dependence of program resilience on changes in program input has not been thoroughly studied for GPU workloads. This paper addresses this issue, presenting extensive analysis on the effects of changes in program input and the resulting GPU reliability. Our work extends and challenges previous studies which reported that input data values do not affect reliability. Our analysis demonstrates that input sizes, as well as biased input values (input with a small set of dominant values) can have a significant impact on application reliability. For applications studied, we can expect a change of as much as 30% in the probability for a fault to cause a failure. Furthermore, we provide guidance on how to predict changes in resilience without repeating exhaustive fault injection experiments.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114436785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Cluster Sampling and ACE analysis to improve fault-injection based reliability evaluation of GPU-based systems","authors":"Alessandro Vallero, S. Carlo","doi":"10.1109/DFT.2019.8875392","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875392","url":null,"abstract":"Computing capability demand has grown massively in recent years. Modern GPU chips are designed to deliver extreme performance for graphics and for data-parallel general purpose computing workloads (GPGPU computing) as well. Many GPGPU applications require high reliability, thus reliability evaluation has become a crucial step during their design. State-of-the-art techniques to assess the reliability of a system are fault injection and ACE analysis. The former can produce accurate results but at the cost of very long evaluation times, while the latter is very fast but less accurate. In this paper we introduce a new sampling methodology based on cluster sampling that enables the exploitation of ACE analysis to accelerate the fault injection process. In our experiments we demonstrate that state-of-the-art fault injection techniques, which generate random faults according to a uniform distribution, are outperformed by the proposed sampling technique, thus enabling several advantages in terms of accuracy and evaluation time. To quantify the introduced benefits we analyzed the micro-architecture reliability of an AMD Southern Islands GPU in the presence of single-bit upsets affecting the vector register file for 6 benchmarks. One of the most important achievements is that considering all the benchmarks, on average, we are one order of magnitude faster/more accurate than uniform-sampling-based techniques in case of non-exhaustive fault injection campaigns, while more than two orders of magnitude in case of exhaustive campaigns.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130648261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scatter Scrubbing: A Method to Reduce SEU Repair Time in FPGA Configuration Memory","authors":"M. Mousavi, H. Pourshaghaghi, H. Corporaal, Akash Kumar","doi":"10.1109/DFT.2019.8875431","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875431","url":null,"abstract":"SRAM-based FPGAs are widely used in many critical systems in which dependability is an essential factor. However, SRAM-based FPGAs are sensitive to Single Event Upsets (SEUs), especially when they are used in space. Scrubbing is an effective technique to protect FPGA Configuration Memory (CM) against SEUs. One major hurdle in read-back scrubbing techniques is that they suffer from long Mean Time To Repair (MTTR). In this paper, we propose scatter scrubbing, a new method that reduces MTTR by exploiting the locality of SEU-sensitive bits in CM. It is based on 1) splitting FPGA CM into several partitions based on how critical the CM bits are for proper operation of the FPGA circuit, and 2) deriving a smart schedule for scrubbing the partitions. Finding an optimal partition and scheduling has non-polynomial complexity; therefore we rely on clever heuristics, especially for the first step. However, for small designs, we developed an accelerated brute-force method giving the optimal solution, which we can use as a reference. The experimental results show, for real FPGA designs, up to 64% reduction in MTTR compared to state-of-the-art techniques.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132152158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding of GPU Architectural Vulnerability for Deep Learning Workloads","authors":"Danny Santoso, Hyeran Jeon","doi":"10.1109/DFT.2019.8875404","DOIUrl":"https://doi.org/10.1109/DFT.2019.8875404","url":null,"abstract":"Deep learning has proved its effectiveness for various problems including object detection, speech recognition, stock price forecasting and so on. Among various accelerators, the GPU is one of the most favorable platforms for deep learning, providing faster neuron processing with massive parallelism. Recently, there have been extensive studies for better performance and power consumption of deep learning computing. However, reliability of deep learning has not been thoroughly studied yet. Though there have been a few studies that evaluated reliability of GPU architectures for general-purpose applications, there have not been many studies that showed the architectural vulnerability factor (AVF) of core algorithms and optimization techniques of deep learning workloads. In this paper, we evaluate the AVF of GPU architectures while running various deep learning workloads and provide in-depth analysis by comparing the AVF of deep learning workloads and the other GPU applications. We also provide the reliability impact of various optimization techniques of deep learning workloads.","PeriodicalId":415648,"journal":{"name":"2019 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115951513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}