Seyed Mohammad Seyedzadeh, R. Maddah, A. Jones, R. Melhem
Title: Leveraging ECC to Mitigate Read Disturbance, False Reads and Write Faults in STT-RAM
Venue: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Published: 2016-06-01
DOI: 10.1109/DSN.2016.28 (https://doi.org/10.1109/DSN.2016.28)
Citations: 18
Abstract
Designing reliable systems with Spin-Transfer Torque Random Access Memory (STT-RAM) has become a significant challenge as the technology's feature size scales down. A key contributor to this challenge is increasingly prominent read disturbance. However, techniques that address read disturbance are often considered in a vacuum, on the assumption that other concerns such as transient read errors (false reads) and write faults do not occur. This paper studies several techniques that leverage ECC to mitigate the persistent errors caused by read disturbance and write faults in STT-RAM while still accounting for the transient errors caused by false reads. In particular, we study three policies that enable better-than-conservative read disturbance mitigation. The first policy, write after error (WAE), uses ECC to detect errors and writes the data back to clear persistent errors. The second policy, write after persistent error (WAP), filters out false reads by reading a second time when an error is detected, which leads to a trade-off between write energy and read energy. The third policy, write after error threshold (WAT), leaves cells holding incorrect data in place (up to a threshold) as long as the number of errors is less than the ECC capability. To evaluate the effectiveness of these schemes and compare them with the simple, previously proposed scheme of writing after every read (WAR), we model the policies using Markov processes. This approach allows appropriate bit error rates to be determined in the presence of both persistent and transient errors, so that system reliability and the energy consumption of the different error correction approaches can be estimated accurately. Our evaluations show that each policy provides benefits in different error scenarios. Moreover, some approaches can reduce energy by 99.5% on average while achieving the same reliability as other approaches.
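The four write-back policies named in the abstract can be contrasted with a minimal sketch of their per-access decision logic. This is an illustration based only on the abstract's one-sentence descriptions, not the paper's model: the function `decide`, its parameters, and the `threshold` semantics (write back once the observed error count reaches the threshold) are assumptions, and the sketch ignores any additional disturbance that the second read in WAP might itself cause.

```python
from enum import Enum


class Policy(Enum):
    WAR = "write after every read"
    WAE = "write after error"
    WAP = "write after persistent error"
    WAT = "write after error threshold"


def decide(policy, persistent, transient_first, transient_second=0, threshold=1):
    """Return (reads_performed, write_back) for one access to an ECC word.

    All parameters are hypothetical names for this sketch:
    persistent        -- bit errors stored in the cells (read disturbance, write faults)
    transient_first   -- false-read errors seen only on the first read
    transient_second  -- false-read errors seen only on the second read (WAP)
    threshold         -- assumed WAT trigger, kept at or below the ECC capability
    """
    observed = persistent + transient_first
    if policy is Policy.WAR:
        # Conservative baseline: always restore the data after a read.
        return 1, True
    if policy is Policy.WAE:
        # Write back whenever ECC reports any error, persistent or transient.
        return 1, observed > 0
    if policy is Policy.WAP:
        if observed == 0:
            return 1, False
        # Re-read to filter out false reads; write only if errors remain.
        return 2, (persistent + transient_second) > 0
    if policy is Policy.WAT:
        # Tolerate errors in place until the assumed threshold is reached.
        return 1, observed >= threshold
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    # Hypothetical trace of (persistent, transient_first) errors per access.
    trace = [(0, 0), (0, 1), (1, 0), (2, 0)]
    for policy in Policy:
        reads = writes = 0
        for persistent, transient in trace:
            r, w = decide(policy, persistent, transient, threshold=2)
            reads += r
            writes += int(w)
        print(f"{policy.name}: reads={reads} writes={writes}")
```

Counting reads and writes this way mirrors the trade-off the abstract describes: WAP spends extra reads to avoid writes triggered by false reads, while WAT avoids writes by tolerating a bounded number of errors.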