{"title":"用于减少静默数据损坏的低成本程序级检测器","authors":"S. Hari, S. Adve, Helia Naeimi","doi":"10.1109/DSN.2012.6263960","DOIUrl":null,"url":null,"abstract":"With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques is relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.","PeriodicalId":236791,"journal":{"name":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"137","resultStr":"{\"title\":\"Low-cost program-level detectors for reducing silent data corruptions\",\"authors\":\"S. Hari, S. Adve, Helia Naeimi\",\"doi\":\"10.1109/DSN.2012.6263960\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques is relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.\",\"PeriodicalId\":236791,\"journal\":{\"name\":\"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"137\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSN.2012.6263960\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2012.6263960","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 137
摘要
随着技术的发展,暂态故障对硬件可靠性的威胁日益严重。商品系统必须通过非常低成本的弹性解决方案来应对这些现场故障。软件级症状检测技术已经成为一种有前途的低成本和有效的解决方案。虽然目前这些技术的用户可见的静默数据损坏(SDC)率相对较低,但消除或显着降低SDC率对于这些解决方案的实际成功至关重要。识别和理解导致SDCs的程序部分对于以具有成本效益的方式减少(或消除)SDCs至关重要。本文提供了对我们所研究的六个应用程序中产生超过90%的sdc的代码段的详细分析。这种分析促进了程序级检测器的开发,这些检测器可以捕获大量累积或长时间活跃的错误,从而摊销检测成本。这些低成本检测器显著降低了对基于冗余的技术的依赖,并在性能与可靠性权衡曲线上提供了更实用、更灵活的选择点。例如,对于基线SDC率平均降低90%、99%或100%,我们的方法与单独冗余的平均执行开销分别为12% vs 30%、19% vs 43%和27% vs 51%。
Low-cost program-level detectors for reducing silent data corruptions
With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques is relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.