Harnessing Soft Computations for Low-Budget Fault Tolerance

D. Khudia, S. Mahlke
{"title":"Harnessing Soft Computations for Low-Budget Fault Tolerance","authors":"D. Khudia, S. Mahlke","doi":"10.1109/MICRO.2014.33","DOIUrl":null,"url":null,"abstract":"A growing number of applications from various domains such as multimedia, machine learning and computer vision are inherently fault tolerant. However, for these soft workloads, not all computations are fault tolerant (e.g., A loop trip count). In this paper, we propose a compiler-based approach that takes advantage of soft computations inherent in the aforementioned class of workloads to bring down the cost of software-only transient fault detection. The technique works by identifying a small subset of critical variables that are necessary for correct macro-operation of the program. Traditional duplication and comparison are used to protect these variables. For the remaining variables and temporaries that only affect the micro-operation of the program, strategic expected value checks are inserted into the code. Intuitively, a computation-chain result near the expected value is either correct or close enough to the correct result so that it does not matter for non-critical variables. Overall, the proposed solution has, on average, only 19.5% performance overhead and reduces the number of silent data corruptions from 15% down to 7.3% and user-visible silent data corruptions from 3.4% down to 1.2% in comparison to an unmodified application. This unacceptable silent data corruption rate is even lower than a traditional full duplication scheme that has, on average, 57% overhead.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"24 1","pages":"319-330"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"53","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MICRO.2014.33","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 53

Abstract

A growing number of applications from various domains such as multimedia, machine learning and computer vision are inherently fault tolerant. However, for these soft workloads, not all computations are fault tolerant (e.g., A loop trip count). In this paper, we propose a compiler-based approach that takes advantage of soft computations inherent in the aforementioned class of workloads to bring down the cost of software-only transient fault detection. The technique works by identifying a small subset of critical variables that are necessary for correct macro-operation of the program. Traditional duplication and comparison are used to protect these variables. For the remaining variables and temporaries that only affect the micro-operation of the program, strategic expected value checks are inserted into the code. Intuitively, a computation-chain result near the expected value is either correct or close enough to the correct result so that it does not matter for non-critical variables. Overall, the proposed solution has, on average, only 19.5% performance overhead and reduces the number of silent data corruptions from 15% down to 7.3% and user-visible silent data corruptions from 3.4% down to 1.2% in comparison to an unmodified application. This unacceptable silent data corruption rate is even lower than a traditional full duplication scheme that has, on average, 57% overhead.
利用软计算实现低预算容错
越来越多来自多媒体、机器学习和计算机视觉等各个领域的应用都具有固有的容错性。然而,对于这些软工作负载,并不是所有的计算都是容错的(例如,环路行程计数)。在本文中,我们提出了一种基于编译器的方法,该方法利用了上述工作负载类别中固有的软计算来降低仅软件瞬时故障检测的成本。该技术的工作原理是识别程序的正确宏观操作所必需的一小部分关键变量。传统的复制和比较用于保护这些变量。对于仅影响程序的微观操作的剩余变量和临时变量,将在代码中插入策略期望值检查。直观地说,接近期望值的计算链结果要么是正确的,要么是足够接近正确的结果,因此对于非关键变量来说,这无关紧要。总的来说,与未修改的应用程序相比,所建议的解决方案平均只有19.5%的性能开销,并且将静默数据损坏的数量从15%降低到7.3%,将用户可见的静默数据损坏从3.4%降低到1.2%。这种不可接受的静默数据损坏率甚至低于传统的完全复制方案,后者的平均开销为57%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信