Leveraging the short-term memory of hardware to diagnose production-run software failures

Joy Arulraj, Guoliang Jin, Shan Lu
{"title":"Leveraging the short-term memory of hardware to diagnose production-run software failures","authors":"Joy Arulraj, Guoliang Jin, Shan Lu","doi":"10.1145/2541940.2541973","DOIUrl":null,"url":null,"abstract":"Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once. This paper designs a low overhead, low latency, privacy preserving production-run failure diagnosis system based on two observations. First, short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, Last Branch Record (LBR), that records the last few taken branches to help diagnose sequential bugs. We then propose a simple hardware extension, Last Cache-coherence Record (LCR), to record the last few cache accesses with specified coherence states and hence help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR. Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes for 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"342 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2541940.2541973","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 24

Abstract

Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once. This paper designs a low overhead, low latency, privacy preserving production-run failure diagnosis system based on two observations. First, short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, Last Branch Record (LBR), that records the last few taken branches to help diagnose sequential bugs. We then propose a simple hardware extension, Last Cache-coherence Record (LCR), to record the last few cache accesses with specified coherence states and hence help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR. Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes for 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,
利用硬件的短期记忆来诊断生产运行的软件故障
软件错误导致的故障在生产运行中普遍存在,给最终用户造成了严重的损失。不幸的是,诊断生产运行故障是具有挑战性的。现有工作不能同时满足隐私、运行时开销、诊断能力和诊断延迟需求。本文设计了一个基于双观测的低开销、低延迟、保密性的生产运行故障诊断系统。首先,程序执行的短期记忆对于故障诊断通常是足够的,因为许多错误的传播距离很短。其次,维护执行的短期记忆比维护整个执行的记录要便宜得多。根据这些观察结果,我们首先确定一个现有的硬件单元,最后分支记录(Last Branch Record, LBR),它记录了最后几个分支,以帮助诊断顺序错误。然后,我们提出了一个简单的硬件扩展,最后缓存一致性记录(LCR),以记录具有指定一致性状态的最后几次缓存访问,从而帮助诊断并发错误。最后,我们设计了LBRA和LCRA,利用LBR和LCR自动定位故障根源。我们的评估使用了来自18个代表性开源软件的31个真实世界的顺序和并发错误失败。结果表明,仅使用16个记录条目,LBR和LCR就可以使我们的系统自动定位31个故障中的27个故障的根本原因,而运行时开销不到3%。由于我们的系统不依赖于抽样,
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信