Production-Run Software Failure Diagnosis via Adaptive Communication Tracking

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). Pub Date: 2016-06-18. DOI: 10.1145/3007787.3001175

Mohammad Mejbah Ul Alam, A. Muzahid

Pages: 354-366

Citations: 7

Abstract

Software failure diagnosis techniques work either by sampling events at production-run time or by applying bug detection algorithms. Some of these techniques require the failure to be reproduced multiple times. Those that do not are not adaptive enough when the execution platform, environment, or code changes. We propose ACT, a diagnosis technique for production-run failures that uses the machine intelligence of neural hardware. ACT learns invariants (e.g., data communication invariants) on the fly using the neural hardware and records any potential violation of them. Because ACT learns invariants on the fly, it can adapt to any change in execution setting or code. Because it records only the potentially violated invariants, the postprocessing phase can pinpoint the root cause fairly accurately without needing to observe the failure again. ACT works seamlessly for many sequential and concurrency bugs. The paper provides a detailed design and implementation of ACT in a typical multiprocessor system. It uses a three-stage pipeline for partially configurable one-hidden-layer neural networks. We have evaluated ACT on a variety of programs from popular benchmarks as well as open-source programs. ACT diagnoses failures caused by 16 bugs from these programs with accurate ranking. Compared to existing learning-based and sampling-based approaches, ACT has better diagnostic ability. For the default configuration, ACT has an average execution overhead of 8.2%.
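The idea of learning data communication invariants with a one-hidden-layer network and recording only the potential violations can be illustrated with a small software sketch. This is not the ACT hardware design: the abstract does not specify the input encoding, training procedure, or log format, so everything below is an assumption made for illustration. In particular, `TinyNet`, `encode`, `learn`, and `monitor` are hypothetical names, a communication is modeled as a (writer, reader) pair of small integer instruction IDs, and negative examples are synthesized by random sampling.

```python
import math
import random


class TinyNet:
    """One-hidden-layer sigmoid network trained by online gradient descent."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    @staticmethod
    def _sigmoid(z):
        z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def _forward(self, x):
        h = [self._sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = self._sigmoid(sum(w * hj for w, hj in zip(self.w2, h)) + self.b2)
        return h, y

    def score(self, x):
        """Return the estimated probability that the communication is valid."""
        return self._forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        h, y = self._forward(x)
        d_out = y - target  # cross-entropy gradient at the output pre-activation
        for j, hj in enumerate(h):
            d_hid = d_out * self.w2[j] * hj * (1.0 - hj)  # uses pre-update w2[j]
            self.w2[j] -= lr * d_out * hj
            self.b1[j] -= lr * d_hid
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d_hid * xi
        self.b2 -= lr * d_out


def encode(writer, reader, bits=4):
    """Assumed encoding: low `bits` bits of each ID, least significant first."""
    return [float((v >> k) & 1) for v in (writer, reader) for k in range(bits)]


def learn(net, valid_pairs, epochs=200, ids=16, seed=1):
    """Train on observed pairs (target 1) and random unseen pairs (target 0)."""
    rng = random.Random(seed)
    valid = set(valid_pairs)
    for _ in range(epochs):
        for writer, reader in valid_pairs:
            net.train_step(encode(writer, reader), 1.0)
            fake = (rng.randrange(ids), rng.randrange(ids))
            if fake not in valid:  # assumed negative-sampling scheme
                net.train_step(encode(*fake), 0.0)


def monitor(net, events, threshold=0.5):
    """Record only the communications the network considers likely invalid."""
    log = []
    for writer, reader in events:
        s = net.score(encode(writer, reader))
        if s < threshold:
            log.append((writer, reader, s))
    return log


if __name__ == "__main__":
    net = TinyNet(n_in=8, n_hidden=6)
    seen = [(1, 2), (3, 4), (5, 6)]
    learn(net, seen)
    for entry in monitor(net, seen + [(9, 14)]):
        print(entry)
```

Keeping the log restricted to low-scoring communications mirrors the abstract's point that postprocessing only needs the potentially violated invariants; how well such a small network separates valid from invalid pairs depends on the encoding and training data, which are assumptions here.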
