Localizing transient faults using dynamic bayesian networks

Susmit Jha, Wenchao Li, S. Seshia
{"title":"Localizing transient faults using dynamic bayesian networks","authors":"Susmit Jha, Wenchao Li, S. Seshia","doi":"10.1109/HLDVT.2009.5340170","DOIUrl":null,"url":null,"abstract":"Transient faults are a major concern in today's deep sub-micron semiconductor technology. These faults are rare but they have been known to cause catastrophic system-level failures. Transient errors often occur due to physical effects on deployed systems and hence, diagnosis of transient errors must be performed over manufactured chips or systems assembled from black-box components where arbitrary instrumentation of the system is not possible and hence, the system state is only partially observable. Further, these systems are often composed of components that are third party IP which further adds opaqueness to the system. In this paper, we propose a probabilistic approach to localize transient faults in space and time for such partially observable systems. From a set of correct traces and a failure trace, we seek to locate the faulty component and the cycle of operation at which the fault occurred. Our technique uses correct system traces over monitored components of the system to learn a dynamic Bayesian network (DBN) summarizing the temporal dependencies across the monitored components. This DBN is augmented with different error hypotheses allowed by the fault model. The most probable explanation (MPE) among these hypotheses corresponds to the most likely location of the error. We evaluated the effectiveness of our technique on a set of ISCAS89 benchmarks and a router design used in on-chip networks in a multi-core design.","PeriodicalId":153879,"journal":{"name":"2009 IEEE International High Level Design Validation and Test Workshop","volume":"38 4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE International High Level Design Validation and Test Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HLDVT.2009.5340170","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Transient faults are a major concern in today's deep sub-micron semiconductor technology. These faults are rare but they have been known to cause catastrophic system-level failures. Transient errors often occur due to physical effects on deployed systems and hence, diagnosis of transient errors must be performed over manufactured chips or systems assembled from black-box components where arbitrary instrumentation of the system is not possible and hence, the system state is only partially observable. Further, these systems are often composed of components that are third party IP which further adds opaqueness to the system. In this paper, we propose a probabilistic approach to localize transient faults in space and time for such partially observable systems. From a set of correct traces and a failure trace, we seek to locate the faulty component and the cycle of operation at which the fault occurred. Our technique uses correct system traces over monitored components of the system to learn a dynamic Bayesian network (DBN) summarizing the temporal dependencies across the monitored components. This DBN is augmented with different error hypotheses allowed by the fault model. The most probable explanation (MPE) among these hypotheses corresponds to the most likely location of the error. We evaluated the effectiveness of our technique on a set of ISCAS89 benchmarks and a router design used in on-chip networks in a multi-core design.
基于动态贝叶斯网络的暂态故障定位
瞬态故障是当今深亚微米半导体技术的一个主要问题。这些故障很少见,但已知它们会导致灾难性的系统级故障。由于对已部署系统的物理影响,瞬态错误经常发生,因此,瞬态错误的诊断必须在制造芯片或由黑盒组件组装的系统上执行,其中不可能对系统进行任意仪器检测,因此,系统状态只能部分观察到。此外,这些系统通常由第三方IP组件组成,这进一步增加了系统的不透明性。在本文中,我们提出了一种概率方法来在空间和时间上定位这类部分可观测系统的暂态故障。从一组正确的跟踪和故障跟踪中,我们试图定位故障组件和故障发生时的操作周期。我们的技术在系统的被监视组件上使用正确的系统跟踪来学习动态贝叶斯网络(DBN),总结被监视组件之间的时间依赖性。该DBN扩展了故障模型允许的不同误差假设。这些假设中的最可能解释(MPE)对应于误差最可能出现的位置。我们在一组ISCAS89基准测试和用于多核设计的片上网络的路由器设计上评估了我们的技术的有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信