基于神经网络的概率断言，用于检测并行程序中的静默数据损坏

2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS) Pub Date : 2016-09-01 DOI:10.1109/SRDS.2016.016

T. E. Thomas, Anmol J. Bhattad, S. Mitra, S. Bagchi

{"title":"基于神经网络的概率断言，用于检测并行程序中的静默数据损坏","authors":"T. E. Thomas, Anmol J. Bhattad, S. Mitra, S. Bagchi","doi":"10.1109/SRDS.2016.016","DOIUrl":null,"url":null,"abstract":"The size and complexity of supercomputing clusters are rapidly increasing to cater to the needs of complex scientific applications. At the same time, the feature size and operating voltage level of the internal components are decreasing. This dual trend makes these machines extremely vulnerable to soft errors or random bit flips. For complex parallel applications, these soft errors can lead to silent data corruption which could lead to large inaccuracies in the final computational results. Hence, it is important to determine the presence and severity of such errors early on, so that proper counter measures can be taken. In this paper, we introduce a tool called Sirius, which can accurately identify silent data corruptions based on the simple insight that there exist spatial and temporal locality within most variables in such programs. Spatial locality means that values of the variable at nodes that are close by in a network sense, are also close numerically. Similarly, temporal locality means that the values change slowly and in a continuous manner with time. Sirius uses neural networks to learn such locality patterns, separately for each critical variable, and produces probabilistic assertions which can be embedded in the code of the parallel program to detect silent data corruptions. We have implemented this technique on parallel benchmark programs - LULESH and CoMD. Our evaluations show that Sirius can detect silent errors in the code with much higher accuracy compared to previously proposed methods. Sirius detected 98% of the silent data corruptions with a false positive rate of less than 0.02 as compared to the false positive rate 0.06 incurred by the state of the art acceleration based prediction (ABP) based technique.","PeriodicalId":165721,"journal":{"name":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Sirius: Neural Network Based Probabilistic Assertions for Detecting Silent Data Corruption in Parallel Programs\",\"authors\":\"T. E. Thomas, Anmol J. Bhattad, S. Mitra, S. Bagchi\",\"doi\":\"10.1109/SRDS.2016.016\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The size and complexity of supercomputing clusters are rapidly increasing to cater to the needs of complex scientific applications. At the same time, the feature size and operating voltage level of the internal components are decreasing. This dual trend makes these machines extremely vulnerable to soft errors or random bit flips. For complex parallel applications, these soft errors can lead to silent data corruption which could lead to large inaccuracies in the final computational results. Hence, it is important to determine the presence and severity of such errors early on, so that proper counter measures can be taken. In this paper, we introduce a tool called Sirius, which can accurately identify silent data corruptions based on the simple insight that there exist spatial and temporal locality within most variables in such programs. Spatial locality means that values of the variable at nodes that are close by in a network sense, are also close numerically. Similarly, temporal locality means that the values change slowly and in a continuous manner with time. Sirius uses neural networks to learn such locality patterns, separately for each critical variable, and produces probabilistic assertions which can be embedded in the code of the parallel program to detect silent data corruptions. We have implemented this technique on parallel benchmark programs - LULESH and CoMD. Our evaluations show that Sirius can detect silent errors in the code with much higher accuracy compared to previously proposed methods. Sirius detected 98% of the silent data corruptions with a false positive rate of less than 0.02 as compared to the false positive rate 0.06 incurred by the state of the art acceleration based prediction (ABP) based technique.\",\"PeriodicalId\":165721,\"journal\":{\"name\":\"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SRDS.2016.016\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SRDS.2016.016","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

摘要

为了满足复杂科学应用的需要，超级计算集群的规模和复杂性正在迅速增加。同时，内部元件的特征尺寸和工作电压水平也在不断减小。这种双重趋势使这些机器极易受到软错误或随机位翻转的影响。对于复杂的并行应用程序，这些软错误可能导致静默数据损坏，从而导致最终计算结果出现很大的不准确性。因此，及早确定此类错误的存在和严重程度非常重要，以便采取适当的应对措施。在本文中，我们介绍了一个名为Sirius的工具，该工具可以根据在此类程序中的大多数变量中存在空间和时间局域性的简单见解，准确识别静默数据损坏。空间局部性意味着在网络意义上相邻的节点上的变量值在数值上也很接近。类似地，时间局部性意味着值随时间缓慢而连续地变化。Sirius使用神经网络来学习这种局部模式，分别针对每个关键变量，并产生概率断言，这些断言可以嵌入到并行程序的代码中，以检测无声的数据损坏。我们已经在并行基准程序LULESH和CoMD上实现了这种技术。我们的评估表明，与以前提出的方法相比，Sirius可以以更高的精度检测代码中的静默错误。Sirius检测到98%的无声数据损坏，假阳性率低于0.02，而目前基于加速预测(ABP)技术的假阳性率为0.06。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Sirius: Neural Network Based Probabilistic Assertions for Detecting Silent Data Corruption in Parallel Programs

The size and complexity of supercomputing clusters are rapidly increasing to cater to the needs of complex scientific applications. At the same time, the feature size and operating voltage level of the internal components are decreasing. This dual trend makes these machines extremely vulnerable to soft errors or random bit flips. For complex parallel applications, these soft errors can lead to silent data corruption which could lead to large inaccuracies in the final computational results. Hence, it is important to determine the presence and severity of such errors early on, so that proper counter measures can be taken. In this paper, we introduce a tool called Sirius, which can accurately identify silent data corruptions based on the simple insight that there exist spatial and temporal locality within most variables in such programs. Spatial locality means that values of the variable at nodes that are close by in a network sense, are also close numerically. Similarly, temporal locality means that the values change slowly and in a continuous manner with time. Sirius uses neural networks to learn such locality patterns, separately for each critical variable, and produces probabilistic assertions which can be embedded in the code of the parallel program to detect silent data corruptions. We have implemented this technique on parallel benchmark programs - LULESH and CoMD. Our evaluations show that Sirius can detect silent errors in the code with much higher accuracy compared to previously proposed methods. Sirius detected 98% of the silent data corruptions with a false positive rate of less than 0.02 as compared to the false positive rate 0.06 incurred by the state of the art acceleration based prediction (ABP) based technique.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS)

自引率

0.00%

发文量