Fast Fault Injection and Sensitivity Analysis for Collective Communications

Kun Feng, Manjunath Gorentla Venkata, Dong Li, Xian-He Sun
2015 IEEE International Conference on Cluster Computing, published 2015-09-08. DOI: 10.1109/CLUSTER.2015.31 (https://doi.org/10.1109/CLUSTER.2015.31)
Citations: 3

Abstract

Collective communication operations, which are widely used in parallel applications for global communication and synchronization, are critical to application performance and scalability. However, how faulty collective communications affect an application, and how errors propagate between application processes, remains largely unexplored. One critical reason is the lack of a fast evaluation method for investigating the impact of faulty collective operations. Traditional random fault injection methods rely on a large number of fault injection tests to ensure statistical significance, and therefore require significant resources and time. Applied to collectives, these methods incur prohibitive evaluation cost. In this paper, we introduce a novel tool, the Fast Fault Injection and Sensitivity Analysis Tool (FastFIT), to conduct fast fault injection and characterize application sensitivity to faulty collectives. The tool achieves fast exploration by reducing the exploration space and predicting application sensitivity using Machine Learning (ML) techniques. These techniques are based on implicit correlations between MPI semantics, application context, critical application features, and application responses to faulty collective communications. Experimental results show that our approach reduces the fault injection points and tests by 97% for representative benchmarks (NAS Parallel Benchmarks (NPB)) and a realistic application (Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)) on a production supercomputer. Further, we statistically generalize the application sensitivity to faulty collective communications for these workloads, and present the correlation between application features and sensitivity.
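To make the notion of a "faulty collective" concrete, the following is a minimal sketch of a single-bit-flip injection into one rank's contribution to a sum reduction. It is not FastFIT's implementation: the collective is simulated in plain Python rather than run over MPI, and the function names (`flip_bit`, `simulated_allreduce_sum`, `inject_and_compare`), the bit-flip fault model, and the choice of rank and bit are all assumptions made for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE 754 double encoding of `value`."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


def simulated_allreduce_sum(contributions):
    """Stand-in for a sum all-reduce: every rank receives the same total (no real MPI)."""
    total = sum(contributions)
    return [total] * len(contributions)


def inject_and_compare(contributions, faulty_rank, bit):
    """Corrupt one rank's send value and return the absolute error seen by each rank."""
    golden = simulated_allreduce_sum(contributions)
    corrupted = list(contributions)
    corrupted[faulty_rank] = flip_bit(corrupted[faulty_rank], bit)
    observed = simulated_allreduce_sum(corrupted)
    return [abs(o - g) for o, g in zip(observed, golden)]


# Hypothetical injection point: rank 2, bit 62 (the top exponent bit).
errors = inject_and_compare([1.0, 2.0, 3.0, 4.0], faulty_rank=2, bit=62)
print(errors)
```

In this toy setting a fault in a single rank's input reaches every rank through the reduction, which is the kind of cross-process propagation the abstract refers to. An exhaustive campaign would repeat this over every rank, bit, and call site, which illustrates why the number of injection points grows quickly.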