Online Monitoring System for Performance Fault Detection

2014 IEEE International Parallel & Distributed Processing Symposium Workshops. Pub Date: 2014-05-19. DOI: 10.1109/IPDPSW.2014.165

R. Gioiosa, Gokcen Kestor, D. Kerbyson


Citations: 3

Abstract

To achieve exaFLOPS performance within a contained power budget, next-generation supercomputers will feature hundreds of millions of components operating at low and near-threshold voltages. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect any run of a scientific application to be free of performance faults. We believe there is a need for a new generation of lightweight performance and debugging tools that can be used online, even during production runs of parallel applications, and that can identify performance anomalies while the application executes. In this work we propose the design and implementation of a monitoring system that continuously inspects the evolution of running applications and the health of the system. To achieve minimum runtime overhead while maintaining the desired level of flexibility, we propose a decoupled approach in which accurate monitoring is performed at kernel level, while performance anomaly disambiguation and corrective actions are performed at user level. We evaluate our monitoring system on a 32-core AMD Interlagos compute node. First, we show that the runtime overhead of the monitoring system is negligible (0-2%). Then we show how our system can be used to precisely identify performance faults in two different scenarios: in the first, we inject OS noise; in the second, we simulate the execution of a data analytics application next to a scientific simulation.
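
The user-level half of the decoupled approach can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which metrics the kernel-level monitor exports or how anomalies are disambiguated. The sketch assumes the kernel side delivers one progress sample per interval (for example, an instructions-per-cycle value), and the names `AnomalyDetector`, `detect`, `window`, and `tolerance` are hypothetical. A sample is flagged when it falls more than a fixed fraction below the median of recent healthy samples.

```python
from collections import deque
from statistics import median


class AnomalyDetector:
    """User-level consumer of per-interval samples (hypothetical design)."""

    def __init__(self, window=8, tolerance=0.2):
        # Recent samples considered healthy; used as the baseline.
        self.history = deque(maxlen=window)
        # Allowed fractional drop below the baseline (assumed value).
        self.tolerance = tolerance

    def observe(self, sample):
        """Return True if the sample is well below the recent baseline."""
        anomalous = False
        # Only judge samples once the baseline window is full.
        if len(self.history) == self.history.maxlen:
            baseline = median(self.history)
            anomalous = sample < baseline * (1.0 - self.tolerance)
        # Keep anomalous samples out of the baseline so it is not dragged down.
        if not anomalous:
            self.history.append(sample)
        return anomalous


def detect(samples, window=8, tolerance=0.2):
    """Return the indices of the samples flagged as anomalous."""
    detector = AnomalyDetector(window, tolerance)
    return [i for i, s in enumerate(samples) if detector.observe(s)]
```

Calling `detect` on a trace of steady samples with occasional dips returns the positions of the dips. Keeping only the cheap sample collection in the kernel and leaving thresholds and policy in a user-level component like this one is one plausible reading of how the decoupling could preserve flexibility.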

