{"title":"性能故障检测在线监控系统","authors":"R. Gioiosa, Gokcen Kestor, D. Kerbyson","doi":"10.1109/IPDPSW.2014.165","DOIUrl":null,"url":null,"abstract":"To achieve the exaFLOPS performance within a contained power budget, next generation supercomputers will feature hundreds of millions of components operating at low- and near-threshold voltage. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect that any run of a scientific application will not experience some performance faults. We believe that there is need of a new generation of light-weight performance and debugging tools that can be used online even during production runs of parallel applications and that can identify performance anomalies during the application execution. In this work we propose the design and implementation of a monitoring system that continuously inspects the evolution of running applications and the health of the system. To achieve minimum runtime overhead while maintaining the desired level of flexibility, we propose a decoupled approach in which accurate monitoring is performed at kernel-level while performance anomaly disambiguation and corrective actions are performed at user-level. We evaluate our monitoring system on a 32-core AMD Interlagos compute node: First, we show that the runtime overhead of the monitoring system is negligible (0-2%). Then we show how our system can be used to precisely identify performance faults in two different scenarios. In the first, we inject OS noise while in the second we simulate the execution of a data analytics application next to a scientific simulation.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"96 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Online Monitoring System for Performance Fault Detection\",\"authors\":\"R. Gioiosa, Gokcen Kestor, D. Kerbyson\",\"doi\":\"10.1109/IPDPSW.2014.165\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"To achieve the exaFLOPS performance within a contained power budget, next generation supercomputers will feature hundreds of millions of components operating at low- and near-threshold voltage. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect that any run of a scientific application will not experience some performance faults. We believe that there is need of a new generation of light-weight performance and debugging tools that can be used online even during production runs of parallel applications and that can identify performance anomalies during the application execution. In this work we propose the design and implementation of a monitoring system that continuously inspects the evolution of running applications and the health of the system. To achieve minimum runtime overhead while maintaining the desired level of flexibility, we propose a decoupled approach in which accurate monitoring is performed at kernel-level while performance anomaly disambiguation and corrective actions are performed at user-level. We evaluate our monitoring system on a 32-core AMD Interlagos compute node: First, we show that the runtime overhead of the monitoring system is negligible (0-2%). Then we show how our system can be used to precisely identify performance faults in two different scenarios. In the first, we inject OS noise while in the second we simulate the execution of a data analytics application next to a scientific simulation.\",\"PeriodicalId\":153864,\"journal\":{\"name\":\"2014 IEEE International Parallel & Distributed Processing Symposium Workshops\",\"volume\":\"96 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-05-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE International Parallel & Distributed Processing Symposium Workshops\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2014.165\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2014.165","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Online Monitoring System for Performance Fault Detection
To achieve the exaFLOPS performance within a contained power budget, next generation supercomputers will feature hundreds of millions of components operating at low- and near-threshold voltage. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect that any run of a scientific application will not experience some performance faults. We believe that there is need of a new generation of light-weight performance and debugging tools that can be used online even during production runs of parallel applications and that can identify performance anomalies during the application execution. In this work we propose the design and implementation of a monitoring system that continuously inspects the evolution of running applications and the health of the system. To achieve minimum runtime overhead while maintaining the desired level of flexibility, we propose a decoupled approach in which accurate monitoring is performed at kernel-level while performance anomaly disambiguation and corrective actions are performed at user-level. We evaluate our monitoring system on a 32-core AMD Interlagos compute node: First, we show that the runtime overhead of the monitoring system is negligible (0-2%). Then we show how our system can be used to precisely identify performance faults in two different scenarios. In the first, we inject OS noise while in the second we simulate the execution of a data analytics application next to a scientific simulation.