G. Ostrouchov, T. Naughton, C. Engelmann, G. Vallee, S. L. Scott
2009 5th IEEE International Conference on E-Science Workshops. DOI: 10.1109/ESCIW.2009.5407992 (https://doi.org/10.1109/ESCIW.2009.5407992)
Nonparametric multivariate anomaly analysis in support of HPC resilience
Large-scale computing systems provide great potential for scientific exploration. However, the complexity that accompanies these enormous machines raises challenges for both users and operators. The effective use of such systems is often hampered by failures encountered when running applications on systems containing tens of thousands of nodes and hundreds of thousands of compute cores capable of yielding petaflops of performance. In systems of this size, failure detection is complicated and root-cause diagnosis is difficult. This paper describes our recent work on identifying anomalies in monitoring data and system logs to provide further insight into machine status, runtime behavior, failure modes, and failure root causes. It discusses the details of an initial prototype that gathers the data and uses statistical techniques for analysis.