Nitin Sukhija, Elizabeth Bautista, Drake Butz, C. Whitney
Towards Anomaly Detection for Monitoring Power Consumption in HPC Facilities
Proceedings of the 14th International Conference on Management of Digital EcoSystems
Published: 2022-10-19
DOI: 10.1145/3508397.3564826 (https://doi.org/10.1145/3508397.3564826)
Citations: 1
Abstract
Given the increasing complexity and heterogeneity of today's computing infrastructure, power efficiency and fault tolerance remain the top challenges in operating a High Performance Computing (HPC) facility. Many recent research efforts focus on monitoring solutions that collect, correlate, and analyze health events and metrics data from computing infrastructure in order to identify not only normal events but also anomalous ones, helping to reduce downtime and power consumption while meeting the critical needs of a computational center and its users. In this preliminary work, we present an anomaly detection methodology integrated with the Operations Monitoring and Notification Infrastructure (OMNI) data warehouse at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL). The methodology implements anomaly detection algorithms for identifying abnormal power patterns. We evaluate it using five million unlabeled power datasets from the Cori computation system at NERSC and report on the accuracy of the anomaly detection algorithms in detecting different anomalous behaviors in the amount of power consumed. The methodology is intended to aid in monitoring and in automating power alerting, with the goal of achieving power efficiency and reliability in future systems deployed at NERSC or other HPC facilities.
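The abstract does not name the anomaly detection algorithms used, so the following is only an illustrative sketch of what flagging abnormal values in unlabeled power readings can look like. It uses a median absolute deviation (MAD) rule, a common unsupervised baseline chosen here as an assumption and not taken from the paper. The function name `mad_anomalies`, the threshold of 3.5, and the sample readings are all hypothetical.

```python
from statistics import median


def mad_anomalies(readings, threshold=3.5):
    """Return indices of readings whose modified z-score exceeds `threshold`.

    Illustrative baseline only; not the method evaluated in the paper.
    """
    med = median(readings)
    # Median absolute deviation: a spread estimate that is robust to outliers.
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        # All readings (or most of them) are identical; nothing to flag.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # under a normal distribution.
    return [
        i for i, x in enumerate(readings)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical power readings in kW: one spike and one drop among steady values.
power_kw = [310.0, 312.5, 308.9, 311.2, 309.8, 455.0, 310.4, 120.3, 311.9]
print(mad_anomalies(power_kw))  # prints [5, 7]
```

A median-based rule needs no labels and is not skewed by the very outliers it is meant to find, which suits unlabeled monitoring data. A production pipeline would more likely apply such a rule over sliding time windows than over a whole series at once.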