Work in Progress: Topic Modeling for HPC Job State Prediction
Authors: Alexandra DeLucia, Elisabeth Baseman
Published in: Proceedings of the First Workshop on Machine Learning for Computing Systems, 2018-06-12
DOI: https://doi.org/10.1145/3217871.3217874
Citations: 3
Abstract
As high performance computing approaches the exascale era, progress in automatic computer monitoring becomes increasingly important. Monitoring will no longer be able to rely only on human experts because of the overwhelming amount of monitoring data, such as system logs, job logs, and temperature reports. Because a human analyst cannot keep up with terabytes of monitoring data per day, we turn to techniques from the statistical machine learning community to assist with analysis of this data. Specifically, we use machine learning techniques to predict compute job outcomes using features extracted from system log messages. Our preliminary results show not only that statistical topics extracted from log messages provide a signal correlated with job outcome, but also that the correlation is strong enough for two canonical classification algorithms to achieve very high predictive performance using only topic distributions and basic temporal information as features.
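The pipeline described in the abstract (per-job topic distributions plus a temporal feature, fed to a classifier) can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the authors' method: the abstract does not name the topic model, so the sketch uses latent Dirichlet allocation with collapsed Gibbs sampling; the temporal feature is a hypothetical hour-of-day value; the nearest-centroid classifier is a stand-in for the two unnamed classification algorithms; and the log messages and labels are invented toy data.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (assumed topic model).

    Returns one topic distribution (a list summing to 1) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]               # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                             # total words assigned to topic k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wid = assign[d][i], word_id[w]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def build_features(thetas, hours):
    """Append a scaled hour-of-day (hypothetical temporal feature) to each topic distribution."""
    return [theta + [hour / 24.0] for theta, hour in zip(thetas, hours)]


def nearest_centroid_fit(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    counts = Counter(y)
    sums = {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], x)),
    )


# Invented toy data: tokenized log messages seen during each job, start hour, outcome.
logs = [
    "memory error ecc fault node down".split(),
    "job started job finished exit ok".split(),
    "ecc error memory fault kernel panic".split(),
    "job started checkpoint written exit ok".split(),
]
hours = [2, 14, 3, 15]
labels = ["failed", "completed", "failed", "completed"]

thetas = lda_gibbs(logs, num_topics=2)
X = build_features(thetas, hours)
model = nearest_centroid_fit(X, labels)
prediction = nearest_centroid_predict(model, X[0])
```

With real monitoring data, the topic model would be fit on a much larger log corpus and the classifier evaluated on held-out jobs; the toy run above only demonstrates the shape of the feature vectors, not predictive performance.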