Athanasios Chatzidimitriou, G. Papadimitriou, D. Gizopoulos
HealthLog Monitor: A Flexible System-Monitoring Linux Service
2018 IEEE 24th International Symposium on On-Line Testing And Robust System Design (IOLTS), published 2018-07-01
DOI: 10.1109/IOLTS.2018.8474119 (https://doi.org/10.1109/IOLTS.2018.8474119)
Citation count: 4
Abstract
Error monitoring is a critical procedure for most computing systems, ranging from HPC to embedded systems. Several generic architectures have been proposed and employed in modern processors, offering hardware-level error detection. This information is required to isolate and/or mitigate failures. However, research has revealed many cases where indications of upcoming failures, known as symptoms, can be identified before the failure actually occurs. Such cases become more frequent as technology trends try to exploit the conservative worst-case voltage guardbands and push computing systems towards more aggressive and often hazardous operating regions. In this paper we present HealthLog monitor, a flexible system-monitoring service that offers a generic abstraction layer to combine error and symptom monitoring. HealthLog can monitor hardware measurements (performance, sensors and errors) as well as external health-related data, allowing combined symptom description and reaction features supported by an API. The goal of the monitor is to offer a universal standard for error reporting and system monitoring mechanisms across all system layers. The current version of HealthLog was developed and tested on AppliedMicro’s X-Gene 2 micro-server, but it is a cross-platform solution, as it does not depend on a specific architecture. This work demonstrates how platform events, software metrics and external peripheral mechanisms can be combined to deliver early warnings of upcoming failures and trigger evasive reactions.