{"title":"Monitoring trix","authors":"Christopher D. Maestas","doi":"10.1145/1188455.1188488","DOIUrl":null,"url":null,"abstract":"Monitoring tools have evolved greatly, but when mis-configured they can forget how to do intelligent, non-intrusive and accomplish just-in-time alerts. Non-intrusive collection windows arise during bootup, idle system time, before major workload events and after these events finish. Batch schedulers allow health checking during these opportune times. Most just-in-time alerts arrive via system logs and out-of-band queries that can then trigger appropriate actions. However, abusive out-of-band queries may interrupt normal operational activities.Some vendor and open implementations have been heavyweight in watch-dogging at a brutal cost on computation as systems start scaling to thousands of nodes. Configuring tools to query intelligently during certain opportunities and running only necessary daemons helps to meet monitoring goals. These tools and daemons can include HP's hpasm, Dell's OMSA, supermon, lm_sensors, nagios, ganglia, logsurfer/syslog-ng, torque health checks. Share your monitoring stories and learn how triggers implemented scale to 4000+ node systems.","PeriodicalId":115940,"journal":{"name":"Proceedings of the 2006 ACM/IEEE conference on Supercomputing","volume":"88 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2006 ACM/IEEE conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1188455.1188488","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Monitoring tools have evolved greatly, but when mis-configured they can forget how to do intelligent, non-intrusive and accomplish just-in-time alerts. Non-intrusive collection windows arise during bootup, idle system time, before major workload events and after these events finish. Batch schedulers allow health checking during these opportune times. Most just-in-time alerts arrive via system logs and out-of-band queries that can then trigger appropriate actions. However, abusive out-of-band queries may interrupt normal operational activities.Some vendor and open implementations have been heavyweight in watch-dogging at a brutal cost on computation as systems start scaling to thousands of nodes. Configuring tools to query intelligently during certain opportunities and running only necessary daemons helps to meet monitoring goals. These tools and daemons can include HP's hpasm, Dell's OMSA, supermon, lm_sensors, nagios, ganglia, logsurfer/syslog-ng, torque health checks. Share your monitoring stories and learn how triggers implemented scale to 4000+ node systems.