{"title":"总召回:在数据密集型计算环境中用于广泛系统性能和用户体验可见性的整体度量","authors":"Erich Birngruber, Petar Forai, Aaron Zauner","doi":"10.1145/2834996.2835001","DOIUrl":null,"url":null,"abstract":"User support personnel, systems engineers, and administrators of HPC installations need to be aware of log and telemetry information from different systems in order to perform routine tasks ranging from systems management to user inquiries. We present an integrated, distributed HPC tailored monitoring system, based on a current generation software stack from the DevOps community, with integration into the work load management system. The goal of this system is to provide a quicker turnaround time for user inquiries in response to errors. Dashboards provide an overlay of system and node level events on top of correlated metrics data. This information is directly available for querying, manipulation, and filtering, allowing statistical analysis and aggregation of collected data. Furthermore, additional dashboards offer in-sight into how users are interacting with available resources and pin-point fluctuations in utilization. The system can integrate sources of information from other monitoring solutions and event-based sources.","PeriodicalId":428233,"journal":{"name":"HUST '15","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Total recall: holistic metrics for broad systems performance and user experience visibility in a data-intensive computing environment\",\"authors\":\"Erich Birngruber, Petar Forai, Aaron Zauner\",\"doi\":\"10.1145/2834996.2835001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"User support personnel, systems engineers, and administrators of HPC installations need to be aware of log and telemetry information from different systems in order to perform routine tasks ranging from systems management to user inquiries. We present an integrated, distributed HPC tailored monitoring system, based on a current generation software stack from the DevOps community, with integration into the work load management system. The goal of this system is to provide a quicker turnaround time for user inquiries in response to errors. Dashboards provide an overlay of system and node level events on top of correlated metrics data. This information is directly available for querying, manipulation, and filtering, allowing statistical analysis and aggregation of collected data. Furthermore, additional dashboards offer in-sight into how users are interacting with available resources and pin-point fluctuations in utilization. The system can integrate sources of information from other monitoring solutions and event-based sources.\",\"PeriodicalId\":428233,\"journal\":{\"name\":\"HUST '15\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-11-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"HUST '15\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2834996.2835001\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"HUST '15","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2834996.2835001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Total recall: holistic metrics for broad systems performance and user experience visibility in a data-intensive computing environment
User support personnel, systems engineers, and administrators of HPC installations need to be aware of log and telemetry information from different systems in order to perform routine tasks ranging from systems management to user inquiries. We present an integrated, distributed HPC tailored monitoring system, based on a current generation software stack from the DevOps community, with integration into the work load management system. The goal of this system is to provide a quicker turnaround time for user inquiries in response to errors. Dashboards provide an overlay of system and node level events on top of correlated metrics data. This information is directly available for querying, manipulation, and filtering, allowing statistical analysis and aggregation of collected data. Furthermore, additional dashboards offer in-sight into how users are interacting with available resources and pin-point fluctuations in utilization. The system can integrate sources of information from other monitoring solutions and event-based sources.