Total recall: holistic metrics for broad systems performance and user experience visibility in a data-intensive computing environment

HUST '15. Pub Date: 2015-11-15. DOI: 10.1145/2834996.2835001
Erich Birngruber, Petar Forai, Aaron Zauner
Citations: 5

Abstract

User support personnel, systems engineers, and administrators of HPC installations need to be aware of log and telemetry information from different systems in order to perform routine tasks ranging from systems management to user inquiries. We present an integrated, distributed, HPC-tailored monitoring system, based on a current-generation software stack from the DevOps community, with integration into the workload management system. The goal of this system is to provide a quicker turnaround time for user inquiries in response to errors. Dashboards provide an overlay of system and node-level events on top of correlated metrics data. This information is directly available for querying, manipulation, and filtering, allowing statistical analysis and aggregation of collected data. Furthermore, additional dashboards offer insight into how users are interacting with available resources and pinpoint fluctuations in utilization. The system can integrate sources of information from other monitoring solutions and event-based sources.
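
The abstract does not name the components or data model behind the dashboards, so the following is only a minimal sketch of the general idea of overlaying node-level events on metrics and aggregating utilization. It is not the authors' implementation. The record types `MetricSample` and `Event`, the functions `overlay_events` and `mean_utilization`, the 60-second window, and all sample data are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MetricSample:
    """Hypothetical metric record; not taken from the paper."""
    node: str
    timestamp: int   # seconds since an arbitrary start
    cpu_util: float  # fraction of CPU in use, 0.0 to 1.0


@dataclass(frozen=True)
class Event:
    """Hypothetical node-level event, e.g. a log line from a failed job."""
    node: str
    timestamp: int
    message: str


def overlay_events(samples, events, window=60):
    """Pair each event with the same node's samples within `window` seconds."""
    by_node = defaultdict(list)
    for sample in samples:
        by_node[sample.node].append(sample)
    overlay = []
    for event in events:
        nearby = [
            s for s in by_node[event.node]
            if abs(s.timestamp - event.timestamp) <= window
        ]
        overlay.append((event, nearby))
    return overlay


def mean_utilization(samples):
    """Aggregate samples into a mean CPU utilization per node."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample.node].append(sample.cpu_util)
    return {node: mean(values) for node, values in grouped.items()}


# Made-up data: two nodes and one failure event on node "n01".
samples = [
    MetricSample("n01", 0, 0.25),
    MetricSample("n01", 60, 0.75),
    MetricSample("n01", 300, 0.5),
    MetricSample("n02", 0, 0.5),
]
events = [Event("n01", 30, "job failed: out of memory")]

overlay = overlay_events(samples, events)
utilization = mean_utilization(samples)
```

Here `overlay` pairs the single event with the two `n01` samples taken within a minute of it, which is the kind of correlation a support engineer might want when answering a user inquiry about a failed job, and `utilization` holds one mean value per node.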