The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, Joshi Fullop, A. Gentile, Steve Monk, Nichamon Naksinehaboon, Jeff Ogden, M. Rajan, M. Showerman, J. Stevenson, Narate Taerat, Thomas W. Tucker
{"title":"The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications","authors":"A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, Joshi Fullop, A. Gentile, Steve Monk, Nichamon Naksinehaboon, Jeff Ogden, M. Rajan, M. Showerman, J. Stevenson, Narate Taerat, Thomas W. Tucker","doi":"10.1109/SC.2014.18","DOIUrl":null,"url":null,"abstract":"Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"179","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC.2014.18","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 179

Abstract

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.
轻量级分布式度量服务:用于持续监控大规模计算系统和应用程序的可扩展基础设施
了解应用程序如何单独或组合地利用高性能计算平台的资源是提高应用程序和平台性能的关键。典型的系统监视工具不能提供足够的保真度,而应用程序分析工具不能捕获争夺共享资源的应用程序之间复杂的相互作用。为了获得新的见解,监视工具必须在系统范围内以适当的频率连续运行,同时对应用程序性能产生最小的影响。我们介绍轻量级分布式度量服务,用于大规模计算系统和应用程序的可伸缩、轻量级监控。我们描述了在桑迪亚国家实验室的容量计算环境和国家超级计算应用中心的蓝水平台上指导部署的问题和限制,包括动机、选择的度量,以及与蓝水的规模和专业性质相关的需求。我们讨论了监控开销和对应用程序性能的影响,并提供了说明性分析结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信