LLload: Simplifying Real-Time Job Monitoring for HPC Users

Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin
arXiv - CS - Performance, published 2024-07-01. DOI: arxiv-2407.01481 (https://doi.org/arxiv-2407.01481)

Abstract

One of the more complex tasks for researchers using HPC systems is monitoring and tuning the performance of their applications. Developing a practice of continuous performance improvement, both for speed-up and for efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a good view of an application's performance but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux but present a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. LLload is built from standard HPC tools and provides an easy way for a researcher to track the resource usage of active jobs. We explain how the tool was designed and implemented, and provide insight into how it is used to help new researchers develop their performance monitoring skills and to guide researchers in their resource requests.
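The abstract does not describe LLload's implementation or its output format, so the following is only a minimal sketch of the general idea of a per-user resource snapshot. Every name here (`NodeSample`, `snapshot_by_user`, `format_report`, the field names, and the sample values) is a hypothetical illustration, not LLload's interface. The sketch assumes that per-node readings for each user's job have already been collected by some standard tool and shows how they might be rolled up into an easy-to-read summary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSample:
    """One hypothetical per-node reading for a single user's job."""
    user: str
    node: str
    cores_allocated: int
    load_average: float  # e.g. a 1-minute load average reported by the OS
    mem_used_gb: float
    mem_total_gb: float


def snapshot_by_user(samples):
    """Group node samples by user and compute simple utilization ratios."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample.user].append(sample)

    report = {}
    for user, rows in grouped.items():
        cores = sum(r.cores_allocated for r in rows)
        load = sum(r.load_average for r in rows)
        mem_used = sum(r.mem_used_gb for r in rows)
        mem_total = sum(r.mem_total_gb for r in rows)
        report[user] = {
            "nodes": sorted(r.node for r in rows),
            "cores_allocated": cores,
            # Load per allocated core: near 1.0 suggests the cores are busy,
            # far below 1.0 suggests the job requested more than it uses.
            "normalized_load": load / cores if cores else 0.0,
            "mem_fraction": mem_used / mem_total if mem_total else 0.0,
        }
    return report


def format_report(report):
    """Render one summary line per user, sorted by user name."""
    lines = []
    for user in sorted(report):
        r = report[user]
        lines.append(
            f"{user}: nodes={','.join(r['nodes'])} "
            f"cores={r['cores_allocated']} "
            f"load/core={r['normalized_load']:.2f} "
            f"mem={r['mem_fraction']:.0%}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    demo = [
        NodeSample("alice", "node-1", 8, 4.0, 16.0, 64.0),
        NodeSample("alice", "node-2", 8, 12.0, 48.0, 64.0),
        NodeSample("bob", "node-3", 4, 1.0, 8.0, 32.0),
    ]
    print(format_report(snapshot_by_user(demo)))
```

A summary of this kind could help a researcher judge a resource request: a load per core well below 1.0, or a small memory fraction, hints that a smaller allocation might suffice for the next run.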