Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin
arXiv - CS - Performance, published 2024-07-01. DOI: arxiv-2407.01481 (https://doi.org/arxiv-2407.01481)
LLload: Simplifying Real-Time Job Monitoring for HPC Users
One of the more complex tasks for researchers using HPC systems is
performance monitoring and tuning of their applications. Developing a practice
of continuous performance improvement, both for speed-up and for efficient use of
resources, is essential to the long-term success of both the HPC practitioner
and the research project. Profiling tools provide a nice view of the
performance of an application but often have a steep learning curve and rarely
provide an easy-to-interpret view of resource utilization. Lower-level tools
such as top and htop provide a view of resource utilization for those familiar
and comfortable with Linux but present a barrier for newer HPC practitioners. To expand
the existing profiling and job monitoring options, the MIT Lincoln Laboratory
Supercomputing Center created LLload, a tool that captures a snapshot of the
resources being used by a job on a per-user basis. LLload is a tool built from
standard HPC tools that provides an easy way for a researcher to track resource
usage of active jobs. We explain how the tool was designed and implemented and
provide insight into how it is used to aid new researchers in developing their
performance monitoring skills as well as guide researchers in their resource
requests.
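
To make the idea of a per-user resource snapshot concrete, the following is a minimal sketch and not LLload's actual implementation, which the abstract does not describe. It assumes a Linux node where the standard `ps -eo user,pcpu,rss` command is available, and the function names `summarize_by_user` and `snapshot` are hypothetical, chosen only for illustration.

```python
import subprocess
from collections import defaultdict


def summarize_by_user(ps_output: str) -> dict:
    """Aggregate %CPU and resident memory per user.

    Expects the text produced by `ps -eo user,pcpu,rss`:
    a header row followed by one row per process, with RSS in KiB.
    """
    totals = defaultdict(
        lambda: {"cpu_percent": 0.0, "rss_mib": 0.0, "processes": 0}
    )
    # Skip the header row, then accumulate one process per line.
    for line in ps_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore malformed rows rather than fail
        user, pcpu, rss_kib = fields
        entry = totals[user]
        entry["cpu_percent"] += float(pcpu)
        entry["rss_mib"] += int(rss_kib) / 1024  # KiB -> MiB
        entry["processes"] += 1
    return dict(totals)


def snapshot() -> dict:
    """Take one snapshot on the local node (assumes a Linux `ps`)."""
    result = subprocess.run(
        ["ps", "-eo", "user,pcpu,rss"],
        capture_output=True,
        text=True,
        check=True,
    )
    return summarize_by_user(result.stdout)
```

Calling `snapshot()` on a compute node returns a dictionary keyed by user name. A tool for real HPC use would additionally need to map processes to scheduler jobs and gather data across nodes, which this sketch does not attempt.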