Lightweight Measurement and Analysis of HPC Performance Variability

Jered Dominguez-Trujillo, Keira Haskins, S. J. Khouzani, Chris Leap, Sahba Tashakkori, Quincy Wofford, Trilce Estrada, P. Bridges, Patrick M. Widener
DOI: 10.1109/PMBS51919.2020.00011
Published in: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2020
Citations: 0

Abstract

Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impacts at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions which are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
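The abstract's central idea, focusing on the maximum length of distributed workload intervals rather than the full per-process runtime distribution, can be illustrated with a small simulation. The sketch below is not the paper's implementation; it simply models a bulk-synchronous workload in which each iteration ends when the slowest process finishes, so only the maximum interval length determines progress. The lognormal interval distribution and all parameter values are illustrative assumptions, not taken from the paper.

```python
import random
import statistics

def simulate_bsp_iterations(num_procs, num_iters, seed=0):
    """Simulate a bulk-synchronous workload: each iteration completes only
    when the slowest of num_procs processes finishes its local interval.

    Per-process interval lengths are drawn from a lognormal distribution
    (an illustrative choice, not from the paper)."""
    rng = random.Random(seed)
    max_intervals = []
    for _ in range(num_iters):
        intervals = [rng.lognormvariate(0.0, 0.25) for _ in range(num_procs)]
        # With an implicit barrier at the end of each iteration, only the
        # maximum interval length matters -- the quantity the approach
        # measures, instead of all num_procs individual runtimes.
        max_intervals.append(max(intervals))
    return max_intervals

if __name__ == "__main__":
    maxima = simulate_bsp_iterations(num_procs=64, num_iters=1000)
    single = [random.Random(1).lognormvariate(0.0, 0.25) for _ in range(1000)]
    print(f"mean single-process interval: {statistics.mean(single):.3f}")
    print(f"mean max over 64 processes:   {statistics.mean(maxima):.3f}")
```

Even with modest per-process variability, the mean of the per-iteration maximum sits well above the mean single-process interval, which is why synchronization amplifies variation at scale and why the maximum is the natural quantity to track.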