Lightweight Measurement and Analysis of HPC Performance Variability

Jered Dominguez-Trujillo, Keira Haskins, S. J. Khouzani, Chris Leap, Sahba Tashakkori, Quincy Wofford, Trilce Estrada, P. Bridges, Patrick M. Widener
DOI: 10.1109/PMBS51919.2020.00011
Published in: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2020
Citations: 0

Abstract

Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impacts at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions which are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
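The abstract's central idea, focusing on the maximum length of distributed workload intervals rather than the full per-process runtime distribution, can be illustrated with a small simulation. The sketch below is not the paper's implementation; it simply models a bulk-synchronous workload in which each iteration ends when the slowest process finishes, so only the maximum interval length determines progress. The lognormal interval distribution and all parameter values are illustrative assumptions, not taken from the paper.

```python
import random
import statistics

def simulate_bsp_iterations(num_procs, num_iters, seed=0):
    """Simulate a bulk-synchronous workload: each iteration completes only
    when the slowest of num_procs processes finishes its local interval.

    Per-process interval lengths are drawn from a lognormal distribution
    (an illustrative choice, not from the paper)."""
    rng = random.Random(seed)
    max_intervals = []
    for _ in range(num_iters):
        intervals = [rng.lognormvariate(0.0, 0.25) for _ in range(num_procs)]
        # With an implicit barrier at the end of each iteration, only the
        # maximum interval length matters -- the quantity the approach
        # measures, instead of all num_procs individual runtimes.
        max_intervals.append(max(intervals))
    return max_intervals

if __name__ == "__main__":
    maxima = simulate_bsp_iterations(num_procs=64, num_iters=1000)
    single = [random.Random(1).lognormvariate(0.0, 0.25) for _ in range(1000)]
    print(f"mean single-process interval: {statistics.mean(single):.3f}")
    print(f"mean max over 64 processes:   {statistics.mean(maxima):.3f}")
```

Even with modest per-process variability, the mean of the per-iteration maximum sits well above the mean single-process interval, which is why synchronization amplifies variation at scale and why the maximum is the natural quantity to track.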