Characterization and identification of HPC applications at leadership computing facility

Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-06-29. DOI: 10.1145/3392717.3392774

Zhengchun Liu, Ryan Lewis, R. Kettimuthu, K. Harms, P. Carns, N. Rao, Ian T Foster, M. Papka


Citations: 20

Abstract

High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive but essential for running large HPC applications. The Petascale era of supercomputers began in 2008, with the first machines achieving performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, as the resource demands of applications vary. A deep understanding of the characteristics of the applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development and operation. To improve our understanding of HPC applications, user demands and resource usage characteristics, we perform correlative analysis of various logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while the applications spend significant time on MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O. Based on holistic insights into the applications, gained through combined analysis of multiple logs from different perspectives and through general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features and finally train machine learning models to identify HPC applications or group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.
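The idea of joining several logs per job and turning the result into a "fingerprint" can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three log dictionaries, their field names, the values, and the three chosen features are hypothetical, since the abstract does not list the actual features. The nearest-neighbor lookup is a simple stand-in for the trained machine learning models mentioned above, not the authors' method, and the t-SNE validation step is not shown.

```python
from math import dist

# Hypothetical per-job records from three log sources, keyed by job ID.
# Field names and values are invented for illustration only.
scheduler_log = {
    "job1": {"app": "sim_a", "runtime_s": 1000.0},
    "job2": {"app": "sim_a", "runtime_s": 2000.0},
    "job3": {"app": "ml_b", "runtime_s": 500.0},
    "job4": {"app": None, "runtime_s": 1500.0},  # unlabeled job to identify
}
io_log = {
    "job1": {"io_time_s": 20.0, "posix_bytes": 9e9, "mpiio_bytes": 1e9},
    "job2": {"io_time_s": 50.0, "posix_bytes": 8e9, "mpiio_bytes": 2e9},
    "job3": {"io_time_s": 150.0, "posix_bytes": 1e9, "mpiio_bytes": 9e9},
    "job4": {"io_time_s": 30.0, "posix_bytes": 17e9, "mpiio_bytes": 3e9},
}
mpi_log = {
    "job1": {"mpi_time_s": 400.0},
    "job2": {"mpi_time_s": 900.0},
    "job3": {"mpi_time_s": 50.0},
    "job4": {"mpi_time_s": 630.0},
}


def fingerprint(job_id):
    """Join the three logs on job_id and return a small feature vector:
    (fraction of runtime in file I/O, fraction of runtime in MPI,
    share of bytes moved through plain POSIX I/O)."""
    runtime = scheduler_log[job_id]["runtime_s"]
    io = io_log[job_id]
    total_bytes = io["posix_bytes"] + io["mpiio_bytes"]
    return (
        io["io_time_s"] / runtime,
        mpi_log[job_id]["mpi_time_s"] / runtime,
        io["posix_bytes"] / total_bytes if total_bytes else 0.0,
    )


def identify(job_id):
    """Label a job with the application of its nearest labeled fingerprint
    (Euclidean distance). A stand-in for a trained classifier."""
    target = fingerprint(job_id)
    labeled = [
        (other, record["app"])
        for other, record in scheduler_log.items()
        if record["app"] is not None and other != job_id
    ]
    _, nearest_app = min(labeled, key=lambda pair: dist(target, fingerprint(pair[0])))
    return nearest_app


if __name__ == "__main__":
    print(fingerprint("job4"))
    print(identify("job4"))
```

In this toy data the unlabeled job is MPI-heavy, I/O-light, and mostly POSIX, so it lands next to the two labeled runs of the same hypothetical application. With real logs the features would need scaling before any distance-based comparison.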

