The Case of Performance Variability on Dragonfly-based Systems

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2020-05-01. DOI: 10.1109/IPDPS47924.2020.00096

A. Bhatele, Jayaraman J. Thiagarajan, Taylor L. Groves, Rushil Anirudh, Staci A. Smith, B. Cook, D. Lowenthal

Citations: 16

Abstract

Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology – specifically, Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.
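To make the feature-selection step concrete, here is a minimal sketch of recursive feature elimination on synthetic data. It is not the authors' implementation: the abstract does not say which model or counters were used, so the linear least-squares model, the feature names (`stalled_flits`, `job_size`, `neighbor_traffic`, `queue_wait`), and the generated data are all assumptions made for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_linear(X, y, ridge=1e-8):
    """Least-squares weights from the normal equations (X^T X + ridge*I) w = X^T y."""
    d = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    Xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(XtX, Xty)

def rfe(X, y, names, keep):
    """Refit on the surviving features and drop the one with the smallest
    absolute weight, until only `keep` features remain."""
    active = list(range(len(names)))
    while len(active) > keep:
        Xa = [[row[i] for i in active] for row in X]
        w = fit_linear(Xa, y)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    return [names[i] for i in active]

# Hypothetical per-run features; values are drawn from a standard normal so
# the fitted weights are directly comparable without extra scaling.
NAMES = ["stalled_flits", "job_size", "neighbor_traffic", "queue_wait"]

def make_synthetic_runs(n_runs=200, seed=0):
    """Synthetic runs whose deviation depends only on two of the four features."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_runs):
        row = [rng.gauss(0.0, 1.0) for _ in NAMES]
        deviation = 3.0 * row[0] + 2.0 * row[2] + rng.gauss(0.0, 0.1)
        X.append(row)
        y.append(deviation)
    return X, y

X, y = make_synthetic_runs()
selected = rfe(X, y, NAMES, keep=2)
print(selected)
```

Ranking features by weight magnitude is only meaningful when they share a scale, which is why the synthetic features are standardized by construction; real counter data would need normalizing first.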
