{"title":"Active Measurement of the Impact of Network Switch Utilization on Application Performance","authors":"Marc Casas, G. Bronevetsky","doi":"10.1109/IPDPS.2014.28","DOIUrl":null,"url":null,"abstract":"Inter-node networks are a key capability of High-Performance Computing (HPC) systems that differentiates them from less capable classes of machines. However, in spite of their very high performance, the increasing computational power of HPC compute nodes and the associated rise in application communication needs make network performance a common performance bottleneck. To achieve high performance in spite of network limitations application developers require tools to measure their applications' network utilization and inform them about how the network's communication capacity relates to the performance of their applications. This paper presents a new performance measurement and analysis methodology based on empirical measurements of network behavior. Our approach uses two benchmarks that inject extra network communication. The first probes the fraction of the network that is utilized by a software component (an application or an individual task) to determine the existence and severity of network contention. The second aggressively injects network traffic while a software component runs to evaluate its performance on less capable networks or when it shares the network with other software components. We then combine the information from the two types of experiments to predict the performance slowdown experienced by multiple software components (e.g. multiple processes of a single MPI application) when they share a single network. Our methodology is applied to individual network switches and demonstrated taking 6 representative HPC applications and predicting the performance slowdowns of the 36 possible application pairs. The average error of our predictions is less than 10%.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2014.28","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 10
Abstract
Inter-node networks are a key capability of High-Performance Computing (HPC) systems that differentiates them from less capable classes of machines. However, despite their very high performance, the increasing computational power of HPC compute nodes and the associated rise in application communication needs make the network a common performance bottleneck. To achieve high performance in spite of network limitations, application developers require tools that measure their applications' network utilization and show how the network's communication capacity relates to application performance. This paper presents a new performance measurement and analysis methodology based on empirical measurements of network behavior. Our approach uses two benchmarks that inject extra network communication. The first probes the fraction of the network that is utilized by a software component (an application or an individual task) to determine the existence and severity of network contention. The second aggressively injects network traffic while a software component runs in order to evaluate its performance on less capable networks or when it shares the network with other software components. We then combine the information from the two types of experiments to predict the performance slowdown experienced by multiple software components (e.g. multiple processes of a single MPI application) when they share a single network. We apply our methodology to individual network switches and demonstrate it on 6 representative HPC applications, predicting the performance slowdowns of the 36 possible application pairs. The average error of our predictions is less than 10%.
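To make the traffic-injection idea concrete, the following is a minimal sketch of what such a benchmark might look like: pairs of MPI ranks stream large messages across the switch to consume link bandwidth while the measured application runs, then report the bandwidth they achieved. This is an illustrative assumption, not the authors' actual benchmark; the message size, round count, and even/odd pairing scheme are arbitrary choices made only for the example.

/* Illustrative traffic-injection microbenchmark (assumed, not the paper's code).
 * Even-numbered ranks stream large messages to their odd-numbered neighbors
 * for a fixed number of rounds, then report the achieved per-pair bandwidth.
 * MSG_BYTES and ROUNDS are assumptions chosen only for illustration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)  /* 4 MiB per message (assumed) */
#define ROUNDS    1000               /* number of injection rounds (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_BYTES);
    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;  /* pair neighbors */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (peer >= 0 && peer < size) {
        for (int i = 0; i < ROUNDS; i++) {
            if (rank % 2 == 0)
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }

    double elapsed = MPI_Wtime() - t0;
    if (rank % 2 == 0 && peer < size)
        printf("rank %d -> %d: %.2f MB/s\n", rank, peer,
               (double)MSG_BYTES * ROUNDS / elapsed / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Run concurrently with the application under study (for example, scheduled onto nodes attached to the same switch), the achieved bandwidth of such an injector gives a rough proxy for how much switch capacity is left over, which is the kind of signal the paper's methodology combines with utilization probes to predict contention-induced slowdowns.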