Early Experiences Co-Scheduling Work and Communication Tasks for Hybrid MPI+X Applications

2014 Workshop on Exascale MPI at Supercomputing Conference. Pub Date: 2014-11-16. DOI: 10.1109/ExaMPI.2014.6

Dylan T. Stark, R. Barrett, Ryan E. Grant, Stephen L. Olivier, K. Pedretti, C. Vaughan


Citations: 26

Abstract

Advances in node-level architecture and interconnect technology needed to reach extreme scale necessitate a reevaluation of long-standing models of computation, in particular bulk synchronous processing. The end of Dennard scaling and the subsequent increase in CPU core counts with each successive generation of general-purpose processors have made the ability to leverage parallelism for communication increasingly critical to the performance of future extreme-scale applications. However, the use of massive multithreading in combination with MPI remains an open research area, and many proposed approaches require code changes that can be infeasible for important large legacy applications already written in MPI. This paper covers the design and initial evaluation of an extension to a massive multithreading runtime system that supports dynamic parallelism; the extension interfaces with MPI to handle fine-grained parallel communication and communication-computation overlap. Our initial evaluation uses the ubiquitous stencil computation, in three dimensions, with the halo exchange as the driving example, which has a demonstrated tie to real code bases. The preliminary results suggest that even for a very well-studied and balanced workload and message exchange pattern, co-scheduling work and communication tasks is effective at significant levels of decomposition using up to 131,072 cores. Furthermore, we demonstrate useful communication-computation overlap when handling blocking send and receive calls, and show evidence suggesting that we can decrease the burstiness of network traffic, with a corresponding decrease in the rate of stalls (congestion) seen on the host link and network.
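The overlap pattern the abstract describes can be illustrated with a small sketch. This is not the paper's runtime system or its MPI interface: it is a toy, one-dimensional, three-point stencil (the paper evaluates three dimensions), with Python threads standing in for ranks and `queue.Queue` mailboxes standing in for MPI messages. All function names (`run_rank`, `parallel_step`, and so on) are hypothetical. Each simulated rank first posts its boundary values, then computes the interior points that need no halo data while the exchange is in flight, and only then blocks on the halo values to finish its boundary points.

```python
import queue
import threading


def stencil_point(left, center, right):
    """Three-point averaging stencil (an assumed, illustrative kernel)."""
    return (left + center + right) / 3.0


def serial_step(u):
    """Reference sweep over the whole domain; the two end points stay fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = stencil_point(u[i - 1], u[i], u[i + 1])
    return out


def run_rank(rank, nranks, local, mailboxes, results):
    """One simulated rank: post halo values, overlap interior work, then finish."""
    n = len(local)
    # Communication task: hand boundary values to neighbours (stand-in for a send).
    if rank > 0:
        mailboxes[(rank, rank - 1)].put(local[0])
    if rank < nranks - 1:
        mailboxes[(rank, rank + 1)].put(local[-1])

    out = list(local)
    # Work task: interior points depend only on local data, so this loop
    # can proceed while the halo values are still in flight.
    for i in range(1, n - 1):
        out[i] = stencil_point(local[i - 1], local[i], local[i + 1])

    # Blocking receive stand-in, followed by the halo-dependent boundary points.
    if rank > 0:
        left = mailboxes[(rank - 1, rank)].get()
        out[0] = stencil_point(left, local[0], local[1])
    if rank < nranks - 1:
        right = mailboxes[(rank + 1, rank)].get()
        out[-1] = stencil_point(local[-2], local[-1], right)
    results[rank] = out


def parallel_step(u, nranks):
    """Split u evenly across simulated ranks and run one overlapped sweep."""
    chunk = len(u) // nranks
    if chunk < 2 or chunk * nranks != len(u):
        raise ValueError("need an even split with at least 2 points per rank")
    # One mailbox per directed neighbour pair (sender, receiver).
    mailboxes = {}
    for r in range(nranks - 1):
        mailboxes[(r, r + 1)] = queue.Queue()
        mailboxes[(r + 1, r)] = queue.Queue()
    results = [None] * nranks
    threads = [
        threading.Thread(
            target=run_rank,
            args=(r, nranks, u[r * chunk:(r + 1) * chunk], mailboxes, results),
        )
        for r in range(nranks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [x for part in results for x in part]


if __name__ == "__main__":
    domain = [float(i * i) for i in range(16)]
    print(parallel_step(domain, 4) == serial_step(domain))
```

The sketch only shows the ordering that makes overlap possible (send, interior work, receive, boundary work). It says nothing about performance, and a real hybrid MPI+X code would use actual MPI calls and a tasking runtime rather than Python threads.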


