You've Got Mail (YGM): Building Missing Asynchronous Communication Primitives
Benjamin W. Priest, Trevor Steil, G. Sanders, R. Pearce
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
DOI: 10.1109/IPDPSW.2019.00045
Citations: 11
Abstract
The Message Passing Interface (MPI) is the de facto standard for message handling in distributed computing. MPI collective communication schemes, in which many processors communicate with one another, depend upon synchronous handshake agreements. As a result, applications that rely on iterative collective communication move at the speed of their slowest processors. We describe a methodology for bootstrapping asynchronous communication primitives onto MPI, with an emphasis on the irregular and imbalanced all-to-all communication patterns found in many data analytics applications. In such applications, the communication payload between a pair of processors is often small, requiring message aggregation to perform well on modern networks. In this work, we develop novel routing schemes that divide routing logically into local and remote routing. In these schemes, each core on a node is responsible for handling all local node sends and/or receives with a subset of remote cores. Collective communications route messages through their designated intermediaries and are not influenced by the availability of cores not on their route. Unlike conventional synchronous collectives, cores participating in these schemes can enter the protocol when ready and exit once all of their sends and receives are processed. We demonstrate, using simple benchmarks, how this collective communication improves overall wall-clock performance, as well as bandwidth and core utilization, for applications with a high demand for arbitrary core-to-core communication and unequal computational load between cores.
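To make the aggregation and asynchronous-progress ideas concrete, the sketch below shows a minimal MPI "mailbox" in C++: each rank buffers small payloads per destination, flushes a buffer with a nonblocking synchronous-mode send once it crosses a threshold, polls for incoming batches while it works, and detects termination with a nonblocking barrier once its own sends have completed. This is not the paper's YGM implementation; the `Mailbox` type, the `kTag` and `kFlushSize` constants, and the flat rank-to-rank routing are illustrative assumptions, and YGM's split into local and remote routing stages with designated intermediary cores is omitted here.

```cpp
// A minimal sketch (not the paper's YGM implementation) of asynchronous,
// aggregated point-to-point messaging over MPI. Buffer sizes, the tag, and
// the int payload type are illustrative assumptions.
#include <mpi.h>

#include <cstddef>
#include <cstdio>
#include <vector>

constexpr int kTag = 42;                  // assumed message tag
constexpr std::size_t kFlushSize = 1024;  // assumed aggregation threshold (ints)

struct Mailbox {
  std::vector<std::vector<int>> out;        // one aggregation buffer per destination rank
  std::vector<std::vector<int>> in_flight;  // keeps send buffers alive until completion
  std::vector<MPI_Request> pending;         // outstanding nonblocking sends

  explicit Mailbox(int nranks) : out(nranks) {}

  // Queue a small payload for dest; flush once the buffer crosses the threshold.
  void send(int dest, int value) {
    out[dest].push_back(value);
    if (out[dest].size() >= kFlushSize) flush(dest);
  }

  // Hand the aggregated buffer to MPI with a synchronous-mode nonblocking send,
  // which completes only after the receiver has matched it.
  void flush(int dest) {
    if (out[dest].empty()) return;
    in_flight.push_back(std::move(out[dest]));
    out[dest].clear();
    pending.emplace_back();
    MPI_Issend(in_flight.back().data(), (int)in_flight.back().size(), MPI_INT,
               dest, kTag, MPI_COMM_WORLD, &pending.back());
  }

  // Drain any batches that have already arrived, without blocking.
  template <typename Handler>
  void poll(Handler handle) {
    while (true) {
      int flag = 0;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, kTag, MPI_COMM_WORLD, &flag, &status);
      if (!flag) break;
      int count = 0;
      MPI_Get_count(&status, MPI_INT, &count);
      std::vector<int> buf(count);
      MPI_Recv(buf.data(), count, MPI_INT, status.MPI_SOURCE, kTag,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      for (int v : buf) handle(status.MPI_SOURCE, v);
    }
  }
};

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nranks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  Mailbox mbox(nranks);
  long received = 0;
  auto count_it = [&](int /*src*/, int /*value*/) { ++received; };

  // Each rank queues a few values for every other rank, polling as it goes.
  for (int i = 0; i < 4; ++i)
    for (int dest = 0; dest < nranks; ++dest)
      if (dest != rank) {
        mbox.send(dest, rank * 100 + i);
        mbox.poll(count_it);
      }

  // Flush leftovers, then drain our own sends while still servicing receives
  // (blocking without polling could deadlock: synchronous-mode sends finish
  // only once the receiver matches them).
  for (int dest = 0; dest < nranks; ++dest) mbox.flush(dest);
  int sends_done = mbox.pending.empty() ? 1 : 0;
  while (!sends_done) {
    mbox.poll(count_it);
    MPI_Testall((int)mbox.pending.size(), mbox.pending.data(), &sends_done,
                MPI_STATUSES_IGNORE);
  }

  // Nonblocking-barrier termination: enter the barrier once local sends are
  // done, then keep polling until every rank has done the same.
  MPI_Request barrier;
  MPI_Ibarrier(MPI_COMM_WORLD, &barrier);
  int done = 0;
  while (!done) {
    mbox.poll(count_it);
    MPI_Test(&barrier, &done, MPI_STATUS_IGNORE);
  }

  std::printf("rank %d received %ld values\n", rank, received);
  MPI_Finalize();
  return 0;
}
```

Built with `mpicxx` and launched with, say, `mpirun -n 4`, each rank should report receiving four values from every other rank. The synchronous-mode sends (`MPI_Issend`) matter for correctness of the termination step: they complete only after the receiver matches them, so once every rank has drained its own sends and the nonblocking barrier completes, no message can still be undelivered.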