{"title":"Optimized Reduce for Mesh-Based NoC Multiprocessors","authors":"A. Kohler, M. Radetzki","doi":"10.1109/IPDPSW.2012.111","DOIUrl":null,"url":null,"abstract":"Future processors are expected to be made up of a large number of computation cores interconnected by fast on-chip networks (Network-on-Chip, NoC). Such distributed structures motivate the use of message passing programming models similar to MPI. Since the properties of these networks, like e.g. the topology, are known and fixed after production, this knowledge can be used to optimize the communication stack. We describe two schemes that take advantage of this to accelerate the (All-)Reduce operation defined in MPI, namely a contention avoiding rank-to-core mapping and a way of interleaving communication and computation by means of pipelining. Simulations show that the combination of both schemes can accelerate (All-)Reduce operations by more than 60%.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2012.111","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
Future processors are expected to consist of a large number of computation cores interconnected by fast on-chip networks (Networks-on-Chip, NoCs). Such distributed structures motivate the use of message-passing programming models similar to MPI. Since the properties of these networks, such as the topology, are known and fixed after production, this knowledge can be used to optimize the communication stack. We describe two schemes that take advantage of this to accelerate the (All-)Reduce operation defined in MPI: a contention-avoiding rank-to-core mapping, and a way of interleaving communication and computation by means of pipelining. Simulations show that the combination of both schemes can accelerate (All-)Reduce operations by more than 60%.
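
To illustrate the pipelining idea, the following is a minimal sketch in C with MPI, not the paper's implementation: a Reduce along a linear chain of ranks whose buffer is split into segments, so that while downstream ranks reduce and forward segment s, upstream ranks are already sending segment s+1. The segment count NSEG, the chain topology, and the function name chain_reduce_sum are illustrative assumptions.

/* Hypothetical sketch (not the authors' code): pipelined sum-Reduce along
 * a chain of ranks 0..P-1; the final result accumulates at the last rank.
 * Splitting the buffer into NSEG segments lets communication of one
 * segment overlap with reduction of another across the chain. */
#include <mpi.h>
#include <stdlib.h>

#define NSEG 8  /* number of pipeline segments (assumed tuning parameter) */

void chain_reduce_sum(double *buf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int seg = (count + NSEG - 1) / NSEG;              /* segment length */
    double *tmp = malloc((size_t)seg * sizeof *tmp);

    for (int s = 0; s < NSEG; ++s) {
        int off = s * seg;
        int len = (off + seg <= count) ? seg : count - off;
        if (len <= 0) break;

        if (rank > 0) {
            /* Receive the partial sum for this segment from upstream
             * and fold it into the local buffer. */
            MPI_Recv(tmp, len, MPI_DOUBLE, rank - 1, s, comm,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < len; ++i)
                buf[off + i] += tmp[i];
        }
        if (rank < size - 1)
            /* Forward this segment downstream; the upstream rank can
             * meanwhile proceed to the next segment. */
            MPI_Send(buf + off, len, MPI_DOUBLE, rank + 1, s, comm);
    }
    free(tmp);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1 << 16 };
    double *buf = malloc(N * sizeof *buf);
    for (int i = 0; i < N; ++i) buf[i] = rank + 1.0;

    chain_reduce_sum(buf, N, MPI_COMM_WORLD);
    /* The last rank now holds the element-wise sum over all ranks. */

    free(buf);
    MPI_Finalize();
    return 0;
}

A chain maps naturally onto one dimension of a mesh NoC, which is why segmentation alone already overlaps link transfers with per-core reduction; the paper's schemes additionally choose the rank-to-core mapping to keep these transfers contention-free.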