{"title":"优化了基于网格的NoC多处理器的Reduce","authors":"A. Kohler, M. Radetzki","doi":"10.1109/IPDPSW.2012.111","DOIUrl":null,"url":null,"abstract":"Future processors are expected to be made up of a large number of computation cores interconnected by fast on-chip networks (Network-on-Chip, NoC). Such distributed structures motivate the use of message passing programming models similar to MPI. Since the properties of these networks, like e.g. the topology, are known and fixed after production, this knowledge can be used to optimize the communication stack. We describe two schemes that take advantage of this to accelerate the (All-)Reduce operation defined in MPI, namely a contention avoiding rank-to-core mapping and a way of interleaving communication and computation by means of pipelining. Simulations show that the combination of both schemes can accelerate (All-)Reduce operations by more than 60%.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Optimized Reduce for Mesh-Based NoC Multiprocessors\",\"authors\":\"A. Kohler, M. Radetzki\",\"doi\":\"10.1109/IPDPSW.2012.111\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Future processors are expected to be made up of a large number of computation cores interconnected by fast on-chip networks (Network-on-Chip, NoC). Such distributed structures motivate the use of message passing programming models similar to MPI. Since the properties of these networks, like e.g. the topology, are known and fixed after production, this knowledge can be used to optimize the communication stack. 
We describe two schemes that take advantage of this to accelerate the (All-)Reduce operation defined in MPI, namely a contention avoiding rank-to-core mapping and a way of interleaving communication and computation by means of pipelining. Simulations show that the combination of both schemes can accelerate (All-)Reduce operations by more than 60%.\",\"PeriodicalId\":378335,\"journal\":{\"name\":\"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum\",\"volume\":\"36 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-05-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2012.111\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2012.111","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Optimized Reduce for Mesh-Based NoC Multiprocessors
Future processors are expected to be made up of a large number of computation cores interconnected by fast on-chip networks (Network-on-Chip, NoC). Such distributed structures motivate the use of message-passing programming models similar to MPI. Since the properties of these networks, such as the topology, are known and fixed after production, this knowledge can be used to optimize the communication stack. We describe two schemes that take advantage of this to accelerate the (All-)Reduce operation defined in MPI: a contention-avoiding rank-to-core mapping, and a way of interleaving communication and computation by means of pipelining. Simulations show that the combination of both schemes can accelerate (All-)Reduce operations by more than 60%.
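The pipelining idea can be illustrated with a small sketch. This is not the authors' implementation (the paper targets an on-chip mesh network; function and variable names here are invented for illustration): the reduced vector is split into chunks so that, along a chain of ranks, one chunk can be combined and forwarded while the next is still in flight, which is what lets communication overlap with computation.

```python
def pipelined_chain_reduce(local_vectors, chunk_size):
    """Element-wise sum of per-rank vectors along a chain of ranks,
    simulated in-process.

    The message is split into chunks; each chunk flows down the chain
    independently, so in a real NoC rank k could already combine and
    forward chunk i while chunk i+1 is still arriving from rank k-1.
    """
    n = len(local_vectors[0])
    chunks = [(lo, min(lo + chunk_size, n)) for lo in range(0, n, chunk_size)]
    result = [0] * n
    for lo, hi in chunks:
        # Rank 0 injects its local chunk into the pipeline.
        partial = list(local_vectors[0][lo:hi])
        for rank in range(1, len(local_vectors)):
            # "Communication": the partial chunk arrives at the next rank.
            # "Computation": that rank adds its own local chunk.
            partial = [a + b for a, b in zip(partial, local_vectors[rank][lo:hi])]
        result[lo:hi] = partial
    return result

# 4 ranks, 8 elements each: rank r holds [10r, 10r+1, ..., 10r+7]
data = [[r * 10 + i for i in range(8)] for r in range(4)]
print(pipelined_chain_reduce(data, chunk_size=3))
# → [60, 64, 68, 72, 76, 80, 84, 88]
```

With an un-pipelined reduce, each hop must wait for the entire vector before forwarding, so latency grows roughly as hops × message size; with chunking, downstream ranks start computing after the first chunk arrives, which is the source of the overlap the paper exploits (the paper's simulated speedup of over 60% also includes the contention-avoiding mapping).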