Optimizing MPI Collectives on Intel MIC Through Effective Use of Cache

Pinak Panigrahi, Sriram Kanchiraju, A. Srinivasan, P. K. Baruah, C. D. Sudheer
2014 International Conference on Parallel, Distributed and Grid Computing, published 2014-12-01
DOI: 10.1109/PDGC.2014.7030721 (https://doi.org/10.1109/PDGC.2014.7030721)
Citations: 2

Abstract

The Intel MIC architecture, implemented in the Xeon Phi coprocessor, is targeted at highly parallel applications. To exploit it, one needs to make full use of simultaneous multi-threading, which permits four simultaneous threads per core. Our results also show that, for small messages, the distributed tag directories can be a greater bottleneck than the ring when multiple threads access the same cache line. Careful design of algorithms and implementations based on these results can yield substantial performance improvements. We demonstrate these ideas by optimizing MPI collective calls. Compared with Intel's MPI implementation, we obtain a speedup of 9x on barrier and 10x on broadcast. We also show the usefulness of our collectives in two realistic codes: particle transport and the load-balancing phase in QMC. Another important contribution of our work lies in showing that optimization techniques used with programmer-controlled caches, such as double buffering, are also useful on MIC. These results can help optimize other communication-intensive codes running on MIC.
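The abstract names two ideas that are easy to illustrate in isolation: keeping threads off a shared cache line, and double buffering. The following is only a minimal sketch of those two ideas using ordinary C++ threads on shared memory, not the authors' implementation and not MPI code. The names `run_broadcast`, `Slot`, and `PaddedCounter` are hypothetical, the 64-byte line size is an assumption about the target hardware, and C++17 is assumed so that the over-aligned vector elements are allocated correctly.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size in bytes; adjust for the actual target.
constexpr std::size_t kCacheLine = 64;

// One acknowledgement counter per reader, each on its own cache line,
// so readers do not write to a line that another reader is using.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};  // last message number this reader consumed
};

// One broadcast buffer. Two of these form the double buffer.
struct alignas(kCacheLine) Slot {
    std::atomic<long> seq{0};  // number of the message stored here (0 = empty)
    long payload = 0;
};

// One writer broadcasts messages 1..num_messages to all readers.
// Returns true if every reader saw every payload intact and in order.
bool run_broadcast(int num_readers, long num_messages) {
    assert(num_readers > 0);
    Slot slots[2];
    std::vector<PaddedCounter> acked(num_readers);
    std::atomic<bool> ok{true};

    std::thread writer([&] {
        for (long m = 1; m <= num_messages; ++m) {
            Slot& s = slots[m % 2];
            // This slot last held message m-2. Wait until every reader has
            // consumed it. Readers may still be busy with message m-1 in
            // the other slot, which is the overlap double buffering buys.
            for (auto& a : acked) {
                while (a.value.load(std::memory_order_acquire) < m - 2) {
                    std::this_thread::yield();
                }
            }
            s.payload = m * 10;
            s.seq.store(m, std::memory_order_release);  // publish message m
        }
    });

    std::vector<std::thread> readers;
    for (int r = 0; r < num_readers; ++r) {
        readers.emplace_back([&, r] {
            for (long m = 1; m <= num_messages; ++m) {
                Slot& s = slots[m % 2];
                while (s.seq.load(std::memory_order_acquire) != m) {
                    std::this_thread::yield();
                }
                if (s.payload != m * 10) ok.store(false);
                // Acknowledge on this reader's private cache line.
                acked[r].value.store(m, std::memory_order_release);
            }
        });
    }

    writer.join();
    for (auto& t : readers) t.join();
    return ok.load();
}
```

With a single buffer the writer would have to wait for every reader to finish message m-1 before writing message m. With two slots it only waits for message m-2, so writing and reading overlap. Whether this pays off on a given machine, and by how much, has to be measured. The sketch makes no performance claim.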