通过有效使用缓存优化intel MIC上的MPI集合

2014 International Conference on Parallel, Distributed and Grid Computing Pub Date : 2014-12-01 DOI:10.1109/PDGC.2014.7030721

Pinak Panigrahi, Sriram Kanchiraju, A. Srinivasan, P. K. Baruah, C. D. Sudheer

{"title":"通过有效使用缓存优化intel MIC上的MPI集合","authors":"Pinak Panigrahi, Sriram Kanchiraju, A. Srinivasan, P. K. Baruah, C. D. Sudheer","doi":"10.1109/PDGC.2014.7030721","DOIUrl":null,"url":null,"abstract":"The Intel MIC architecture, implemented in the Xeon Phi coprocessor, is targeted at highly parallel applications. In order to exploit it, one needs to make full use of simultaneous multi-threading, which permits four simultaneous threads per core. Our results also show that distributed tag directories can be a greater bottleneck than the ring for small messages when multiple threads access the same cache line. Careful design of algorithms and implementations based on these results can yield substantial performance improvement. We demonstrate these ideas by optimizing MPI collective calls. We obtain a speedup of 9x on barrier and a speed-up of 10x on broadcast, when compared with Intel's MPI implementation. We also show the usefulness of our collectives in two realistic codes: particle transport and the load balancing phase in QMC. Another important contribution of our work lies in showing that optimization techniques - such as double buffering - used with programmer controlled caches are also useful on MIC. These results can help optimize other communication intensive codes running on MIC.","PeriodicalId":311953,"journal":{"name":"2014 International Conference on Parallel, Distributed and Grid Computing","volume":"89 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Optimizing MPI collectives on intel MIC through effective use of cache\",\"authors\":\"Pinak Panigrahi, Sriram Kanchiraju, A. Srinivasan, P. K. Baruah, C. D. Sudheer\",\"doi\":\"10.1109/PDGC.2014.7030721\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Intel MIC architecture, implemented in the Xeon Phi coprocessor, is targeted at highly parallel applications. In order to exploit it, one needs to make full use of simultaneous multi-threading, which permits four simultaneous threads per core. Our results also show that distributed tag directories can be a greater bottleneck than the ring for small messages when multiple threads access the same cache line. Careful design of algorithms and implementations based on these results can yield substantial performance improvement. We demonstrate these ideas by optimizing MPI collective calls. We obtain a speedup of 9x on barrier and a speed-up of 10x on broadcast, when compared with Intel's MPI implementation. We also show the usefulness of our collectives in two realistic codes: particle transport and the load balancing phase in QMC. Another important contribution of our work lies in showing that optimization techniques - such as double buffering - used with programmer controlled caches are also useful on MIC. These results can help optimize other communication intensive codes running on MIC.\",\"PeriodicalId\":311953,\"journal\":{\"name\":\"2014 International Conference on Parallel, Distributed and Grid Computing\",\"volume\":\"89 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 International Conference on Parallel, Distributed and Grid Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PDGC.2014.7030721\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Parallel, Distributed and Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDGC.2014.7030721","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

Intel MIC架构在Xeon Phi协处理器中实现，针对高度并行应用。为了利用它，需要充分利用并发多线程，它允许每个核心同时有四个线程。我们的结果还表明，当多个线程访问相同的缓存线时，分布式标记目录可能是比小消息环路更大的瓶颈。基于这些结果仔细设计算法和实现可以产生实质性的性能改进。我们通过优化MPI集合调用来演示这些想法。与英特尔的MPI实现相比，我们在屏障上获得了9倍的加速，在广播上获得了10倍的加速。我们还在两个现实的代码中展示了我们的集体的有用性:粒子传输和QMC中的负载平衡阶段。我们工作的另一个重要贡献在于展示了与程序员控制的缓存一起使用的优化技术(例如双缓冲)在MIC上也很有用。这些结果可以帮助优化运行在MIC上的其他通信密集型代码。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimizing MPI collectives on intel MIC through effective use of cache

The Intel MIC architecture, implemented in the Xeon Phi coprocessor, is targeted at highly parallel applications. In order to exploit it, one needs to make full use of simultaneous multi-threading, which permits four simultaneous threads per core. Our results also show that distributed tag directories can be a greater bottleneck than the ring for small messages when multiple threads access the same cache line. Careful design of algorithms and implementations based on these results can yield substantial performance improvement. We demonstrate these ideas by optimizing MPI collective calls. We obtain a speedup of 9x on barrier and a speed-up of 10x on broadcast, when compared with Intel's MPI implementation. We also show the usefulness of our collectives in two realistic codes: particle transport and the load balancing phase in QMC. Another important contribution of our work lies in showing that optimization techniques - such as double buffering - used with programmer controlled caches are also useful on MIC. These results can help optimize other communication intensive codes running on MIC.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 International Conference on Parallel, Distributed and Grid Computing

自引率

0.00%

发文量