Intel Xeon Phi处理器OpenSHMEM中基于本地模式的远程内存访问优化

International Conference on Partitioned Global Address Space Programming Models Pub Date : 2014-10-06 DOI:10.1145/2676870.2676881

N. Namashivayam, Sayan Ghosh, Dounia Khaldi, Deepak Eachempati, B. Chapman

{"title":"Intel Xeon Phi处理器OpenSHMEM中基于本地模式的远程内存访问优化","authors":"N. Namashivayam, Sayan Ghosh, Dounia Khaldi, Deepak Eachempati, B. Chapman","doi":"10.1145/2676870.2676881","DOIUrl":null,"url":null,"abstract":"OpenSHMEM is a PGAS library that aims to deliver high performance while retaining portability. Communication operations are a major obstacle to scalable parallel performance and are highly dependent on the target architecture. However, to date there has been no work on how to efficiently support OpenSHMEM running natively on Intel Xeon Phi, a highly-parallel, power-efficient and widely-used many-core architecture. Given the importance of communication in parallel architectures, this paper describes a novel methodology for optimizing remote-memory accesses for execution of OpenSHMEM programs on Intel Xeon Phi processors.\n In native mode, we can exploit the Xeon Phi shared memory and convert OpenSHMEM one-sided communication calls into local load/store statements using the shmem_ptr routine. This approach makes it possible for the compiler to perform essential optimizations for Xeon Phi such as vectorization. To the best of our knowledge, this is the first time the impact of shmem_ptr is analyzed thoroughly on a many-core system. We show the benefits of this approach on the PGAS-Microbenchmarks we specifically developed for this research. Our results exhibit a decrease in latency for one-sided communication operations by up to 60% and increase in bandwidth by up to 12x. Moreover, we study different reduction algorithms and exploit local load/store to optimize data transfers in these algorithms for Xeon Phi which permits improvement of up to 22% compared to MVAPICH and up to 60% compared to Intel MPI. Apart from microbenchmarks, experimental results on NAS IS and SP benchmarks show that performance gains of up to 20x are possible.","PeriodicalId":245693,"journal":{"name":"International Conference on Partitioned Global Address Space Programming Models","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi\",\"authors\":\"N. Namashivayam, Sayan Ghosh, Dounia Khaldi, Deepak Eachempati, B. Chapman\",\"doi\":\"10.1145/2676870.2676881\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"OpenSHMEM is a PGAS library that aims to deliver high performance while retaining portability. Communication operations are a major obstacle to scalable parallel performance and are highly dependent on the target architecture. However, to date there has been no work on how to efficiently support OpenSHMEM running natively on Intel Xeon Phi, a highly-parallel, power-efficient and widely-used many-core architecture. Given the importance of communication in parallel architectures, this paper describes a novel methodology for optimizing remote-memory accesses for execution of OpenSHMEM programs on Intel Xeon Phi processors.\\n In native mode, we can exploit the Xeon Phi shared memory and convert OpenSHMEM one-sided communication calls into local load/store statements using the shmem_ptr routine. This approach makes it possible for the compiler to perform essential optimizations for Xeon Phi such as vectorization. To the best of our knowledge, this is the first time the impact of shmem_ptr is analyzed thoroughly on a many-core system. We show the benefits of this approach on the PGAS-Microbenchmarks we specifically developed for this research. Our results exhibit a decrease in latency for one-sided communication operations by up to 60% and increase in bandwidth by up to 12x. Moreover, we study different reduction algorithms and exploit local load/store to optimize data transfers in these algorithms for Xeon Phi which permits improvement of up to 22% compared to MVAPICH and up to 60% compared to Intel MPI. Apart from microbenchmarks, experimental results on NAS IS and SP benchmarks show that performance gains of up to 20x are possible.\",\"PeriodicalId\":245693,\"journal\":{\"name\":\"International Conference on Partitioned Global Address Space Programming Models\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Partitioned Global Address Space Programming Models\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2676870.2676881\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Partitioned Global Address Space Programming Models","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2676870.2676881","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 8

摘要

OpenSHMEM是一个PGAS库，旨在提供高性能的同时保持可移植性。通信操作是可扩展并行性能的主要障碍，并且高度依赖于目标体系结构。然而，到目前为止，还没有关于如何有效地支持在Intel Xeon Phi(一种高度并行、节能且广泛使用的多核架构)上运行的OpenSHMEM的工作。鉴于通信在并行架构中的重要性，本文描述了一种在Intel Xeon Phi处理器上优化OpenSHMEM程序执行的远程内存访问的新方法。在本机模式下，我们可以利用Xeon Phi共享内存，并使用shmem_ptr例程将OpenSHMEM单侧通信调用转换为本地加载/存储语句。这种方法使编译器能够为Xeon Phi执行必要的优化，例如向量化。据我们所知，这是第一次在多核系统上彻底分析shmem_ptr的影响。我们在专门为此研究开发的pgas - microbenchmark上展示了这种方法的好处。我们的结果显示，单侧通信操作的延迟减少了60%，带宽增加了12倍。此外，我们研究了不同的约简算法，并利用本地负载/存储来优化Xeon Phi的这些算法中的数据传输，与MVAPICH相比可提高22%，与英特尔MPI相比可提高60%。除了微基准测试，NAS IS和SP基准测试的实验结果表明，性能提升高达20倍是可能的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi

OpenSHMEM is a PGAS library that aims to deliver high performance while retaining portability. Communication operations are a major obstacle to scalable parallel performance and are highly dependent on the target architecture. However, to date there has been no work on how to efficiently support OpenSHMEM running natively on Intel Xeon Phi, a highly-parallel, power-efficient and widely-used many-core architecture. Given the importance of communication in parallel architectures, this paper describes a novel methodology for optimizing remote-memory accesses for execution of OpenSHMEM programs on Intel Xeon Phi processors. In native mode, we can exploit the Xeon Phi shared memory and convert OpenSHMEM one-sided communication calls into local load/store statements using the shmem_ptr routine. This approach makes it possible for the compiler to perform essential optimizations for Xeon Phi such as vectorization. To the best of our knowledge, this is the first time the impact of shmem_ptr is analyzed thoroughly on a many-core system. We show the benefits of this approach on the PGAS-Microbenchmarks we specifically developed for this research. Our results exhibit a decrease in latency for one-sided communication operations by up to 60% and increase in bandwidth by up to 12x. Moreover, we study different reduction algorithms and exploit local load/store to optimize data transfers in these algorithms for Xeon Phi which permits improvement of up to 22% compared to MVAPICH and up to 60% compared to Intel MPI. Apart from microbenchmarks, experimental results on NAS IS and SP benchmarks show that performance gains of up to 20x are possible.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Conference on Partitioned Global Address Space Programming Models

自引率

0.00%

发文量