Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters

K. Vaidyanathan, K. Pamnany, Dhiraj D. Kalamkar, A. Heinecke, M. Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha G. Shet, Bharat Kaul, B. Joó, P. Dubey
{"title":"Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters","authors":"K. Vaidyanathan, K. Pamnany, Dhiraj D. Kalamkar, A. Heinecke, M. Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha G. Shet, Bharat Kaul, B. Joó, P. Dubey","doi":"10.1109/IPDPS.2014.113","DOIUrl":null,"url":null,"abstract":"Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real world applications are significantly impacted by network characteristics and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communications latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for, and the impact of avoiding serialization and maximizing parallelism during application communication phases. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and of network bandwidth. We demonstrate the impact of our techniques on three well known and highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our optimizations for communication performance, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement from our previously best published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPs on 128 Intel Xeon Phi coprocessors, the first ever reported native HPL efficiency on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPs using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2014.113","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17

Abstract

Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real-world applications are significantly impacted by network characteristics, and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communication latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for, and the impact of, avoiding serialization and maximizing parallelism during application communication phases. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and of network bandwidth. We demonstrate the impact of our techniques on three well-known and highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our optimizations for communication performance, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement over our previously best published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPs on 128 Intel Xeon Phi coprocessors, the first ever reported native HPL efficiency on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPs using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.
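The overlap techniques named in the abstract (pipelined transfers, chunk sizing, asynchronous progress) share a common pattern: split a large message into chunks, post non-blocking transfers per chunk, and compute on already-received data while the next chunk is in flight. The following is a minimal sketch of that pattern using plain non-blocking MPI point-to-point calls. The fixed chunk size, the compute_on_chunk routine, and the pairwise exchange are illustrative assumptions only; they are not the authors' implementation, which targets the coprocessor's DMA engines and tunes the chunk size dynamically.

    /* Sketch: pipelined, chunked non-blocking exchange that overlaps
     * communication with computation.  CHUNK_ELEMS and compute_on_chunk()
     * are illustrative placeholders, not the paper's implementation. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N           (1 << 22)   /* total elements exchanged per pair   */
    #define CHUNK_ELEMS (1 << 18)   /* pipeline granularity (tunable)      */

    static void compute_on_chunk(double *buf, int n) {
        for (int i = 0; i < n; ++i)  /* placeholder work to overlap         */
            buf[i] = buf[i] * 1.0001 + 1.0;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *sendbuf = malloc(N * sizeof(double));
        double *recvbuf = malloc(N * sizeof(double));
        for (int i = 0; i < N; ++i) sendbuf[i] = (double)i;

        int peer = rank ^ 1;         /* simple pairwise exchange pattern   */
        if (peer < size) {
            int nchunks = N / CHUNK_ELEMS;
            MPI_Request reqs[2];
            for (int c = 0; c < nchunks; ++c) {
                double *s = sendbuf + c * CHUNK_ELEMS;
                double *r = recvbuf + c * CHUNK_ELEMS;
                /* Post the transfer for the current chunk ...             */
                MPI_Irecv(r, CHUNK_ELEMS, MPI_DOUBLE, peer, c, MPI_COMM_WORLD, &reqs[0]);
                MPI_Isend(s, CHUNK_ELEMS, MPI_DOUBLE, peer, c, MPI_COMM_WORLD, &reqs[1]);
                /* ... and compute on the previous, already-received chunk
                 * while the current one is in flight.                     */
                if (c > 0)
                    compute_on_chunk(recvbuf + (c - 1) * CHUNK_ELEMS, CHUNK_ELEMS);
                MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            }
            compute_on_chunk(recvbuf + (nchunks - 1) * CHUNK_ELEMS, CHUNK_ELEMS);
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

In practice the chunk size trades per-transfer startup overhead against overlap opportunity, which is the tension the paper's dynamic chunk sizing and improved asynchronous progress address.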