Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2020-05-01 DOI:10.1109/ipdps47924.2020.00108

Hancheng Wu, M. Becchi

Pages: 1018-1029

Citations: 3

Abstract

Manycore processors such as GPUs and Intel Xeon Phis have become popular due to their massive parallelism and high power efficiency. To achieve optimal performance, it is necessary to optimize the use of the compute cores and of the memory system available on these devices. Previous work has proposed techniques to improve the use of GPU resources. While Intel Xeon Phi processors can provide massive parallelism through their x86 cores and vector units, optimization techniques for these platforms have received less consideration. In this work, we study the benefits of thread coarsening and low-cost synchronization on applications running on Intel Xeon Phi processors and encoded in SIMT fashion. Specifically, we explore thread coarsening as a way to remap the work to the available cores and vector lanes. In addition, we propose low-overhead synchronization primitives, such as atomic operations and barriers, which transparently apply to threads mapped to the same or different VPUs and x86 cores. Finally, we consider the combined use of thread coarsening and our proposed synchronization primitives. We evaluate the effect of these techniques on the performance of two kinds of kernels: collaborative and non-collaborative ones, the former using scratchpad memory to explicitly control data sharing among threads. Our evaluation leads to the following results. First, while not always beneficial for non-collaborative kernels, thread coarsening consistently improves the performance of collaborative kernels by reducing the synchronization overhead. Second, our synchronization primitives outperform standard pthread APIs by a factor of up to 8x in real-world benchmarks. Last, the combined use of the proposed techniques leads to performance improvements, especially for collaborative kernels.
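To make the two ideas in the abstract concrete, the following is a minimal C++ sketch, not the authors' implementation. It assumes a toy collaborative kernel in which each logical thread writes one scratchpad element, waits at a barrier, then reads its neighbor's element. The names `SpinBarrier` and `run_collaborative_kernel` are hypothetical, `std::thread` stands in for the mapping to x86 cores, and vector lanes are not modeled. The `coarsening` parameter merges that many logical threads into one worker, so fewer participants reach the barrier.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sense-reversing spin barrier (an assumption for this sketch,
// not the primitive proposed in the paper).
class SpinBarrier {
public:
    explicit SpinBarrier(int n) : total_(n), waiting_(n), sense_(false) {}

    void wait() {
        // The sense flips only after every participant has arrived, so all
        // participants of one phase read the same value here.
        const bool my_sense = !sense_.load(std::memory_order_relaxed);
        if (waiting_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arrival: reset the counter, then release the others.
            waiting_.store(total_, std::memory_order_relaxed);
            sense_.store(my_sense, std::memory_order_release);
        } else {
            while (sense_.load(std::memory_order_acquire) != my_sense) {
                std::this_thread::yield();
            }
        }
    }

private:
    const int total_;
    std::atomic<int> waiting_;
    std::atomic<bool> sense_;
};

// Hypothetical collaborative kernel: logical thread i stores 2*in[i] in a
// shared scratchpad, synchronizes, then reads its right neighbor's slot.
// Each worker executes `coarsening` consecutive logical threads.
std::vector<int> run_collaborative_kernel(const std::vector<int>& in,
                                          std::size_t coarsening) {
    const std::size_t n = in.size();
    assert(coarsening > 0 && n % coarsening == 0);
    const std::size_t workers = n / coarsening;

    std::vector<int> scratch(n), out(n);
    SpinBarrier barrier(static_cast<int>(workers));

    auto body = [&](std::size_t w) {
        const std::size_t begin = w * coarsening;
        const std::size_t end = begin + coarsening;
        for (std::size_t i = begin; i < end; ++i) {
            scratch[i] = 2 * in[i];           // phase 1: fill scratchpad
        }
        barrier.wait();                       // one barrier per worker
        for (std::size_t i = begin; i < end; ++i) {
            out[i] = scratch[(i + 1) % n];    // phase 2: read neighbor
        }
    };

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back(body, w);
    }
    for (auto& t : pool) {
        t.join();
    }
    return out;
}
```

Calling `run_collaborative_kernel(in, 1)` uses one worker per logical thread, while `run_collaborative_kernel(in, 4)` produces the same output with a quarter of the barrier participants. This is the effect the abstract attributes to coarsening in collaborative kernels, although this sketch does not measure performance.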

