Progression of MPI Non-blocking Collective Operations Using Hyper-Threading

Masahiro Miwa, Kohta Nakashima
{"title":"Progression of MPI Non-blocking Collective Operations Using Hyper-Threading","authors":"Masahiro Miwa, Kohta Nakashima","doi":"10.1109/PDP.2015.68","DOIUrl":null,"url":null,"abstract":"MPI non-blocking collective operations offer a high level interface to MPI library users, and potentially allow communication to be overlapped with calculation. Progression, which controls communications running in the background of the calculation, is the key factor to achieve an efficient overlap. The most commonly used progression method is manual progression, in which a progression function is called in the main calculation. In manual progression, MPI library users have to estimate the communication timing to maximize the overlap effect and thus to manage the complex communication optimization. An alternative approach for progression is the use of separate communication threads. By using communication threads, communication calculation overlap can be achieved simply. However, context switches between the calculation thread and the communication thread cause lower performance in the frequent case where all cores are used for calculation. In this paper, we propose a novel threaded progression method using Hyper-Threading to maximize the overlap effect of non-blocking collective operations. We apply MONITOR/MWAIT instructions to the communication thread on Hyper-Threading so as not to degrade the calculation thread due to shared core resource conflict. Evaluation on 8-node Infini Band connected IA server clustered systems confirmed that the latency is suppressed to a small level and that our approach has an advantage over manual progression in terms of communication-calculation overlap. Using a real application of CG benchmark, our method achieved 32% reduction in execution time compared to using blocking collective operation, and that is nearly perfect overlap. 
Although manual progression also achieved perfect overlap, our method has the advantage that no communication timing tuning is required for each application.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"77 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDP.2015.68","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

MPI non-blocking collective operations offer a high-level interface to MPI library users and potentially allow communication to be overlapped with calculation. Progression, which drives communication in the background of the calculation, is the key factor in achieving an efficient overlap. The most commonly used progression method is manual progression, in which a progression function is called from within the main calculation. With manual progression, MPI library users have to estimate the communication timing to maximize the overlap effect, and thus must manage a complex communication optimization. An alternative approach to progression is the use of separate communication threads. With communication threads, communication-calculation overlap can be achieved simply. However, context switches between the calculation thread and the communication thread degrade performance in the common case where all cores are used for calculation. In this paper, we propose a novel threaded progression method using Hyper-Threading to maximize the overlap effect of non-blocking collective operations. We apply MONITOR/MWAIT instructions to the communication thread running on a Hyper-Thread so as not to degrade the calculation thread through shared-core resource contention. Evaluation on an 8-node InfiniBand-connected IA server cluster confirmed that the added latency remains small and that our approach has an advantage over manual progression in terms of communication-calculation overlap. Using the CG benchmark as a real application, our method achieved a 32% reduction in execution time compared to blocking collective operations, which corresponds to nearly perfect overlap. Although manual progression also achieved perfect overlap, our method has the advantage that no per-application tuning of communication timing is required.