Analyzing the Effects of Multicore Architectures and On-Host Communication Characteristics on Collective Communications

2011 40th International Conference on Parallel Processing Workshops | Pub Date: 2011-09-13 | DOI: 10.1109/ICPPW.2011.15

Joshua Ladd, Manjunath Gorentla Venkata, R. Graham, Pavel Shamis


Citations: 3

Abstract


Shared-memory optimizations for blocking collective communications on multi-core and distributed systems have previously been shown to improve the performance of these operations. However, such studies have tended to neglect the architecture of the multi-core node and the characteristics of shared-memory communication. In this paper, we examine in detail how the on-node memory and cache hierarchy, and the optimization opportunities it provides, affect the performance of the barrier and broadcast collective operations. The primary contribution of this paper is a demonstration of how exploiting the local memory hierarchy impacts the performance of these operations in a distributed-system context. Our results show that factors such as the location of the communicating processes within the node, the number of communicating processes, the amount of shared-memory communication, and the amount of inter-socket (Central Processing Unit (CPU) socket) communication affect both latency-sensitive and bandwidth-sensitive collective operations. The effect of these parameters varies with the type of operation and is coupled to the architecture of the shared-memory node and the scale of the collective operation. For 3,072 processes on Jaguar, considering the socket layout in the collective communication algorithm improves large-data MPI_Bcast() performance by 50% and MPI_Barrier() performance by 40% compared to neglecting this architectural feature. For a 512-process job on Smoky, the corresponding improvements are 38% and an order of magnitude, respectively. Accounting for the shared-memory hierarchy does not noticeably affect small-data broadcast performance on Jaguar; on Smoky, the corresponding improvement is 3%.
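To make the role of inter-socket communication concrete, the following is a minimal sketch, not the authors' implementation. It compares a flat binomial-tree broadcast with a two-level, socket-aware broadcast on a hypothetical node, and counts how many messages cross a socket boundary in each case. The node shape (2 sockets with 4 cores each), the block rank placement, and all function names are assumptions made for illustration only.

```python
def binomial_tree_edges(ranks):
    """Return (sender, receiver) pairs of a binomial-tree broadcast rooted at ranks[0]."""
    edges = []
    step = 1
    while step < len(ranks):
        # In each round, every position that already holds the data
        # sends to the position `step` places further along.
        for i in range(step):
            j = i + step
            if j < len(ranks):
                edges.append((ranks[i], ranks[j]))
        step *= 2
    return edges


def socket_aware_edges(ranks, socket_of):
    """Two-level broadcast: root -> one leader per socket, then within each socket."""
    groups = {}
    for r in ranks:
        groups.setdefault(socket_of(r), []).append(r)
    # The first member of each socket acts as its leader. Because dicts keep
    # insertion order, the root (ranks[0]) is the first leader.
    leaders = [members[0] for members in groups.values()]
    edges = binomial_tree_edges(leaders)
    for members in groups.values():
        edges.extend(binomial_tree_edges(members))
    return edges


def count_cross_socket(edges, socket_of):
    """Count messages whose sender and receiver sit on different sockets."""
    return sum(1 for s, r in edges if socket_of(s) != socket_of(r))


# Hypothetical node: 8 ranks, block placement, 4 cores per socket (assumption).
CORES_PER_SOCKET = 4


def block_socket(rank):
    return rank // CORES_PER_SOCKET


if __name__ == "__main__":
    ranks = list(range(8))
    flat = binomial_tree_edges(ranks)
    aware = socket_aware_edges(ranks, block_socket)
    print("flat  cross-socket messages:", count_cross_socket(flat, block_socket))
    print("aware cross-socket messages:", count_cross_socket(aware, block_socket))
```

Both schedules send the same total number of messages (one per non-root rank), but in this toy layout the socket-aware schedule confines all but one of them to a single socket. The model only counts messages; it does not predict the latency or bandwidth figures reported in the abstract.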
