Locality-Aware Parallel Process Mapping for Multi-core HPC Systems

2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.59

Joshua Hursey, J. Squyres, T. Dontje


Citations: 29

Abstract

High Performance Computing (HPC) systems are composed of servers containing an ever-increasing number of cores. With such high processor core counts, non-uniform memory access (NUMA) architectures are almost universally used to reduce inter-processor and memory communication bottlenecks by distributing processors and memory throughout a server-internal networking topology. Application studies have shown that tuning the placement of processes within a server's NUMA networking topology to suit the application can have a dramatic impact on performance. The performance implications are magnified when running a parallel job across multiple server nodes, especially with large-scale HPC applications. This paper presents the Locality-Aware Mapping Algorithm (LAMA) for distributing the individual processes of a parallel application across processing resources in an HPC system, paying particular attention to the internal server NUMA topologies. The algorithm supports both homogeneous and heterogeneous hardware systems, and dynamically adapts to the available hardware and the user-specified process layout at run-time. As implemented in Open MPI, the LAMA provides 362,880 mapping permutations and can naturally scale out to additional hardware resources as they become available in future architectures.
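To make the idea of a user-specified process layout concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that a layout is an ordering of hardware levels and that the first level listed varies fastest as ranks are assigned. The nine level names are hypothetical, since the abstract does not list them; they were chosen only because 9! = 362,880 matches the quoted number of mapping permutations. The helper names `count_orderings` and `map_processes` are likewise invented for illustration.

```python
from itertools import product
from math import factorial

# Hypothetical level names (an assumption; the abstract does not enumerate them).
LEVELS = ("node", "board", "socket", "numa", "l3", "l2", "l1", "core", "hwthread")


def count_orderings(levels):
    """Number of distinct orderings of the given levels (n factorial)."""
    return factorial(len(levels))


def map_processes(counts, order, nprocs):
    """Place ranks 0..nprocs-1 on resource coordinates.

    counts: how many children exist at each level (homogeneous system assumed).
    order:  sequence of level names; the first entry varies fastest.
    Returns one dict of {level: index} per rank.
    """
    # itertools.product varies its LAST iterable fastest, so feed the
    # levels in reverse to make the first entry of `order` vary fastest.
    ranges = [range(counts[level]) for level in reversed(order)]
    placements = []
    for coords in product(*ranges):
        if len(placements) == nprocs:
            break
        placements.append(dict(zip(reversed(order), coords)))
    return placements


if __name__ == "__main__":
    print(count_orderings(LEVELS))
    layout = ("socket", "core", "node")
    for rank, place in enumerate(map_processes(
            {"node": 2, "socket": 2, "core": 2}, layout, 4)):
        print(rank, place)
```

With the layout `("socket", "core", "node")`, consecutive ranks alternate between sockets before moving to the next core, and a second node is used only once the first is full. The sketch ignores heterogeneous nodes, unavailable resources, and oversubscription, all of which the abstract says the real algorithm accounts for at run-time.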
