Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms

S. Alam, R. Barrett, J. Kuehn, Steve Poole
{"title":"大规模分布式存储平台上分层MPI实现的性能表征","authors":"S. Alam, R. Barrett, J. Kuehn, Steve Poole","doi":"10.1109/ICPP.2009.51","DOIUrl":null,"url":null,"abstract":"The building blocks of emerging Petascale massively parallel processing (MPP) systems are multi-core processors with four or more cores as a single processing element and a customized network interface. The resulting memory and communication hierarchy of these platforms are now exposed to application developers and end users by creating a hierarchical or multi-core aware message-passing (MPI) programming interface and by providing a handful of runtime, tunable parameters that allows mapping and control of MPI tasks and message handling. We characterize performance of MPI communication patterns and present strategies for optimizing applications performance on Cray XT series systems that are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in its memory and network subsystems, which could influence production-level applications performance. We demonstrate that MPI micro-benchmarks could mislead an application developer or end user since these benchmarks often do not expose the interplay between memory allocation and usage in the user space, which depends on the number of tasks or cores and workload characteristics. Our studies show performance improvements compared to the default options for our target scientific benchmarks and production-level applications.","PeriodicalId":169408,"journal":{"name":"2009 International Conference on Parallel Processing","volume":"61 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Performance Characterization of a Hierarchical MPI Implementation on Large-scale Distributed-memory Platforms\",\"authors\":\"S. Alam, R. Barrett, J. Kuehn, Steve Poole\",\"doi\":\"10.1109/ICPP.2009.51\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The building blocks of emerging Petascale massively parallel processing (MPP) systems are multi-core processors with four or more cores as a single processing element and a customized network interface. The resulting memory and communication hierarchy of these platforms are now exposed to application developers and end users by creating a hierarchical or multi-core aware message-passing (MPI) programming interface and by providing a handful of runtime, tunable parameters that allows mapping and control of MPI tasks and message handling. We characterize performance of MPI communication patterns and present strategies for optimizing applications performance on Cray XT series systems that are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in its memory and network subsystems, which could influence production-level applications performance. We demonstrate that MPI micro-benchmarks could mislead an application developer or end user since these benchmarks often do not expose the interplay between memory allocation and usage in the user space, which depends on the number of tasks or cores and workload characteristics. 
Our studies show performance improvements compared to the default options for our target scientific benchmarks and production-level applications.\",\"PeriodicalId\":169408,\"journal\":{\"name\":\"2009 International Conference on Parallel Processing\",\"volume\":\"61 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPP.2009.51\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2009.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The building blocks of emerging Petascale massively parallel processing (MPP) systems are multi-core processors, with four or more cores forming a single processing element, together with a customized network interface. The resulting memory and communication hierarchy of these platforms is now exposed to application developers and end users through a hierarchical, multi-core-aware message-passing (MPI) programming interface and a handful of runtime-tunable parameters that allow mapping and control of MPI tasks and message handling. We characterize the performance of MPI communication patterns and present strategies for optimizing application performance on Cray XT series systems, which are composed of contemporary AMD processors and a proprietary network infrastructure. We highlight dependencies in the memory and network subsystems that could influence the performance of production-level applications. We demonstrate that MPI micro-benchmarks can mislead an application developer or end user, since these benchmarks often do not expose the interplay between memory allocation and usage in user space, which depends on the number of tasks or cores and on workload characteristics. Our studies show performance improvements over the default options for our target scientific benchmarks and production-level applications.
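The micro-benchmarks the abstract refers to are typically point-to-point measurements such as ping-pong latency and bandwidth tests. The sketch below is a minimal illustration of that kind of benchmark; it is not the authors' code, and the iteration count, message-size sweep, and timing via MPI_Wtime are illustrative assumptions. As the paper argues, results from such a benchmark in isolation may not reflect the memory-allocation and task-placement effects seen by full applications; on Cray XT systems, rank placement is usually controlled through separate runtime tunables (for example, environment variables of the Cray MPI runtime), which are not shown here.

/* Minimal MPI ping-pong latency sketch (illustrative, not the paper's benchmark). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int iters = 1000;                      /* assumed iteration count */
    for (int bytes = 1; bytes <= (1 << 20); bytes <<= 1) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                /* rank 0 sends, then waits for the echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                /* rank 1 echoes the message back */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%8d bytes  %10.2f us one-way latency\n",
                   bytes, 1e6 * (t1 - t0) / (2.0 * iters));
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Whether ranks 0 and 1 land on the same multi-core socket or on different nodes changes the measured latency substantially, which is one reason such micro-benchmark numbers alone can mislead when extrapolated to production workloads.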