大规模应用程序特性描述:从开发用于高性能计算的分布式开放社区运行时系统中获得的经验教训

Joshua Landwehr, Joshua D. Suetterlein, A. Márquez, J. Manzano, G. Gao
{"title":"大规模应用程序特性描述:从开发用于高性能计算的分布式开放社区运行时系统中获得的经验教训","authors":"Joshua Landwehr, Joshua D. Suetterlein, A. Márquez, J. Manzano, G. Gao","doi":"10.1145/2903150.2903166","DOIUrl":null,"url":null,"abstract":"Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show a great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundant amount of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control. Nevertheless, current applications have been written to exploit old paradigms (such as Communicating Sequential Processor and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to these new paradigms. As part of the applications' porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand the application performance bottlenecks and how to resolve them. This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime, using key HPC kernels as its vehicle. This study has the following contributions: one of the first high performance, fine grain, distributed memory runtime system implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running on both intra and inter node environments. Running on a general purpose cluster, we have found up to 1635x relative speed-up for a parallel tiled Cholesky Kernels on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.","PeriodicalId":226569,"journal":{"name":"Proceedings of the ACM International Conference on Computing Frontiers","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing\",\"authors\":\"Joshua Landwehr, Joshua D. Suetterlein, A. Márquez, J. Manzano, G. Gao\",\"doi\":\"10.1145/2903150.2903166\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show a great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundant amount of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control. Nevertheless, current applications have been written to exploit old paradigms (such as Communicating Sequential Processor and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to these new paradigms. As part of the applications' porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand the application performance bottlenecks and how to resolve them. This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime, using key HPC kernels as its vehicle. This study has the following contributions: one of the first high performance, fine grain, distributed memory runtime system implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running on both intra and inter node environments. Running on a general purpose cluster, we have found up to 1635x relative speed-up for a parallel tiled Cholesky Kernels on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.\",\"PeriodicalId\":226569,\"journal\":{\"name\":\"Proceedings of the ACM International Conference on Computing Frontiers\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM International Conference on Computing Frontiers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2903150.2903166\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM International Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2903150.2903166","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

摘要

自2012年以来,美国能源部的X-Stack计划一直在为Exascale系统软件开发解决方案,包括运行时系统、编程模型、语言、编译器和工具,以满足关键的性能和功率要求。细粒度编程模型和运行时系统显示出有效利用底层硬件的巨大潜力。因此,它们对于许多X-Stack工作都是必不可少的。大量的小任务可以更好地利用当前和未来机器上可用的巨大并行性。此外,由于状态和控制的减少,精细的任务可以更快地恢复并更好地适应。尽管如此,当前的应用程序已经被编写为利用旧的范例(例如通信顺序处理器和批量同步并行处理)。为了充分利用这些新系统的优势,应用程序需要适应这些新范例。作为应用程序移植过程的一部分,需要对应用程序特性和运行时特性进行深入的特性研究,以充分了解应用程序性能瓶颈以及如何解决这些瓶颈。本文介绍了一种新型高性能运行时系统的特性研究,称为开放社区运行时,使用关键的HPC内核作为其载体。本研究有以下贡献:第一个实现OCR标准的高性能、细粒度、分布式内存运行时系统(版本0.99a);以及在节点内和节点间环境中运行时原语方面对关键HPC内核的表征研究。在一个通用集群上运行,我们发现在128个节点上并行平铺Cholesky内核(每个节点有16个内核)的相对速度提高了1635倍,在128个节点上并行平铺Smith-Waterman内核(每个节点有30个内核)的相对速度提高了1864倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing
Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show a great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundant amount of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control. Nevertheless, current applications have been written to exploit old paradigms (such as Communicating Sequential Processor and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to these new paradigms. As part of the applications' porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand the application performance bottlenecks and how to resolve them. This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime, using key HPC kernels as its vehicle. This study has the following contributions: one of the first high performance, fine grain, distributed memory runtime system implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running on both intra and inter node environments. Running on a general purpose cluster, we have found up to 1635x relative speed-up for a parallel tiled Cholesky Kernels on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信