大规模应用程序特性描述:从开发用于高性能计算的分布式开放社区运行时系统中获得的经验教训

Proceedings of the ACM International Conference on Computing Frontiers Pub Date : 2016-05-16 DOI:10.1145/2903150.2903166

Joshua Landwehr, Joshua D. Suetterlein, A. Márquez, J. Manzano, G. Gao

{"title":"大规模应用程序特性描述:从开发用于高性能计算的分布式开放社区运行时系统中获得的经验教训","authors":"Joshua Landwehr, Joshua D. Suetterlein, A. Márquez, J. Manzano, G. Gao","doi":"10.1145/2903150.2903166","DOIUrl":null,"url":null,"abstract":"Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show a great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundant amount of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control. Nevertheless, current applications have been written to exploit old paradigms (such as Communicating Sequential Processor and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to these new paradigms. As part of the applications' porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand the application performance bottlenecks and how to resolve them. This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime, using key HPC kernels as its vehicle. This study has the following contributions: one of the first high performance, fine grain, distributed memory runtime system implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running on both intra and inter node environments. Running on a general purpose cluster, we have found up to 1635x relative speed-up for a parallel tiled Cholesky Kernels on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.","PeriodicalId":226569,"journal":{"name":"Proceedings of the ACM International Conference on Computing Frontiers","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing\",\"authors\":\"Joshua Landwehr, Joshua D. Suetterlein, A. Márquez, J. Manzano, G. Gao\",\"doi\":\"10.1145/2903150.2903166\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show a great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundant amount of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control. Nevertheless, current applications have been written to exploit old paradigms (such as Communicating Sequential Processor and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to these new paradigms. As part of the applications' porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand the application performance bottlenecks and how to resolve them. This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime, using key HPC kernels as its vehicle. This study has the following contributions: one of the first high performance, fine grain, distributed memory runtime system implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running on both intra and inter node environments. Running on a general purpose cluster, we have found up to 1635x relative speed-up for a parallel tiled Cholesky Kernels on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.\",\"PeriodicalId\":226569,\"journal\":{\"name\":\"Proceedings of the ACM International Conference on Computing Frontiers\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM International Conference on Computing Frontiers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2903150.2903166\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM International Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2903150.2903166","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

自2012年以来，美国能源部的X-Stack计划一直在为Exascale系统软件开发解决方案，包括运行时系统、编程模型、语言、编译器和工具，以满足关键的性能和功率要求。细粒度编程模型和运行时系统显示出有效利用底层硬件的巨大潜力。因此，它们对于许多X-Stack工作都是必不可少的。大量的小任务可以更好地利用当前和未来机器上可用的巨大并行性。此外，由于状态和控制的减少，精细的任务可以更快地恢复并更好地适应。尽管如此，当前的应用程序已经被编写为利用旧的范例(例如通信顺序处理器和批量同步并行处理)。为了充分利用这些新系统的优势，应用程序需要适应这些新范例。作为应用程序移植过程的一部分，需要对应用程序特性和运行时特性进行深入的特性研究，以充分了解应用程序性能瓶颈以及如何解决这些瓶颈。本文介绍了一种新型高性能运行时系统的特性研究，称为开放社区运行时，使用关键的HPC内核作为其载体。本研究有以下贡献:第一个实现OCR标准的高性能、细粒度、分布式内存运行时系统(版本0.99a);以及在节点内和节点间环境中运行时原语方面对关键HPC内核的表征研究。在一个通用集群上运行，我们发现在128个节点上并行平铺Cholesky内核(每个节点有16个内核)的相对速度提高了1635倍，在128个节点上并行平铺Smith-Waterman内核(每个节点有30个内核)的相对速度提高了1864倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Application characterization at scale: lessons learned from developing a distributed open community runtime system for high performance computing

Since 2012, the U.S. Department of Energy's X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show a great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundant amount of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control. Nevertheless, current applications have been written to exploit old paradigms (such as Communicating Sequential Processor and Bulk Synchronous Parallel processing). To fully utilize the advantages of these new systems, applications need to be adapted to these new paradigms. As part of the applications' porting process, in-depth characterization studies, focused on both application characteristics and runtime features, need to take place to fully understand the application performance bottlenecks and how to resolve them. This paper presents a characterization study for a novel high performance runtime system, called the Open Community Runtime, using key HPC kernels as its vehicle. This study has the following contributions: one of the first high performance, fine grain, distributed memory runtime system implementing the OCR standard (version 0.99a); and a characterization study of key HPC kernels in terms of runtime primitives running on both intra and inter node environments. Running on a general purpose cluster, we have found up to 1635x relative speed-up for a parallel tiled Cholesky Kernels on 128 nodes with 16 cores each and a 1864x relative speed-up for a parallel tiled Smith-Waterman kernel on 128 nodes with 30 cores.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the ACM International Conference on Computing Frontiers

自引率

0.00%

发文量