Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing

2010 IEEE International Conference on Cluster Computing Pub Date : 2010-09-20 DOI:10.1109/CLUSTER.2010.12

Canqun Yang, Feng Wang, Yunfei Du, Juan Chen, Jie Liu, Huizhan Yi, Kai Lu


Citations: 83

Abstract

In this paper, we describe our experiment in developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer system and the largest GPU-accelerated system attempted to date. An adaptive optimization framework is presented that balances the workload distribution across the GPUs and CPUs with negligible runtime overhead, resulting in better performance than static or training-based partitioning methods. The CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack implementation we optimized using the adaptive optimization framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained using the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
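The abstract does not spell out how the adaptive framework chooses the CPU/GPU split, so the following is only a minimal sketch of one common way to do runtime load balancing, not the authors' algorithm. It assumes each device's throughput can be measured from the previous step and sets the next GPU share proportional to the measured rates. The names `update_split` and `simulate`, and the rate values, are hypothetical and chosen for illustration.

```python
def update_split(gpu_fraction, gpu_time, cpu_time, work):
    """Return a new GPU share so that both devices would finish together.

    Assumption (not from the paper): throughput is estimated from the
    time each device took on its share of the previous step.
    Requires 0 < gpu_fraction < 1 so that both measured times are nonzero.
    """
    gpu_rate = gpu_fraction * work / gpu_time          # work units per second on GPU
    cpu_rate = (1.0 - gpu_fraction) * work / cpu_time  # work units per second on CPU
    return gpu_rate / (gpu_rate + cpu_rate)


def simulate(true_gpu_rate, true_cpu_rate, work=1000.0, steps=5, start=0.5):
    """Run the update rule against fixed, made-up device rates."""
    fraction = start
    history = [fraction]
    for _ in range(steps):
        # Stand-in for real timing measurements of one iteration.
        gpu_time = fraction * work / true_gpu_rate
        cpu_time = (1.0 - fraction) * work / true_cpu_rate
        fraction = update_split(fraction, gpu_time, cpu_time, work)
        history.append(fraction)
    return history


if __name__ == "__main__":
    # Hypothetical rates: the GPU is three times as fast as the CPU.
    print(simulate(true_gpu_rate=3.0, true_cpu_rate=1.0))
```

With constant rates the split settles after a single update; on real hardware the rates vary with problem size, which is why the split would be re-estimated at every step rather than fixed once in advance.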


