迈向节能计算的一步:在CPU-GPU上重新设计流体力学应用程序

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI:10.1109/IPDPS.2014.103

Tingxing Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, J. Dongarra

{"title":"迈向节能计算的一步:在CPU-GPU上重新设计流体力学应用程序","authors":"Tingxing Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, J. Dongarra","doi":"10.1109/IPDPS.2014.103","DOIUrl":null,"url":null,"abstract":"Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs BLAST, Dobrev. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor's library routines substantially in performance. We proposed an auto tuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5× overall speedup and 1.42× energy efficiency (green up) using 4th order (Q_4) finite elements, and a 1.9× speedup and 1.27× green up using 2nd order (Q2) finite elements.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"79 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"60","resultStr":"{\"title\":\"A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU\",\"authors\":\"Tingxing Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, J. Dongarra\",\"doi\":\"10.1109/IPDPS.2014.103\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs BLAST, Dobrev. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor's library routines substantially in performance. We proposed an auto tuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5× overall speedup and 1.42× energy efficiency (green up) using 4th order (Q_4) finite elements, and a 1.9× speedup and 1.27× green up using 2nd order (Q2) finite elements.\",\"PeriodicalId\":309291,\"journal\":{\"name\":\"2014 IEEE 28th International Parallel and Distributed Processing Symposium\",\"volume\":\"79 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-05-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"60\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE 28th International Parallel and Distributed Processing Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS.2014.103\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2014.103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 60

摘要

在高性能计算中，功率和能耗日益受到关注。与多核cpu相比，gpu的每瓦特性能要好得多。在本文中，我们讨论了重新设计BLAST中计算最密集的部分的努力，BLAST是一个使用gpu BLAST, Dobrev解决高阶有限元可压缩流体动力学方程的应用程序。为了充分利用gpu的硬件并行性，实现高性能，我们实现了自定义线性代数内核。我们通过利用内存层次结构对CUDA内核进行了密集优化，在性能上大大超过了供应商的库例程。我们提出了一种自动调谐技术，使我们的CUDA内核适应有限元法的阶数。与之前的基础实现相比，我们的重新设计和优化在两个方面降低了GPU的能耗:解决方案的时间减少了60%，所需功率减少了10%。与仅使用cpu的解决方案相比，我们的GPU加速BLAST使用四阶(Q_4)有限元获得了2.5倍的总体加速和1.42倍的能效(绿色向上)，使用二阶(Q2)有限元获得了1.9倍的加速和1.27倍的绿色向上。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU

Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs BLAST, Dobrev. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor's library routines substantially in performance. We proposed an auto tuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5× overall speedup and 1.42× energy efficiency (green up) using 4th order (Q_4) finite elements, and a 1.9× speedup and 1.27× green up using 2nd order (Q2) finite elements.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 IEEE 28th International Parallel and Distributed Processing Symposium

自引率

0.00%

发文量