A dynamic schema to increase performance in many-core architectures through percolation operations

E. Garcia, Daniel A. Orozco, R. Khan, Ioannis E. Venetis, Kelly Livingston, G. Gao
{"title":"A dynamic schema to increase performance in many-core architectures through percolation operations","authors":"E. Garcia, Daniel A. Orozco, R. Khan, Ioannis E. Venetis, Kelly Livingston, G. Gao","doi":"10.1109/HiPC.2013.6799134","DOIUrl":null,"url":null,"abstract":"Optimization of parallel applications under new many-core architectures is challenging even for regular applications. Successful strategies inherited from previous generations of parallel or serial architectures just return incremental gains in performance and further optimization and tuning are required. We argue that conservative static optimizations are not the best fit for modern many-core architectures. The limited advantages of static techniques come from the new scenarios present in many-cores: Plenty of thread units sharing several resources under different coordination mechanisms. We point out that scheduling and data movement across the memory hierarchy are extremely important in the performance of applications. In particular, we found that scheduling of data movement operations significantly impact performance. To overcome those difficulties, we took advantage of the fine-grain synchronization primitives of many-cores to define percolation operations in order to schedule data movement properly. In addition, we have fused percolation operations with dynamic scheduling into a dynamic percolation approach. We used Dense Matrix Multiplication on a modern manycore to illustrate how our proposed techniques are able to increase the performance under these new environments. In our study on the IBM Cyclops-64, we raised the performance from 44 GFLOPS (out of 80 GFLOPS possible) to 70.0 GFLOPS (operands in on-chip memory) and 65.6 GFLOPS (operands in off-chip memory). The success of our approach also resulted in excellent power efficiency: 1.09 GFLOPS/Watt and 993 MFLOPS/Watt when the input data resided in on-chip and off-chip memory respectively.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"20th Annual International Conference on High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC.2013.6799134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Optimization of parallel applications on new many-core architectures is challenging even for regular applications. Successful strategies inherited from previous generations of parallel or serial architectures yield only incremental performance gains, and further optimization and tuning are required. We argue that conservative static optimizations are not the best fit for modern many-core architectures. The limited advantage of static techniques stems from the new scenario present in many-cores: many thread units sharing several resources under different coordination mechanisms. We point out that scheduling and data movement across the memory hierarchy are extremely important to application performance. In particular, we found that the scheduling of data-movement operations significantly impacts performance. To overcome these difficulties, we took advantage of the fine-grain synchronization primitives of many-cores to define percolation operations that schedule data movement properly. In addition, we fused percolation operations with dynamic scheduling into a dynamic percolation approach. We used Dense Matrix Multiplication on a modern many-core to illustrate how our proposed techniques increase performance in these new environments. In our study on the IBM Cyclops-64, we raised performance from 44 GFLOPS (out of 80 GFLOPS possible) to 70.0 GFLOPS (operands in on-chip memory) and 65.6 GFLOPS (operands in off-chip memory). The success of our approach also resulted in excellent power efficiency: 1.09 GFLOPS/Watt and 993 MFLOPS/Watt when the input data resided in on-chip and off-chip memory, respectively.
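
The abstract does not include code, so the following is only a minimal sketch of the idea of fusing percolation operations with dynamic scheduling, written in portable C11 with pthreads rather than the Cyclops-64 toolchain. Every name here (percolate, compute, worker, next_tile) and every parameter is an illustrative assumption, not the paper's implementation: the on-chip scratchpad is approximated with thread-private buffers, and the dynamic scheduler with an atomic tile counter.

/*
 * Illustrative sketch (assumed, not from the paper): tiled matrix
 * multiplication where each worker dynamically claims an output tile,
 * "percolates" the operand tiles it needs into a private buffer that
 * stands in for on-chip memory, and then computes on that buffer.
 * Build with: cc -std=c11 -O2 -pthread dynamic_percolation.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N        512          /* matrix order            */
#define TILE     64           /* tile edge               */
#define NTILES   (N / TILE)   /* tiles per dimension     */
#define NTHREADS 4

static double A[N][N], B[N][N], C[N][N];   /* "off-chip" operands */
static atomic_int next_tile = 0;           /* dynamic scheduler   */

/* Percolation step: pull the operand tiles needed by C(ti,tj) for
 * step tk into a thread-private buffer (stand-in for on-chip SRAM). */
static void percolate(double *a, double *b, int ti, int tj, int tk)
{
    for (int i = 0; i < TILE; i++) {
        memcpy(&a[i * TILE], &A[ti * TILE + i][tk * TILE], TILE * sizeof(double));
        memcpy(&b[i * TILE], &B[tk * TILE + i][tj * TILE], TILE * sizeof(double));
    }
}

/* Compute step: tile multiply-accumulate on the "on-chip" copies. */
static void compute(const double *a, const double *b, int ti, int tj)
{
    for (int i = 0; i < TILE; i++)
        for (int k = 0; k < TILE; k++)
            for (int j = 0; j < TILE; j++)
                C[ti * TILE + i][tj * TILE + j] += a[i * TILE + k] * b[k * TILE + j];
}

/* Worker: claim the next C tile at run time, then alternate
 * percolation and computation for that tile. The run-time claim is
 * the "dynamic" part of the dynamic percolation idea.              */
static void *worker(void *arg)
{
    (void)arg;
    double *a = malloc(TILE * TILE * sizeof(double));
    double *b = malloc(TILE * TILE * sizeof(double));

    int t;
    while ((t = atomic_fetch_add(&next_tile, 1)) < NTILES * NTILES) {
        int ti = t / NTILES, tj = t % NTILES;
        for (int tk = 0; tk < NTILES; tk++) {
            percolate(a, b, ti, tj, tk);
            compute(a, b, ti, tj);
        }
    }
    free(a);
    free(b);
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    pthread_t th[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(th[i], NULL);

    /* Each entry of C should equal 2*N for these inputs. */
    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}

Because output tiles are disjoint and each is claimed by exactly one thread, no locking is needed around C; the single atomic counter is the only shared scheduling state, which is roughly the load-balancing behavior the abstract attributes to dynamic scheduling, while the explicit copy-then-compute loop mirrors the role of percolation operations in placing operands close to the thread units before computation.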