Auto-Tuning Stencil Computations on Multicore and Accelerators

K. Datta, Samuel Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick
{"title":"Auto-Tuning Stencil Computations on Multicore and Accelerators","authors":"K. Datta, Samuel Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick","doi":"10.1201/B10376-18","DOIUrl":null,"url":null,"abstract":"Author(s): Datta, K; Williams, S; Volkov, V; Carter, J; Oliker, L; Shalf, J; Yelick, K | Editor(s): Kurzak, J; Bader, D; Dongarra, J | Abstract: © 2011 by Taylor and Francis Group, LLC. The recent transformation from an environment where gains in computational performance came from increasing clock frequency and other hardware engineering innovations, to an environment where gains are realized through the deployment of ever increasing numbers of modest performance cores has profoundly changed the landscape of scientific application programming. This exponential increase in core count represents both an opportunity and a challenge: access to petascale simulation capabilities and beyond will require that this concurrency be efficiently exploited. The problem for application programmers is further compounded by the diversity of multicore architectures that are now emerging [4]. From relatively complex out-of-order CPUs with complex cache structures, to relatively simple cores that support hardware multithreading, to chips that require explicit use of software controlled memory, designing optimal code for these different platforms represents a serious impediment. An emerging solution to this problem is auto-tuning: the automatic generation of many versions of a code kernel that incorporate various tuning strategies, and the benchmarking of these to select the highest performing version. Typical tuning strategies might include: maximizing incore performance with loop unrolling and restructuring; maximizing memory bandwidth by exploiting non-uniform memory access (NUMA), engaging prefetch by directives; and minimizing memory traffic by cache blocking or array padding. 
Often a key parameter is associated with each tuning strategy (e.g., the amount of loop unrolling or the cache blocking factor), and these parameters must be explored in addition to the layering of the basic strategies themselves.","PeriodicalId":411793,"journal":{"name":"Scientific Computing with Multicore and Accelerators","volume":"245 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Computing with Multicore and Accelerators","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1201/B10376-18","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8

Abstract

Editor(s): J. Kurzak, D. Bader, J. Dongarra. © 2011 by Taylor and Francis Group, LLC.

The recent shift from an environment in which gains in computational performance came from increasing clock frequency and other hardware engineering innovations, to one in which gains are realized by deploying ever larger numbers of modest-performance cores, has profoundly changed the landscape of scientific application programming. This exponential increase in core count represents both an opportunity and a challenge: reaching petascale simulation capability and beyond will require that this concurrency be exploited efficiently. The problem for application programmers is further compounded by the diversity of multicore architectures now emerging [4]. From relatively complex out-of-order CPUs with deep cache hierarchies, to relatively simple cores that support hardware multithreading, to chips that require explicit use of software-controlled memory, designing optimal code for these different platforms is a serious impediment. An emerging solution to this problem is auto-tuning: automatically generating many versions of a code kernel that incorporate various tuning strategies, then benchmarking them to select the highest-performing version. Typical tuning strategies include maximizing in-core performance through loop unrolling and restructuring; maximizing memory bandwidth by exploiting non-uniform memory access (NUMA) and engaging prefetch via directives; and minimizing memory traffic through cache blocking or array padding. Often a key parameter is associated with each tuning strategy (e.g., the loop-unrolling depth or the cache-blocking factor), and these parameters must be explored in addition to the layering of the basic strategies themselves.
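The generate-and-benchmark loop the abstract describes can be sketched in miniature. The code below is a hypothetical illustration, not the authors' framework: it applies one tuning strategy from the abstract (cache blocking of a 2D Jacobi stencil), times each candidate blocking factor, and keeps the fastest, which is the essence of exploring a tuning parameter by empirical search.

```python
# Minimal auto-tuning sketch (hypothetical example, not the authors' code):
# benchmark several cache-block sizes for a 2D Jacobi stencil and keep the
# fastest, mirroring the generate-and-benchmark loop described above.
import time


def jacobi_blocked(src, bi, bj):
    """One Jacobi sweep over the interior of `src`, tiled into bi x bj blocks.

    Blocking changes only the traversal order (and hence cache behavior),
    not the numerical result, since every update reads from `src` only.
    """
    n = len(src)
    dst = [row[:] for row in src]
    for ii in range(1, n - 1, bi):
        for jj in range(1, n - 1, bj):
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, n - 1)):
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] +
                                        src[i][j - 1] + src[i][j + 1])
    return dst


def autotune(n=64, candidates=((8, 8), (16, 16), (32, 32))):
    """Time each candidate blocking factor and return the fastest one."""
    grid = [[float(i * n + j) for j in range(n)] for i in range(n)]
    best, best_time = None, float("inf")
    for bi, bj in candidates:
        t0 = time.perf_counter()
        jacobi_blocked(grid, bi, bj)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best, best_time = (bi, bj), elapsed
    return best


if __name__ == "__main__":
    print("fastest blocking factor:", autotune())
```

A production tuner would additionally layer the other strategies mentioned (unrolling depth, NUMA-aware placement, padding) and search their cross-product of parameters, but the structure — enumerate variants, benchmark, select — is the same.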