Mesh independent loop fusion for unstructured mesh applications

ACM International Conference on Computing Frontiers. Pub Date: 2012-05-15. DOI: 10.1145/2212908.2212917

C. Bertolli, A. Betts, P. Kelly, G. Mudalige, M. Giles


Citations: 7

Abstract

Applications based on unstructured meshes are typically compute-intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used to accelerate them, but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden by allowing programmers to express parallel iterations over elements in the unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and discover opportunities for optimisation. In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis, which is often inhibited by the size of the control flow graph, often becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We subsequently show how whole-program analysis then becomes an enabler of OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be defined at compile time. We examine the limits of its application and show experimental results on a computational fluid dynamics application benchmark, assessing the performance gains due to loop fusion.
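To make the idea of loop fusion over an unstructured mesh concrete, the following is a minimal sketch in plain C. It does not use the OP2 API and is not taken from the paper; the names `edge2node`, `sweep_unfused`, `sweep_fused` and `fused_matches_unfused` are hypothetical. It assumes two loops over the same edge set, where the second loop reads only the per-edge value written by the first loop in the same iteration. Under that assumption the two loops can be merged whatever the connectivity stored in `edge2node`, which is one simple case of a fusion that does not depend on the mesh.

```c
#include <stddef.h>

/* Hypothetical mesh layout: edge e connects nodes
 * edge2node[2*e] and edge2node[2*e + 1]. */

/* Unfused version: two separate sweeps over the edge set. */
void sweep_unfused(size_t n_edges, const int *edge2node,
                   const double *x, double *flux, double *res)
{
    /* Loop 1: indirect read of node data, direct write of edge data. */
    for (size_t e = 0; e < n_edges; ++e) {
        flux[e] = x[edge2node[2 * e + 1]] - x[edge2node[2 * e]];
    }
    /* Loop 2: direct read of edge data, indirect increment of node data. */
    for (size_t e = 0; e < n_edges; ++e) {
        res[edge2node[2 * e]]     += flux[e];
        res[edge2node[2 * e + 1]] -= flux[e];
    }
}

/* Fused version: one sweep. This is legal here because loop 2 only
 * needs flux[e] from the same iteration e, and loop 1 never reads res. */
void sweep_fused(size_t n_edges, const int *edge2node,
                 const double *x, double *flux, double *res)
{
    for (size_t e = 0; e < n_edges; ++e) {
        const int n0 = edge2node[2 * e];
        const int n1 = edge2node[2 * e + 1];
        const double f = x[n1] - x[n0];
        flux[e] = f;   /* kept in case later code reads the edge data */
        res[n0] += f;
        res[n1] -= f;
    }
}

/* Runs both versions on a small mesh (a square with one diagonal:
 * 4 nodes, 5 edges) and returns 1 if all results agree, 0 otherwise. */
int fused_matches_unfused(void)
{
    const int edge2node[10] = {0, 1, 1, 2, 2, 3, 3, 0, 0, 2};
    const double x[4] = {1.0, 2.0, 4.0, 8.0};
    double flux_a[5], flux_b[5];
    double res_a[4] = {0.0, 0.0, 0.0, 0.0};
    double res_b[4] = {0.0, 0.0, 0.0, 0.0};

    sweep_unfused(5, edge2node, x, flux_a, res_a);
    sweep_fused(5, edge2node, x, flux_b, res_b);

    for (size_t e = 0; e < 5; ++e) {
        if (flux_a[e] != flux_b[e]) return 0;
    }
    for (size_t n = 0; n < 4; ++n) {
        if (res_a[n] != res_b[n]) return 0;
    }
    return 1;
}
```

The fused sweep traverses the edge map once instead of twice and keeps the intermediate value in a local variable. If the second loop instead read a value that the first loop accumulates indirectly through the map, fusion would no longer be safe in this simple form, which hints at why the technique is hard to apply to unstructured mesh applications in general.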

