A. Khawaja, Jiajun Wang, A. Gerstlauer, L. John, D. Malhotra, G. Biros
Performance analysis of HPC applications with irregular tree data structures
2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)
Published: 2014-12-01
DOI: 10.1109/PADSW.2014.7097837
Citations: 1
Abstract
Adaptive mesh refinement (AMR) numerical methods utilizing octree data structures are an important class of HPC applications, in particular for the solution of partial differential equations. Much effort goes into the implementation of efficient versions of these types of programs, where the emphasis is often on increasing multi-node performance when utilizing GPUs and coprocessors. By contrast, our analysis aims to characterize these workloads on traditional CPUs, as we believe that single-threaded intra-node performance of critical kernels is still a key factor for achieving performance at scale. Irregular workloads such as AMR methods, however, exhibit especially severe underutilization on general-purpose processors. In this paper, we analyze the single-core performance of two state-of-the-art, highly scalable adaptive mesh refinement codes, one based on the Fast Multipole Method (FMM) and one based on the Finite Element Method (FEM), when running on an x86 CPU. We examined both scalar and vectorized implementations to identify performance bottlenecks. We demonstrate that vectorization can provide a significant benefit in achieving high performance. The greatest bottleneck to peak performance is the high fraction of non-floating-point instructions in the kernels.
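As a loose illustration of the scalar-versus-vectorized comparison the abstract describes, the sketch below contrasts a scalar direct-interaction kernel (the kind of inner computation an FMM code performs) with a batched NumPy form. The function names and the 1/r potential are illustrative assumptions, not taken from the paper; the scalar loop's bookkeeping (indexing, branching) stands in for the non-floating-point instruction overhead the authors identify as the main bottleneck.

```python
import numpy as np

def potential_scalar(src, trg):
    # Scalar nested loop: each iteration spends many instructions on
    # indexing, branching, and loop control relative to the FP work,
    # mirroring the instruction-mix bottleneck described in the abstract.
    pot = np.zeros(len(trg))
    for i, t in enumerate(trg):
        for s in src:
            d = t - s
            r2 = d @ d
            if r2 > 0.0:          # skip self-interaction
                pot[i] += 1.0 / np.sqrt(r2)
    return pot

def potential_vectorized(src, trg):
    # Batched form: all pairwise distances are computed at once, so the
    # floating-point work maps onto wide SIMD units with far less
    # per-element bookkeeping.
    d = trg[:, None, :] - src[None, :, :]        # shape (n_trg, n_src, 3)
    r = np.sqrt(np.einsum('ijk,ijk->ij', d, d))  # pairwise distances
    with np.errstate(divide='ignore'):
        inv = np.where(r > 0.0, 1.0 / r, 0.0)    # zero out self-interactions
    return inv.sum(axis=1)

rng = np.random.default_rng(0)
pts = rng.random((64, 3))
assert np.allclose(potential_scalar(pts, pts), potential_vectorized(pts, pts))
```

Both routines compute the same potentials; the vectorized version simply trades the per-pair control flow for one large array expression, which is the kind of restructuring whose payoff the paper measures on real FMM and FEM kernels.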