Scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly

Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Pub Date: 2015-01-24. DOI: 10.1145/2688500.2688517

Loïc Thébault, E. Petit, Quang Dinh


Citations: 10

Abstract

Exposing massive parallelism in computations on 3D unstructured meshes, with efficient load balancing and minimal synchronization, is challenging. Current approaches that rely on domain decomposition and mesh coloring struggle to scale with the increasing number of cores per node, especially on new many-core processors. In this paper, we propose a hybrid approach that uses domain decomposition to exploit distributed-memory parallelism, Divide-and-Conquer (D&C) to exploit shared-memory parallelism and improve locality, and mesh coloring at the core level to exploit vectorization. The approach illustrates a new trade-off for many-core processors between structuredness, memory locality, and vectorization. We evaluate it on the finite element matrix assembly of an industrial fluid dynamics code developed by Dassault Aviation, and we compare D&C with domain decomposition and with mesh coloring. D&C achieves high parallel efficiency, good data locality, and improved bandwidth usage. On current nodes it competes with the optimized pure MPI version, with a speed-up of at least 10%. D&C shows a 319x strong-scaling speed-up on 512 cores (32 nodes) with only 2000 vertices per core. Finally, the Intel Xeon Phi version performs similarly to 10 Intel Xeon E5-2665 (Sandy Bridge) cores and reaches 95% parallel efficiency on its 60 physical cores. Running on 4 Xeon Phi (240 cores), D&C reaches 92% efficiency on the physical cores and performs similarly to 33 Intel Xeon E5-2665 (Sandy Bridge) cores.
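The abstract does not give implementation details, so the following is only a minimal sketch of the general Divide-and-Conquer idea for assembly, not the authors' code. It assumes a mesh given as tuples of vertex ids, splits the element list by position (a real code would presumably use a graph partitioner), and accumulates a per-vertex count in place of real element matrices. The names `split`, `assemble`, and `leaf_size` are hypothetical. The point it shows: the left and right halves share no vertices, so they could be assembled as independent tasks without locks, while the separator elements are handled afterwards.

```python
from typing import List, Sequence, Tuple

Element = Tuple[int, ...]  # vertex ids of one mesh element (assumed representation)


def split(elements: Sequence[Element]) -> Tuple[List[Element], List[Element], List[Element]]:
    """Split elements into (left, right, separator).

    Assumption: the split is by list position, purely for illustration.
    'right' keeps only elements that touch no vertex of 'left';
    every other element of the second half goes to 'separator'.
    """
    half = len(elements) // 2
    left = list(elements[:half])
    left_vertices = {v for e in left for v in e}
    right: List[Element] = []
    separator: List[Element] = []
    for e in elements[half:]:
        if left_vertices.isdisjoint(e):
            right.append(e)
        else:
            separator.append(e)
    return left, right, separator


def assemble(elements: Sequence[Element], diag: List[float], leaf_size: int = 2) -> None:
    """Recursively accumulate a stand-in contribution of 1.0 per (element, vertex).

    leaf_size must be at least 1 so that the recursion terminates.
    """
    if len(elements) <= leaf_size:
        for e in elements:
            for v in e:
                diag[v] += 1.0  # placeholder for a real element-matrix scatter
        return
    left, right, separator = split(elements)
    # left and right touch disjoint vertex sets, so these two calls
    # could run as parallel tasks; they are run sequentially here.
    assemble(left, diag, leaf_size)
    assemble(right, diag, leaf_size)
    # The separator touches vertices of left (and possibly right),
    # so it is processed only after both halves are done.
    assemble(separator, diag, leaf_size)


if __name__ == "__main__":
    # A 1D chain of 8 two-vertex elements over 9 vertices.
    chain = [(i, i + 1) for i in range(8)]
    counts = [0.0] * 9
    assemble(chain, counts)
    print(counts)
```

On the chain mesh in the usage example, each vertex ends up with the number of elements that touch it, the same result a plain sequential loop would give. Only the order and the independence of the work change.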



