Scalable NUMA-Aware Wilson-Dirac on Supercomputers

2017 International Conference on High Performance Computing & Simulation (HPCS) Pub Date: 2017-07-17 DOI: 10.1109/HPCS.2017.56

C. Tadonki


Citations: 4

Abstract

We revisit the Wilson-Dirac operator, also referred to as Dslash, on NUMA manycore vector machines, seeking an efficient supercomputing implementation. Quantum ChromoDynamics (QCD) is the theory of the strong nuclear force, and its discrete formalism is Lattice Quantum ChromoDynamics (LQCD). Wilson-Dirac is the major computing kernel in LQCD, where special attention is paid to large-scale simulations. The corresponding computing demand is tremendous at various levels, from storage to floating-point operations, hence the crucial need for powerful supercomputers. Designing efficient LQCD codes on modern (mostly hybrid) supercomputers requires efficiently exploiting all available levels of parallelism, including accelerators. Since Wilson-Dirac is a coarse-grain stencil computation performed on a huge volume of data, any investigation of performance and scalability should carefully address memory accesses and interprocessor communication overheads. To lower the latter, explicit shared-memory implementations should be considered at the level of a compute node, since this leads to a less complex data communication graph and thus (at least intuitively) reduces the overall communication latency. We focus on this aspect and propose a novel, efficient NUMA-aware scheduling, together with a combination of the major HPC strategies for large-scale LQCD. We reach nearly optimal performance on a single core and a significant scalability improvement on several NUMA nodes. Then, using a classical domain decomposition approach, we extend our scheduling to a large cluster of many-core nodes, thus illustrating the global efficiency of our hybrid implementation.
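The abstract's argument that node-level shared memory simplifies the communication graph can be illustrated with a rough count of the boundary sites exchanged under domain decomposition. The sketch below is not the paper's scheduling or code. It assumes a 4D periodic lattice, a nearest-neighbor stencil with a one-site-deep halo, and an even split over a 4D process grid. The function names (`halo_sites`, `decompose`), the lattice size, and both process grids are hypothetical choices made for illustration only.

```python
from math import prod


def halo_sites(local_dims, proc_grid):
    """Sites one subdomain sends per stencil application (one-site-deep halo).

    In every direction that is split across more than one process, the
    subdomain sends two faces (forward and backward). A face perpendicular
    to a direction of extent n holds volume / n sites.
    """
    volume = prod(local_dims)
    return sum(2 * volume // n for n, p in zip(local_dims, proc_grid) if p > 1)


def decompose(lattice, proc_grid):
    """Split a 4D lattice evenly over a 4D process grid.

    Returns (local subdomain dims, number of ranks, total sites exchanged).
    """
    if any(n % p for n, p in zip(lattice, proc_grid)):
        raise ValueError("process grid must divide the lattice evenly")
    local = tuple(n // p for n, p in zip(lattice, proc_grid))
    ranks = prod(proc_grid)
    return local, ranks, ranks * halo_sites(local, proc_grid)


# Illustrative 32 x 32 x 32 x 64 lattice (an assumption, not from the paper).
lattice = (32, 32, 32, 64)

# One message-passing rank per core: 16 small subdomains.
per_core = decompose(lattice, (1, 2, 2, 4))

# One rank per node, with threads sharing memory inside it: 2 large subdomains.
per_node = decompose(lattice, (1, 1, 1, 2))

print(per_core)  # ((32, 16, 16, 16), 16, 786432)
print(per_node)  # ((32, 32, 32, 32), 2, 131072)
```

Under these assumptions the per-node layout exchanges six times fewer sites and involves far fewer communicating pairs, which is the intuition the abstract appeals to. The count ignores message latency, NUMA placement inside a node, and any overlap of communication with computation.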


Source publication

2017 International Conference on High Performance Computing & Simulation (HPCS)

Self-citation rate

0.00%

Publication count