Optimized Mappings for Symmetric Range-Limited Molecular Force Calculations on FPGAs

Chunshu Wu, Sahan Bandara, Tong Geng, Anqi Guo, Pouya Haghi, Vipin Sachdeva, W. Sherman, Martin C. Herbordt
DOI: 10.1109/FPL57034.2022.00026 (https://doi.org/10.1109/FPL57034.2022.00026)
Published in: 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)
Publication date: 2022-08-01
Citations: 1

Abstract

In N-body applications, the efficient evaluation of range-limited forces depends on applying certain constraints, including a cut-off radius and force symmetry (Newton's Third Law). When computing the pairwise forces in parallel, finding the optimal mapping of particles and computations to memories and processors is surprisingly challenging, but can greatly reduce data movement and computation. Although FPGAs have a compute model (BRAMs/network/pipelines) distinct from that of CPUs and ASICs, mappings on FPGAs have not previously been studied in depth: it was thought that the half-shell method was preferred. In this work, we find that the Manhattan method is surprisingly compatible with FPGA hardware. With the cache overlapping technique proposed in this paper, the ultra-fine-grained data access demanded by the Manhattan method can be satisfied, even though the memory blocks on FPGAs appear to be insufficiently fine-grained. We further demonstrate that, compared to the traditional baseline half-shell method, approximately half of the filters (preprocessors) can be removed without performance degradation. For communication, the amount of data transferred can be reduced by 40% to 75% in the most common multi-FPGA scenarios. Moreover, data transfers are almost perfectly balanced along all directions, and the optimization requires only minimal hardware resources. The practical consequence is that nearly 2x to 4x the workload can be handled without upgrading the network connections between FPGAs. This is a critical finding given the relatively limited bandwidth available in many common accelerator boards and the strong-scaling applications to which FPGA clusters are being applied.
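The two constraints named in the abstract, the cut-off radius and force symmetry, can be illustrated with a small software sketch. The code below is not the paper's FPGA design: it assumes a Lennard-Jones pair force as a stand-in for the molecular force, and the names `half_shell_offsets`, `lj_pair_force`, and `range_limited_forces` are hypothetical. It shows a distance filter that discards pairs beyond the cut-off, a single force evaluation per pair whose result is applied to both particles with opposite signs, and the 13 neighbor-cell offsets a half-shell scheme visits when cells are at least one cut-off radius wide.

```python
import itertools


def half_shell_offsets():
    """Neighbor-cell offsets for a 3D half-shell scheme (assumes cell edge >= cutoff).

    Of the 26 neighbors of a cell, only the lexicographically positive half
    is kept, so every pair of adjacent cells is visited exactly once.
    """
    return [d for d in itertools.product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]


def lj_pair_force(rij, r2, epsilon=1.0, sigma=1.0):
    """Lennard-Jones force on particle i due to j (assumed force model).

    rij is the vector r_i - r_j and r2 its squared length.
    """
    s6 = (sigma * sigma / r2) ** 3
    scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
    return tuple(scale * c for c in rij)


def range_limited_forces(positions, cutoff):
    """Return (forces, evaluated): per-particle forces and the number of
    force evaluations that passed the cut-off filter."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    evaluated = 0
    rc2 = cutoff * cutoff
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair is considered once
            rij = tuple(a - b for a, b in zip(positions[i], positions[j]))
            r2 = sum(c * c for c in rij)
            if r2 >= rc2:
                continue  # filter: pair lies outside the cut-off radius
            f = lj_pair_force(rij, r2)
            evaluated += 1
            for k in range(3):
                forces[i][k] += f[k]  # force on i
                forces[j][k] -= f[k]  # equal and opposite force on j
    return forces, evaluated


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    forces, evaluated = range_limited_forces(pts, cutoff=2.5)
    print(len(half_shell_offsets()), evaluated, forces)
```

In this example only the first two particles are within the cut-off, so one force evaluation serves both of them. In a parallel implementation, where that one evaluation takes place and which particle data must be moved to reach it is the mapping question the abstract refers to.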