Global Atmospheric Simulation on a Reconfigurable Platform

L. Gan, H. Fu, W. Luk, Chao Yang, Wei Xue, Guangwen Yang
{"title":"基于可重构平台的全球大气模拟","authors":"L. Gan, H. Fu, W. Luk, Chao Yang, Wei Xue, Guangwen Yang","doi":"10.1109/FCCM.2013.26","DOIUrl":null,"url":null,"abstract":"Summary form only given. As the only method to study long-term climate trend and to predict potential climate risk, climate modeling is becoming a key research topic among governments and research organizations. One of the most essential and challenging components in climate modeling is the atmospheric model. To cover high resolution in climate simulation scenarios, developers have to face the challenges from billions of mesh points and extremely complex algorithms. Shallow Water Equations (SWEs) are a set of conservation laws that perform most of the essential characteristics of the atmosphere. The study of SWEs can serve as the starting point for understanding the dynamic behavior of the global atmosphere. We choose cubed-sphere mesh as the computational mesh for its better load balance in pole regions over other meshes such as the latitude-longitude mesh. The cubed-sphere mesh is obtained by mapping a cube to the surface of the sphere. The computational domain is then the six patches, each of which is covered with N × N mesh points to be calculated. When written in local coordinates, SWEs have an identical expression on the six patches, that is ∂Q/∂t + 1/Λ ∂(ΛF1)/∂x1 + 1/Λ ∂(ΛF1)/∂z2 + S=0, (1) where (x1, x2) ∈ [-π/4, π/4] are the local coordinates, Q = (h, hu1, hu2)T is the prognostic variable, Fi = uiQ (i = 1, 2) are the convective fluxes, S is the source term. Spatially discretized with a cell-centered finite volume method and integrated with a second-order accurate TVD Runge-Kutta method, SWE solvers are transferred to the computation of a 13-point upwind stencil that exhibits a diamond shape. To get the prognostic components (h, hu1 and hu2) of the central point, its neighboring 12 points need to be accessed. The stencil kernel includes at least 434 ADD/SUB operations, 570 multiplications, 99 divisions. The high arithmetic density of the SWEs algorithm makes it difficult to implement one kernel into the resource-limited FPGA card. In this study, we first proposes a hybrid algorithm that utilizes both CPUs and FPGAs to simulate the global shallow water equations (SWEs). In each of the computational patch, most of the complicated communications happen in the two layers of the outer boundary, whose value need to be exchanged with other patches. Therefore, we decompose each of the six patches into an outer part that includes two layers of the outer boundary meshes, and an inner part that is the remaining part. We assign CPU to handle the communications and the stencil calculation of the outer part, while assign FPGA to process the inner-part stencil. In this way, FPGA and CPU will work simultaneously and the CPU time for stencil and communication can be hidden in the FPGA time for stencil. For the Virtex-6 SX475T that we use in our study, the original program in double-precision will require 299% of the on-board LUTs, 283% of the FFs and 189% of the DSPs, and cannot fit into one FPGA. In order to fit the SWE kernel into one FPGA chip, we apply two algorithmic optimizations to the original design. One is to replace certain computations by lookup tables, so as to reduce the usage of computation resources. The other one is to locate common factors in the algorithm and to remove redundant computations. These two optimizations reduce the resource usage by 20%. 
To further reduce the resource cost and to fit the extremely complex stencil kernel into one FPGA chip, we perform optimization in the space of customizable representations and precisions. For the variables with a relatively small range, we apply fixed-point number to replace the double-precisions. For the rest parts with a wide dynamic range, we use floating-point numbers with a mixed-precision. Through mixed-precision floating-point and fixed-point arithmetic, we build a complex upwind stencil kernel on a single FPGA. The design includes a highly-efficient pipeline that can perform hundreds of floating-point and fixed-point arithmetic operations concurrently. Compared with our previous work in [1], the solution based on one FPGA acceleration card provides 100 times speedup over a 6-core CPU, and 4 times speedup over a Tianhe-1A supercomputer node that consists of 12 CPU cores and one Fermi GPU.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Global Atmospheric Simulation on a Reconfigurable Platform\",\"authors\":\"L. Gan, H. Fu, W. Luk, Chao Yang, Wei Xue, Guangwen Yang\",\"doi\":\"10.1109/FCCM.2013.26\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Summary form only given. As the only method to study long-term climate trend and to predict potential climate risk, climate modeling is becoming a key research topic among governments and research organizations. One of the most essential and challenging components in climate modeling is the atmospheric model. To cover high resolution in climate simulation scenarios, developers have to face the challenges from billions of mesh points and extremely complex algorithms. Shallow Water Equations (SWEs) are a set of conservation laws that perform most of the essential characteristics of the atmosphere. The study of SWEs can serve as the starting point for understanding the dynamic behavior of the global atmosphere. We choose cubed-sphere mesh as the computational mesh for its better load balance in pole regions over other meshes such as the latitude-longitude mesh. The cubed-sphere mesh is obtained by mapping a cube to the surface of the sphere. The computational domain is then the six patches, each of which is covered with N × N mesh points to be calculated. When written in local coordinates, SWEs have an identical expression on the six patches, that is ∂Q/∂t + 1/Λ ∂(ΛF1)/∂x1 + 1/Λ ∂(ΛF1)/∂z2 + S=0, (1) where (x1, x2) ∈ [-π/4, π/4] are the local coordinates, Q = (h, hu1, hu2)T is the prognostic variable, Fi = uiQ (i = 1, 2) are the convective fluxes, S is the source term. Spatially discretized with a cell-centered finite volume method and integrated with a second-order accurate TVD Runge-Kutta method, SWE solvers are transferred to the computation of a 13-point upwind stencil that exhibits a diamond shape. To get the prognostic components (h, hu1 and hu2) of the central point, its neighboring 12 points need to be accessed. The stencil kernel includes at least 434 ADD/SUB operations, 570 multiplications, 99 divisions. The high arithmetic density of the SWEs algorithm makes it difficult to implement one kernel into the resource-limited FPGA card. 
In this study, we first proposes a hybrid algorithm that utilizes both CPUs and FPGAs to simulate the global shallow water equations (SWEs). In each of the computational patch, most of the complicated communications happen in the two layers of the outer boundary, whose value need to be exchanged with other patches. Therefore, we decompose each of the six patches into an outer part that includes two layers of the outer boundary meshes, and an inner part that is the remaining part. We assign CPU to handle the communications and the stencil calculation of the outer part, while assign FPGA to process the inner-part stencil. In this way, FPGA and CPU will work simultaneously and the CPU time for stencil and communication can be hidden in the FPGA time for stencil. For the Virtex-6 SX475T that we use in our study, the original program in double-precision will require 299% of the on-board LUTs, 283% of the FFs and 189% of the DSPs, and cannot fit into one FPGA. In order to fit the SWE kernel into one FPGA chip, we apply two algorithmic optimizations to the original design. One is to replace certain computations by lookup tables, so as to reduce the usage of computation resources. The other one is to locate common factors in the algorithm and to remove redundant computations. These two optimizations reduce the resource usage by 20%. To further reduce the resource cost and to fit the extremely complex stencil kernel into one FPGA chip, we perform optimization in the space of customizable representations and precisions. For the variables with a relatively small range, we apply fixed-point number to replace the double-precisions. For the rest parts with a wide dynamic range, we use floating-point numbers with a mixed-precision. Through mixed-precision floating-point and fixed-point arithmetic, we build a complex upwind stencil kernel on a single FPGA. The design includes a highly-efficient pipeline that can perform hundreds of floating-point and fixed-point arithmetic operations concurrently. Compared with our previous work in [1], the solution based on one FPGA acceleration card provides 100 times speedup over a 6-core CPU, and 4 times speedup over a Tianhe-1A supercomputer node that consists of 12 CPU cores and one Fermi GPU.\",\"PeriodicalId\":269887,\"journal\":{\"name\":\"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-04-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FCCM.2013.26\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2013.26","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Summary form only given. As the only method for studying long-term climate trends and predicting potential climate risks, climate modeling is becoming a key research topic for governments and research organizations. One of the most essential and challenging components of climate modeling is the atmospheric model. To reach high resolution in climate simulation scenarios, developers have to face the challenges posed by billions of mesh points and extremely complex algorithms.

The Shallow Water Equations (SWEs) are a set of conservation laws that capture most of the essential characteristics of the atmosphere, and their study can serve as a starting point for understanding the dynamic behavior of the global atmosphere. We choose the cubed-sphere mesh as the computational mesh because it offers better load balance in the polar regions than alternatives such as the latitude-longitude mesh. The cubed-sphere mesh is obtained by mapping a cube onto the surface of the sphere; the computational domain then consists of six patches, each covered with N × N mesh points. When written in local coordinates, the SWEs take an identical form on all six patches:

∂Q/∂t + (1/Λ) ∂(ΛF1)/∂x1 + (1/Λ) ∂(ΛF2)/∂x2 + S = 0,   (1)

where (x1, x2) ∈ [−π/4, π/4] are the local coordinates, Q = (h, hu1, hu2)^T is the vector of prognostic variables, Fi = uiQ (i = 1, 2) are the convective fluxes, and S is the source term.

Spatially discretized with a cell-centered finite volume method and integrated in time with a second-order accurate TVD Runge-Kutta method, the SWE solver reduces to the computation of a 13-point upwind stencil with a diamond shape: to update the prognostic components (h, hu1 and hu2) at the central point, its 12 neighboring points must be accessed. The stencil kernel contains at least 434 additions/subtractions, 570 multiplications and 99 divisions. This high arithmetic density makes it difficult to fit even one kernel into a resource-limited FPGA card.

In this study, we first propose a hybrid algorithm that uses both CPUs and FPGAs to simulate the global shallow water equations. In each computational patch, most of the complicated communication involves the two outermost layers of mesh points, whose values need to be exchanged with the neighboring patches. We therefore decompose each of the six patches into an outer part, consisting of those two boundary layers, and an inner part, which is the remainder. The CPU handles the communication and the stencil computation of the outer part, while the FPGA processes the inner-part stencil. In this way the FPGA and the CPU work simultaneously, and the CPU time for stencil computation and communication is hidden behind the FPGA stencil time.

For the Virtex-6 SX475T used in our study, a straightforward double-precision implementation would require 299% of the on-board LUTs, 283% of the FFs and 189% of the DSPs, and therefore cannot fit into one FPGA. To fit the SWE kernel into one FPGA chip, we apply two algorithmic optimizations to the original design: one replaces certain computations with lookup tables to reduce the usage of computational resources; the other identifies common factors in the algorithm and removes redundant computations. Together these two optimizations reduce the resource usage by 20%. To further reduce the resource cost and fit the extremely complex stencil kernel into one FPGA chip, we perform optimization in the space of customizable number representations and precisions.
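The finite-volume flux formulas behind the stencil are not given in this summary, so the following C fragment is only a minimal sketch of the 13-point diamond access pattern described above: it shows which 12 neighbours must be read to update one cell, with a placeholder accumulation standing in for the several hundred arithmetic operations of the real kernel. All type and function names here are illustrative rather than taken from the implementation.

```c
/* Sketch of the diamond-shaped 13-point stencil access pattern.
 * A real update would combine the neighbours with upwind finite-volume
 * flux coefficients; here a plain accumulation marks the data accesses. */
#include <stdlib.h>

typedef struct { double h, hu1, hu2; } State;    /* prognostic variables Q  */

/* Update one interior cell (i, j) of an n-by-n patch stored row-major.
 * The caller must ensure 2 <= i, j <= n - 3 so all neighbours exist. */
static State stencil_point(const State *q, int n, int i, int j)
{
    State acc = q[i * n + j];                     /* centre point            */
    for (int di = -2; di <= 2; ++di) {
        for (int dj = -2; dj <= 2; ++dj) {
            if (di == 0 && dj == 0) continue;     /* centre already counted  */
            if (abs(di) + abs(dj) > 2) continue;  /* keep the diamond shape  */
            const State *nb = &q[(i + di) * n + (j + dj)];
            acc.h   += nb->h;                     /* placeholder for the     */
            acc.hu1 += nb->hu1;                   /* actual flux computation */
            acc.hu2 += nb->hu2;
        }
    }
    return acc;                                   /* 12 neighbours + centre  */
}
```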
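The CPU/FPGA work split can likewise be summarised in a few lines. The sketch below assumes a hypothetical asynchronous accelerator interface (fpga_launch_inner_stencil, fpga_wait) and helper routines (exchange_patch_halos, cpu_outer_stencil); none of these names come from the paper, and they serve only to illustrate how the communication and the outer-part stencil on the CPU overlap with the inner-part stencil on the FPGA.

```c
/* Hypothetical per-patch layout and interfaces, declared only so the overlap
 * structure is visible; the actual runtime API is not given in the paper.  */
typedef struct { double *q, *q_next; int n; } Patch;

void fpga_launch_inner_stencil(const double *q, double *q_next, int n); /* non-blocking        */
void fpga_wait(void);                                                   /* blocks until done   */
void exchange_patch_halos(Patch *p);          /* swap the two boundary layers with neighbours  */
void cpu_outer_stencil(const double *q, double *q_next, int n);         /* outer two layers    */

void advance_one_stage(Patch *p)
{
    /* 1. Start the inner-part stencil (everything except the two outermost
     *    layers) on the FPGA; the call is assumed to return immediately.  */
    fpga_launch_inner_stencil(p->q, p->q_next, p->n);

    /* 2. Meanwhile the CPU exchanges the two boundary layers with the
     *    neighbouring cubed-sphere patches ...                            */
    exchange_patch_halos(p);

    /* 3. ... and computes the stencil for the outer part itself.          */
    cpu_outer_stencil(p->q, p->q_next, p->n);

    /* 4. Wait for the FPGA: the CPU's communication and outer-part work
     *    are hidden behind the FPGA's inner-part computation.             */
    fpga_wait();
}
```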
In this precision optimization, variables with a relatively small range are stored as fixed-point numbers in place of double precision. For the remaining parts, which have a wide dynamic range, we use floating-point numbers with mixed precision. Through this mixed-precision floating-point and fixed-point arithmetic, we build the complete upwind stencil kernel on a single FPGA. The design includes a highly efficient pipeline that performs hundreds of floating-point and fixed-point arithmetic operations concurrently. Compared with our previous work in [1], the solution based on one FPGA acceleration card provides a 100-fold speedup over a 6-core CPU, and a 4-fold speedup over a Tianhe-1A supercomputer node consisting of 12 CPU cores and one Fermi GPU.
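The word lengths chosen for the individual variables are not stated in this summary, so the fragment below only illustrates the general fixed-point representation applied to the limited-range variables: a value with a known small range is stored as a scaled 32-bit integer (24 fractional bits here is an arbitrary choice), so that the hardware can use compact integer logic instead of full double-precision units.

```c
/* Generic fixed-point representation for a variable with a small, known
 * range. FRAC_BITS is an illustrative choice, not the paper's setting.   */
#include <stdint.h>

#define FRAC_BITS   24
#define FIXED_SCALE ((double)(1 << FRAC_BITS))

typedef int32_t fixed_t;

static inline fixed_t to_fixed(double x)    { return (fixed_t)(x * FIXED_SCALE); }
static inline double  from_fixed(fixed_t x) { return (double)x / FIXED_SCALE; }

/* Multiply in a 64-bit intermediate, then rescale back to FRAC_BITS. */
static inline fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}
```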