Automatic synthesis of physical system differential equation models to a custom network of general processing elements on FPGAs

ACM Trans. Embed. Comput. Syst. Pub Date: 2013-09-01 DOI: 10.1145/2514641.2514650

Chen-Chun Huang, F. Vahid, T. Givargis

Citations: 17

Abstract

Fast execution of physical system models has various uses, such as simulating physical phenomena or real-time testing of medical equipment. Physical system models commonly consist of thousands of differential equations. Solving such equations using software on microprocessor devices may be slow. Several past efforts implement such models as parallel circuits on special computing devices called Field-Programmable Gate Arrays (FPGAs), demonstrating large speedups due to the excellent match between the massive fine-grained local communication parallelism common in physical models and the fine-grained parallel compute elements and local connectivity of FPGAs. However, past implementation efforts were mostly manual or ad hoc. We present the first method for automatically converting a set of ordinary differential equations into circuits on FPGAs. The method uses a general Processing Element (PE) that we developed, designed to quickly solve a set of ordinary differential equations while using few FPGA resources. The method instantiates a network of general PEs, partitions equations among the PEs to minimize communication, generates each PE's custom program, creates custom connections among PEs, and maintains synchronization of all PEs in the network. Our experiments show that the method generates a 400-PE network on a commercial FPGA that executes four different models on average 15x faster than a 3 GHz Intel processor, 30x faster than a commercial 4-core ARM, 14x faster than a commercial 6-core Texas Instruments digital signal processor, and 4.4x faster than an NVIDIA 336-core graphics processing unit. We also show that the FPGA-based approach is reasonably cost effective compared to using the other platforms. The method yields 2.1x faster circuits than a commercial high-level synthesis tool that uses the traditional method for converting behavior to circuits, while using 2x fewer lookup tables, 2x fewer hardcore multiplier (DSP) units, though 3.5x more block RAM due to being programmable. Furthermore, the method does not just generate a single fastest design, but generates a range of designs that trade off size and performance, by using different numbers of PEs.
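The flow summarized in the abstract (partition equations among PEs, keep communication local, step all PEs in lockstep) can be illustrated with a small software sketch. Everything below is an assumption made for illustration and is not taken from the paper: the chain-shaped model, the contiguous-block partitioner, the forward-Euler integrator, and all function names are hypothetical, and the PEs are simulated sequentially rather than synthesized to hardware.

```python
from typing import List

# Hypothetical model (not from the paper): a chain of coupled first-order ODEs
#   dx_i/dt = k * (x_{i-1} - 2*x_i + x_{i+1}),
# with the values just outside the chain fixed at 0. Each equation depends
# only on its two neighbours, i.e. communication is local.
def derivative(i: int, x: List[float], k: float = 1.0) -> float:
    left = x[i - 1] if i > 0 else 0.0
    right = x[i + 1] if i < len(x) - 1 else 0.0
    return k * (left - 2.0 * x[i] + right)

def partition_contiguous(n_eq: int, n_pe: int) -> List[List[int]]:
    """Assign equation indices to PEs in contiguous, near-equal blocks."""
    base, extra = divmod(n_eq, n_pe)
    blocks, start = [], 0
    for p in range(n_pe):
        size = base + (1 if p < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def partition_round_robin(n_eq: int, n_pe: int) -> List[List[int]]:
    """Naive alternative: deal equations out to PEs one at a time."""
    return [list(range(p, n_eq, n_pe)) for p in range(n_pe)]

def cut_edges(blocks: List[List[int]], n_eq: int) -> int:
    """Count neighbour dependencies whose two equations sit on different PEs."""
    owner = {i: p for p, block in enumerate(blocks) for i in block}
    return sum(1 for i in range(n_eq - 1) if owner[i] != owner[i + 1])

def step_network(blocks: List[List[int]], x: List[float], dt: float) -> List[float]:
    """One synchronized forward-Euler step.

    Every PE reads only the state from the previous step; all new values
    are committed together, which models the lockstep synchronization.
    """
    new_x = list(x)
    for block in blocks:  # on hardware, each block would run in parallel
        for i in block:
            new_x[i] = x[i] + dt * derivative(i, x)
    return new_x

def step_monolithic(x: List[float], dt: float) -> List[float]:
    """Reference: the same Euler step computed without any partitioning."""
    return [x[i] + dt * derivative(i, x) for i in range(len(x))]

if __name__ == "__main__":
    n_eq, n_pe = 10, 4
    blocks = partition_contiguous(n_eq, n_pe)
    print("contiguous cut edges:", cut_edges(blocks, n_eq))
    print("round-robin cut edges:",
          cut_edges(partition_round_robin(n_eq, n_pe), n_eq))
```

For this 10-equation chain on 4 PEs, the contiguous partition leaves 3 neighbour dependencies crossing a PE boundary, whereas the round-robin partition leaves all 9 crossing. That difference is the kind of communication cost a partitioner tries to reduce. Varying `n_pe` mimics, in a very loose sense, the size/performance trade-off the abstract mentions.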
