RIPL

ACM Transactions on Reconfigurable Technology and Systems (TRETS). Pub Date: 2018-03-14. DOI: 10.1145/3180481

Robert J. Stewart, Kirsty Duncan, G. Michaelson, Paulo Garcia, Deepayan Bhowmik, A. Wallace


Citations: 2

Abstract

Specialized FPGA implementations can deliver higher performance and greater power efficiency than embedded CPU or GPU implementations for real-time image processing. Programming challenges limit their wider use, because implementing FPGA architectures at the register transfer level is time-consuming and error-prone. Existing software languages supported by high-level synthesis (HLS), although providing a productivity improvement, are too general purpose to generate efficient hardware without hardware-specific code optimizations. Such optimizations leak hardware details into the abstractions that software languages are there to provide, and they require knowledge of FPGAs to generate efficient hardware, such as by using language pragmas to partition data structures across memory blocks. This article presents a thorough account of the Rathlin image processing language (RIPL), a high-level image processing domain-specific language for FPGAs. We motivate its design, based on higher-order algorithmic skeletons, with requirements from the image processing domain. RIPL’s skeletons suffice to elegantly describe image processing stencils, as well as recursive algorithms with nonlocal random access patterns. At its core, RIPL employs a dataflow intermediate representation. We give a formal account of the compilation scheme from RIPL skeletons to static and cyclostatic dataflow models to describe their data rates and static scheduling on FPGAs. RIPL compares favorably to the Vivado HLS OpenCV library and C++ compiled with Vivado HLS. RIPL achieves between 54 and 191 frames per second (FPS) at 100 MHz for four synthetic benchmarks, faster than HLS OpenCV in three cases. Two real-world algorithms are implemented in RIPL: visual saliency and mean shift segmentation. For the visual saliency algorithm, RIPL achieves 71 FPS compared to optimized C++ at 28 FPS. RIPL is also concise, being 5x shorter than C++ and 111x shorter than an equivalent direct dataflow implementation. For mean shift segmentation, RIPL achieves 7 FPS compared to optimized C++ on 64 CPU cores at 1.1 FPS, and RIPL is 10x shorter than the direct dataflow FPGA implementation.
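To make the idea of higher-order skeletons for stencils concrete, the following is a minimal Python sketch. It is not RIPL syntax and does not reflect the RIPL compiler; the names `map_pixel`, `stencil`, and `box_blur`, and the choice of clamped borders, are assumptions made purely for illustration.

```python
from typing import Callable, List

Image = List[List[int]]  # row-major grayscale image (illustrative type)


def map_pixel(f: Callable[[int], int], img: Image) -> Image:
    """Pointwise skeleton: apply f to every pixel independently."""
    return [[f(p) for p in row] for row in img]


def stencil(f: Callable[[List[int]], int], img: Image, radius: int = 1) -> Image:
    """Stencil skeleton: apply f to the (2r+1)x(2r+1) window around each pixel.

    Border handling is an assumption: coordinates are clamped to the image.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            row.append(f(window))
        out.append(row)
    return out


def box_blur(window: List[int]) -> int:
    """Kernel passed to the stencil skeleton: integer mean of the window."""
    return sum(window) // len(window)


# Compose two skeletons into a small pipeline: blur, then invert.
img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
blurred = stencil(box_blur, img)
inverted = map_pixel(lambda p: 255 - p, blurred)
```

The user supplies only the per-window or per-pixel function, while the skeleton fixes the access pattern. In a sketch like this, that fixed access pattern is what a compiler could in principle map to a streaming dataflow stage with known data rates.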


Source journal

ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Self-citation rate

0.00%

Publication volume