FlowPix:使用特定领域编译器加速FPGA覆盖上的图像处理管道

IF 1.8 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-10-25 DOI:10.1145/3629523

Ziaul Choudhury, Anish Gulati, Suresh Purini

{"title":"FlowPix:使用特定领域编译器加速FPGA覆盖上的图像处理管道","authors":"Ziaul Choudhury, Anish Gulati, Suresh Purini","doi":"10.1145/3629523","DOIUrl":null,"url":null,"abstract":"The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs because of their low power consumption and a higher degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources towards achieving optimal throughput. Hardware Descriptive Languages (HDL) like Verilog have been traditionally used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register transfer level. Domain Specific Languages (DSLs) and compilers have been recently used to weave together handwritten HDLs templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor, which is synthesized and flashed on the FPGA once but is flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kande optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in the FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption that we experience. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels in the implemented benchmarks.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 4","pages":"0"},"PeriodicalIF":1.8000,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler\",\"authors\":\"Ziaul Choudhury, Anish Gulati, Suresh Purini\",\"doi\":\"10.1145/3629523\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs because of their low power consumption and a higher degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources towards achieving optimal throughput. Hardware Descriptive Languages (HDL) like Verilog have been traditionally used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register transfer level. Domain Specific Languages (DSLs) and compilers have been recently used to weave together handwritten HDLs templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor, which is synthesized and flashed on the FPGA once but is flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kande optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in the FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption that we experience. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels in the implemented benchmarks.\",\"PeriodicalId\":50920,\"journal\":{\"name\":\"ACM Transactions on Architecture and Code Optimization\",\"volume\":\"42 4\",\"pages\":\"0\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2023-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Architecture and Code Optimization\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3629523\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3629523","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

近年来，摩尔定律所保证的指数级性能增长已经开始逐渐减弱。同时，像图像处理这样的新兴应用需要大量的计算性能。这些因素不可避免地导致领域特定加速器(DSA)的出现，以填补传统架构留下的性能空白。fpga由于其低功耗和更高程度的并行性，正迅速发展成为设计dsa的定制asic的替代方案。FPGA上的DSA设计需要仔细校准FPGA计算和内存资源，以实现最佳吞吐量。像Verilog这样的硬件描述语言(HDL)传统上被用于设计FPGA硬件。HDLs不面向任何域，用户必须投入大量精力来描述寄存器传输级别的硬件。领域特定语言(dsl)和编译器最近被用于将针对特定领域的手写hdl模板编织在一起。最近的努力是设计带有针对fpga的图像处理dsl的dsa。DSL中的图像计算降低到预先存在的模板或低级语言(如HLS-C)。这种方法需要为每个新的工作负载重新刷新昂贵的FPGA。与这种固定功能的硬件方法相比，覆盖正在获得吸引力。覆盖层是类似处理器的dsa，它被合成并在FPGA上闪现一次，但足够灵活，可以通过软重新配置处理广泛的计算。在图像处理覆盖的背景下，报告的工作较少。图像处理算法在大小和形状上各不相同，从简单的模糊操作到复杂的金字塔系统。设计图像处理覆盖层的主要挑战是保持映射不同算法的灵活性。本文提出了一种基于dsl的图像处理叠加加速器FlowPix。DSL程序表示为管道，每个阶段表示整个算法中的一个计算步骤。我们在Virtex-7-690t FPGA上使用FlowPix实现了15个图像处理基准测试。基准测试范围从简单的模糊操作到复杂的管道，如Lucas-Kande光流。我们将FlowPix与现有的DSL-to-FPGA框架(如Hetero-Halide和Vitis Vision库)进行比较，这些框架可以生成固定功能的硬件。在大多数基准测试中，我们看到延迟降低了25%，FPGA LUT消耗增加了大约1.7到2倍。我们能够执行任何基准测试，而不会产生硬件合成、放置和路由以及FPGA重新闪烁的高成本，这证明了我们所经历的轻微性能损失和增加的资源消耗是合理的。在实现的基准测试中，FlowPix在1920x1080像素的高清帧上实现了170 FPS的平均帧率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler

The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs because of their low power consumption and a higher degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources towards achieving optimal throughput. Hardware Descriptive Languages (HDL) like Verilog have been traditionally used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register transfer level. Domain Specific Languages (DSLs) and compilers have been recently used to weave together handwritten HDLs templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor, which is synthesized and flashed on the FPGA once but is flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kande optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in the FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption that we experience. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels in the implemented benchmarks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Architecture and Code Optimization 工程技术-计算机：理论方法

CiteScore

3.60

自引率

6.20%

发文量

审稿时长

6-12 weeks

期刊介绍： ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.