{"title":"具有数据重用和细粒度并行性的模板计算优化微体系结构(摘要)","authors":"Yuze Chi, Peipei Zhou, J. Cong","doi":"10.1145/3174243.3174964","DOIUrl":null,"url":null,"abstract":"Stencil computation is one of the most important kernels for many applications such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high throughput stencil kernel is not trivial due to its nature of high memory access load and low operational intensity. In this work we adopt data reuse and fine-grained parallelism and present an optimal microarchitecture for stencil computation. The data reuse line buffers not only fully utilize the external memory bandwidth and fully reuse the input data, they also minimize the size of data reuse buffer given the number of fine-grained parallelized and fully pipelined PEs. With the proposed microarchitecture, the number of PEs can be increased to saturate all available off-chip memory bandwidth. We implement this microarchitecture with a high-level synthesis (HLS) based template instead of register transfer level (RTL) specifications, which provides great programmability. To guide the system design, we propose a performance model in addition to detailed model evaluation and optimization analysis. Experimental results from on-board execution show that our design can provide an average of 6.5x speedup over line buffer-only design with only 2.4x resource overhead. Compared with loop transformation-only design, our design can implement a fully pipelined accelerator for applications that cannot be implemented with loop transformation-only due to its high memory conflict and low design flexibility. Furthermore, our FPGA implementation provides 83% throughput of a 14-core CPU with 4x energy-efficiency.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism: (Abstract Only)\",\"authors\":\"Yuze Chi, Peipei Zhou, J. Cong\",\"doi\":\"10.1145/3174243.3174964\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stencil computation is one of the most important kernels for many applications such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high throughput stencil kernel is not trivial due to its nature of high memory access load and low operational intensity. In this work we adopt data reuse and fine-grained parallelism and present an optimal microarchitecture for stencil computation. The data reuse line buffers not only fully utilize the external memory bandwidth and fully reuse the input data, they also minimize the size of data reuse buffer given the number of fine-grained parallelized and fully pipelined PEs. With the proposed microarchitecture, the number of PEs can be increased to saturate all available off-chip memory bandwidth. We implement this microarchitecture with a high-level synthesis (HLS) based template instead of register transfer level (RTL) specifications, which provides great programmability. To guide the system design, we propose a performance model in addition to detailed model evaluation and optimization analysis. 
Experimental results from on-board execution show that our design can provide an average of 6.5x speedup over line buffer-only design with only 2.4x resource overhead. Compared with loop transformation-only design, our design can implement a fully pipelined accelerator for applications that cannot be implemented with loop transformation-only due to its high memory conflict and low design flexibility. Furthermore, our FPGA implementation provides 83% throughput of a 14-core CPU with 4x energy-efficiency.\",\"PeriodicalId\":164936,\"journal\":{\"name\":\"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-02-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3174243.3174964\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3174243.3174964","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism: (Abstract Only)
Stencil computation is one of the most important kernels in many applications, such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high-throughput stencil kernel is not trivial because of its high memory-access load and low operational intensity. In this work we combine data reuse and fine-grained parallelism and present an optimal microarchitecture for stencil computation. The data-reuse line buffers not only fully utilize the external memory bandwidth and fully reuse the input data, but also minimize the reuse-buffer size for a given number of fine-grained, fully pipelined processing elements (PEs). With the proposed microarchitecture, the number of PEs can be increased until all available off-chip memory bandwidth is saturated. We implement this microarchitecture as a high-level synthesis (HLS) based template instead of a register-transfer-level (RTL) specification, which provides great programmability. To guide the system design, we propose a performance model along with detailed model evaluation and optimization analysis. Experimental results from on-board execution show that our design provides an average 6.5x speedup over a line-buffer-only design with only 2.4x resource overhead. Compared with a loop-transformation-only design, our design can implement a fully pipelined accelerator for applications that loop transformation alone cannot handle because of high memory conflicts and low design flexibility. Furthermore, our FPGA implementation achieves 83% of the throughput of a 14-core CPU with 4x the energy efficiency.
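To make the data-reuse idea concrete, below is a minimal, hypothetical sketch of a line-buffer-based 3x3 stencil kernel in HLS-style C++. The kernel name (stencil3x3), the image dimensions, and the box-blur coefficients are illustrative assumptions and not the authors' actual template; the paper's fine-grained parallelism would additionally unroll the inner loop so that several PEs consume several pixels per cycle.

// A minimal, hypothetical sketch of a line-buffer-based 3x3 stencil kernel in
// HLS-style C++ (not the authors' actual template). WIDTH, HEIGHT, the kernel
// name, and the box-blur coefficients are illustrative assumptions.
#include <cstdint>

constexpr int WIDTH  = 1024;  // image width (assumed)
constexpr int HEIGHT = 1024;  // image height (assumed)

// Each input pixel is read from external memory exactly once; two on-chip
// line buffers plus a 3x3 window of registers provide all data reuse.
void stencil3x3(const uint16_t* in, uint16_t* out) {
  uint16_t line0[WIDTH];   // pixels of row y-2
  uint16_t line1[WIDTH];   // pixels of row y-1
  uint16_t window[3][3];   // sliding 3x3 window of registers

  for (int y = 0; y < HEIGHT; ++y) {
    for (int x = 0; x < WIDTH; ++x) {
      // #pragma HLS pipeline II=1  -- fully pipelined: one new pixel per cycle;
      // fine-grained parallelism would unroll this loop by the PE count.
      const uint16_t px = in[y * WIDTH + x];

      // Shift the window left and fill its rightmost column from the line
      // buffers and the newly fetched pixel.
      for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 2; ++c)
          window[r][c] = window[r][c + 1];
      window[0][2] = line0[x];
      window[1][2] = line1[x];
      window[2][2] = px;

      // Rotate the line buffers for the next row.
      line0[x] = line1[x];
      line1[x] = px;

      // Emit an output once the window covers valid interior pixels.
      if (y >= 2 && x >= 2) {
        uint32_t sum = 0;
        for (int r = 0; r < 3; ++r)
          for (int c = 0; c < 3; ++c)
            sum += window[r][c];
        out[(y - 1) * WIDTH + (x - 1)] = static_cast<uint16_t>(sum / 9);
      }
    }
  }
}

The property this sketch illustrates is that each input pixel is fetched from off-chip memory exactly once while the on-chip line buffers supply every reuse of that pixel, so adding PEs scales compute throughput without increasing external-memory traffic until the available bandwidth is saturated.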