Scalable Window Generation for the Intel Broadwell+Arria 10 and High-Bandwidth FPGA Systems

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2018-02-15. DOI: 10.1145/3174243.3174262

G. Stitt, Abhay Gupta, Madison N. Emas, David Wilson, A. Baylis


Citations: 14

Abstract

Emerging FPGA systems are providing higher external memory bandwidth to compete with GPU performance. However, because FPGAs often achieve parallelism through deep pipelines, traditional FPGA design strategies do not necessarily scale well to the large numbers of replicated pipelines needed to take advantage of higher bandwidth. We show that sliding-window applications, an important subset of digital signal processing, demonstrate this scalability problem. We introduce a window generator architecture that enables replication to over 330 GB/s, which is an 8.7x improvement over previous work. We evaluate the window generator on the Intel Broadwell+Arria 10 system for 2D convolution and show that for traditional convolution (one filter per image), our approach outperforms a 12-core Xeon Broadwell E5 by 81x and a high-end Nvidia P6000 GPU by an order of magnitude for most input sizes, while improving energy by 15.7x. For convolutional neural nets (CNNs), we show that although the GPU and Xeon typically outperform existing FPGA systems, the projected performance of the window generator running on FPGAs with sufficient bandwidth can exceed that of high-end GPUs for many common CNN parameters.
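
For readers unfamiliar with the term, the following is a minimal software sketch of what a sliding-window generator feeds to a 2D convolution pipeline. It is not the architecture from the paper: the function names `generate_windows` and `convolve2d` are hypothetical, the `deque` is only a software stand-in for hardware line buffers, and the kernel is applied without flipping (the convention commonly used in CNNs). It is meant only to show that each input row is read once in streaming order while every k x k window is still produced.

```python
from collections import deque


def generate_windows(image, k):
    """Yield (row, col, window) for every k x k window of a 2D list.

    Rows are consumed once, in order; only the k most recent rows are kept,
    loosely mimicking line buffers (an illustrative assumption).
    """
    rows = deque(maxlen=k)  # holds the k most recent image rows
    for r, line in enumerate(image):
        rows.append(line)
        if len(rows) < k:
            continue  # not enough rows buffered yet to form a window
        top = r - k + 1  # row index of the window's top edge
        for c in range(len(line) - k + 1):
            yield top, c, [buf[c:c + k] for buf in rows]


def convolve2d(image, kernel):
    """Apply a square kernel to every window (no padding, no kernel flip)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r, c, win in generate_windows(image, k):
        out[r][c] = sum(
            win[i][j] * kernel[i][j] for i in range(k) for j in range(k)
        )
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(convolve2d(image, kernel))
```

In this sketch the windows are produced one at a time; the scalability problem described in the abstract concerns producing many such windows per cycle so that replicated pipelines can consume the available memory bandwidth.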

