Application of convolutional neural networks on Intel® Xeon® processor with integrated FPGA
Philip Colangelo, Enno Lübbers, Randy Huang, M. Margala, Kevin Nealis
2017 IEEE High Performance Extreme Computing Conference (HPEC), published 2017-09-01
DOI: 10.1109/HPEC.2017.8091025 (https://doi.org/10.1109/HPEC.2017.8091025)
Citation count: 13
Abstract
Intel®'s Xeon® processor with integrated FPGA is a new research platform that provides all the capabilities of a Broadwell Xeon processor with the added functionality of an Arria 10 FPGA in the same package. In this paper, we present an implementation on this platform that showcases the effectiveness of using both hardware architectures to accelerate a convolutional neural network (CNN). We choose a network topology that uses binary weights and low-precision activation data to take advantage of the customizable fabric provided by the FPGA. Further, compared to standard multiply-accumulate CNNs, binary weighted networks (BWNs) reduce the amount of computation by eliminating the need for multiplication, with little to no degradation in classification accuracy. Coupling Intel's Open Programmable Acceleration Engine (OPAE) with Caffe provides a robust framework that we used as the foundation for our application. Because the convolution primitives account for most of the computation in our network, we offload the feature and weight data to a customized binary convolution accelerator loaded in the FPGA. Employing the low-latency QuickPath Interconnect (QPI) that bridges the Broadwell Xeon processor and the Arria 10 FPGA, we can carry out fine-grained offloads while avoiding bandwidth bottlenecks. An initial proof-of-concept design that uses only a portion of the FPGA core logic demonstrates that using the Xeon processor and the FPGA together improves throughput by 2× on some layers and by 1.3× overall.
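The claim that binary weights remove the need for multiplication can be illustrated with a minimal software sketch. It is not the accelerator design described in the abstract: it assumes weights restricted to {-1, +1} and stored as sign bits (1 for +1, 0 for -1), and the function names `mac_dot`, `binary_dot`, and `binary_conv1d` are hypothetical, chosen only for this example. Under those assumptions, each multiply-accumulate step reduces to an add or a subtract.

```python
def mac_dot(activations, weights):
    """Reference multiply-accumulate dot product with explicit +1/-1 weights."""
    return sum(a * w for a, w in zip(activations, weights))


def binary_dot(activations, sign_bits):
    """Dot product with binary weights encoded as sign bits (assumed encoding).

    A bit of 1 stands for weight +1 and a bit of 0 for weight -1, so each
    term is either added to or subtracted from the accumulator. No
    multiplication is performed.
    """
    acc = 0
    for a, bit in zip(activations, sign_bits):
        if bit:
            acc += a
        else:
            acc -= a
    return acc


def binary_conv1d(features, sign_bits):
    """Valid-mode 1D sliding-window correlation using binary_dot per window."""
    k = len(sign_bits)
    return [binary_dot(features[i:i + k], sign_bits)
            for i in range(len(features) - k + 1)]


if __name__ == "__main__":
    features = [3, 1, 4, 1, 5]      # low-precision integer activations
    sign_bits = [1, 0, 1]           # encodes weights [+1, -1, +1]
    print(binary_conv1d(features, sign_bits))
```

A hardware implementation would typically process many such add/subtract lanes in parallel; this sketch only shows the arithmetic equivalence for a single filter.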