FCLNN: A Flexible Framework for Fast CNN Prototyping on FPGA with OpenCL and Caffe
Xianchao Xu, Brian Liu. 2018 International Conference on Field-Programmable Technology (FPT), December 2018. DOI: 10.1109/FPT.2018.00043
CNN algorithms are still evolving rapidly, while traditional RTL-level programming of FPGAs is relatively slow and demands considerable effort and expertise. In this paper, we propose a flexible HW/SW co-design framework for fast, high-throughput CNN prototyping that uses the commercial high-level OpenCL language and the standard open-source deep learning framework Caffe. We build a parameterizable, stream-architected convolution engine and extend it to support arbitrary input sizes and filter depths. To support an iterative development process, we provide both layer-based and subgraph-based execution schedules. To reach competitive performance, we optimize both on-chip and off-chip communication. Using our framework on an Intel Arria 10 GX1150 FPGA, we achieve 69.2 fps and 18.6 fps on the official YOLOv2-tiny-voc and YOLOv2-voc models, respectively. To the best of our knowledge, this is the first work to accelerate the state-of-the-art YOLOv2 on an FPGA with both real-time performance and an accuracy drop of less than 1%.
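The abstract does not say how a fixed, parameterized convolution engine is extended to arbitrary input sizes and filter depths. One common way to do this is tiling: split the input channels into chunks that fit the engine's native depth and accumulate the partial sums, and split the output rows into tiles whose input slices carry a (k-1)-row halo. The Python sketch below illustrates only that generic technique under stated assumptions. It is not the FCLNN implementation. The names `ENGINE_DEPTH`, `ENGINE_TILE_ROWS`, `engine_conv`, `conv_any_depth`, and `conv_any_size` are hypothetical.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # [channel][row][col]
Matrix = List[List[float]]         # [row][col]

# Hypothetical engine capacities (not taken from the paper).
ENGINE_DEPTH = 4       # max input channels handled per engine pass
ENGINE_TILE_ROWS = 2   # max output rows produced per engine pass


def engine_conv(x: Tensor3, w: Tensor3) -> Matrix:
    """Hypothetical fixed-capacity engine: valid 2-D cross-correlation of one
    filter over up to ENGINE_DEPTH channels, summed across channels."""
    k = len(w[0])
    out_h = len(x[0]) - k + 1
    out_w = len(x[0][0]) - k + 1
    assert len(x) == len(w) <= ENGINE_DEPTH, "depth exceeds engine capacity"
    assert out_h <= ENGINE_TILE_ROWS, "tile exceeds engine capacity"
    out = [[0.0] * out_w for _ in range(out_h)]
    for c in range(len(x)):
        for i in range(out_h):
            for j in range(out_w):
                for di in range(k):
                    for dj in range(k):
                        out[i][j] += x[c][i + di][j + dj] * w[c][di][dj]
    return out


def conv_any_depth(x: Tensor3, w: Tensor3) -> Matrix:
    """Handle arbitrary filter depth: run the engine on ENGINE_DEPTH-sized
    channel chunks and sum the partial output maps element-wise."""
    partials = [
        engine_conv(x[s:s + ENGINE_DEPTH], w[s:s + ENGINE_DEPTH])
        for s in range(0, len(x), ENGINE_DEPTH)
    ]
    # zip(*partials) pairs up row i of every partial; zip(*rows) pairs column j.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def conv_any_size(x: Tensor3, w: Tensor3) -> Matrix:
    """Handle arbitrary input height: split the output into row tiles and
    feed each tile its input rows plus a (k - 1)-row halo."""
    k = len(w[0])
    out_h = len(x[0]) - k + 1
    result: Matrix = []
    for r0 in range(0, out_h, ENGINE_TILE_ROWS):
        r1 = min(r0 + ENGINE_TILE_ROWS, out_h)
        # Output rows r0..r1-1 need input rows r0..r1+k-2 inclusive.
        x_tile = [ch[r0:r1 + k - 1] for ch in x]
        result.extend(conv_any_depth(x_tile, w))
    return result
```

For example, a 6-channel, 7x5 input with a 3x3 filter takes two depth chunks (4 + 2 channels) per row tile and three row tiles (2 + 2 + 1 output rows). The result matches a direct convolution. A real FPGA design would also have to schedule these passes and move the partial sums between on-chip and off-chip memory, which this sketch does not model.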