{"title":"FPGA-based LSTM Acceleration for Real-Time EEG Signal Processing: (Abstract Only)","authors":"Zhe Chen, Andrew G. Howe, H. T. Blair, J. Cong","doi":"10.1145/3174243.3174969","DOIUrl":"https://doi.org/10.1145/3174243.3174969","url":null,"abstract":"Closed-loop neurofeedback is a growing area of research and development for novel therapies to treat brain disorders. A neurofeedback device can detect disease symptoms (such as motor tremors or seizures) in real time from electroencephalogram (EEG) signals, and respond by rapidly delivering neurofeedback stimulation that relieves these symptoms. Conventional EEG processing algorithms rely on acausal filters, which impose delays that can exceed the short feedback latency required for closed-loop stimulation. In this paper, we first introduce a method for causal filtering using long short-term memory (LSTM) networks, which radically reduces the filtering latency. We then propose a reconfigurable architecture that supports time-division multiplexing of LSTM inference engines on a prototype neurofeedback device. We implemented a 128-channel EEG signal processing design on a Zynq-7030 device, and demonstrated its feasibility. Then, we further scaled up the design onto Zynq-7045 and Virtex-690t devices to achieve high performance and energy efficient implementations for massively parallel brain signal processing. We evaluated the performance against optimized implementations on CPU and GPU at the same CMOS technology node. 
Experimental results show that the Virtex-690t achieves 1.32x and 11x speed-ups over the K40c GPU and the multi-threaded Xeon E5-2860 CPU, respectively, while the FPGA achieves 6.1x and 26.6x better energy efficiency than the GPU and CPU.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126148450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping Large-Scale DNNs on Asymmetric FPGAs: (Abstract Only)","authors":"Wentai Zhang, Jiaxi Zhang, Minghua Shen, Nong Xiao, Guojie Luo","doi":"10.1145/3174243.3174982","DOIUrl":"https://doi.org/10.1145/3174243.3174982","url":null,"abstract":"FPGAs are very attractive for accelerating deep neural networks (DNNs). While a single FPGA can provide good performance for small-scale DNNs, support for large-scale DNNs is very limited because of their higher resource demands. In this paper, we propose an efficient mapping approach for accelerating large-scale DNNs on an asymmetric multi-FPGA architecture. Relative to the state-of-the-art single-FPGA resource reuse for large-scale DNNs, we adopt a multi-FPGA approach to strive for higher performance. In this approach, the neural network mapping problem can be formulated as a resource allocation problem, and a dynamic programming-based partitioning is designed to solve this problem optimally. The network topology and communication bandwidth of the multiple FPGAs guide the partitioning to boost performance while satisfying the resource-performance trade-off constraints of a single FPGA. 
Experimental results using the large-scale ResNet-152 demonstrate that our approach, deployed on sixteen FPGAs, provides a 16.4x GOPS advantage over the state-of-the-art work.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129592558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 3: Deep Learning","authors":"P. Cheung","doi":"10.1145/3252938","DOIUrl":"https://doi.org/10.1145/3252938","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127079230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 4: High Level Synthesis 1","authors":"S. Neuendorffer","doi":"10.1145/3252939","DOIUrl":"https://doi.org/10.1145/3252939","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129224253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 5: Applications 1","authors":"J. Lockwood","doi":"10.1145/3252940","DOIUrl":"https://doi.org/10.1145/3252940","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116245456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Liquid Silicon: A Data-Centric Reconfigurable Architecture Enabled by RRAM Technology","authors":"Yue Zha, J. Li","doi":"10.1145/3174243.3174244","DOIUrl":"https://doi.org/10.1145/3174243.3174244","url":null,"abstract":"This paper presents a data-centric reconfigurable architecture, namely Liquid Silicon, enabled by emerging non-volatile memory, i.e., RRAM. Compared to the heterogeneous architecture of commercial FPGAs, Liquid Silicon is inherently a homogeneous architecture comprising a two-dimensional (2D) array of identical 'tiles'. Each tile can be configured into one or a combination of four modes: TCAM, logic, interconnect, and memory. Such flexibility allows users to partition resources based on applications' needs, in contrast to the fixed hardware design using dedicated hard IP blocks in FPGAs. In addition to better resource usage, its 'memory friendly' architecture effectively addresses the limitations of commercial FPGAs, i.e., scarce on-chip memory resources, making it an effective complement to FPGAs. Moreover, its coarse-grained logic implementation results in shallower logic depth, less inter-tile routing overhead, and thus smaller area and better performance, compared with its FPGA counterpart. 
Our study shows that, on average, for both traditional and emerging applications, we achieve 62% area reduction, 27% speedup and 31% improvement in energy efficiency when mapping applications onto Liquid Silicon instead of FPGAs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133496412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Self-adaptation Method of Fitting Convolutional Neural Network into FPGA: (Abstract Only)","authors":"Ning Mao, Zhihong Huang, Xing Wei, He Zhao, Xinkai Di, Le Yu, Haigang Yang","doi":"10.1145/3174243.3175003","DOIUrl":"https://doi.org/10.1145/3174243.3175003","url":null,"abstract":"In recent years, Convolutional Neural Networks (CNNs) have been widely used in many artificial intelligence (AI) related fields. Among the many implementation platforms for CNNs, the FPGA is regarded as an optimal one because of its high power efficiency and flexibility. Although various FPGA accelerators have been proposed to realize CNNs, some of them are implemented through High-Level Synthesis, such as OpenCL, which may result in inefficient performance and resource utilization. Therefore, we propose to parameterize the RTL design at both the algorithm and hardware implementation levels. Four types of parallelism are considered to model the parameterized design, in terms of the input feature map, the output feature map, the layer, and the convolution kernel. Meanwhile, a library covering the convolution layer, fully-connected layer, pooling layer, and control module is established to cater for various CNN models. Further, an algorithm is proposed to find an optimal level of parallelism under limited resources. As a case study, four typical CNNs are implemented on Stratix III EP3SL110, taking up on-chip memory. Compared with some existing works using the automated design flow, the implementations obtained by the proposed approach have achieved up to 17.13× GOPS. 
By our best estimate, our design has also achieved 1.33× resource efficiency and 3.61× power efficiency.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133365176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 6: High Level Synthesis 2","authors":"G. Constantinides","doi":"10.1145/3252941","DOIUrl":"https://doi.org/10.1145/3252941","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"178 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114245050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Window Generation for the Intel Broadwell+Arria 10 and High-Bandwidth FPGA Systems","authors":"G. Stitt, Abhay Gupta, Madison N. Emas, David Wilson, A. Baylis","doi":"10.1145/3174243.3174262","DOIUrl":"https://doi.org/10.1145/3174243.3174262","url":null,"abstract":"Emerging FPGA systems are providing higher external memory bandwidth to compete with GPU performance. However, because FPGAs often achieve parallelism through deep pipelines, traditional FPGA design strategies do not necessarily scale well to large amounts of replicated pipelines that can take advantage of higher bandwidth. We show that sliding-window applications, an important subset of digital signal processing, demonstrate this scalability problem. We introduce a window generator architecture that enables replication to over 330 GB/s, which is an 8.7x improvement over previous work. We evaluate the window generator on the Intel Broadwell+Arria10 system for 2D convolution and show that for traditional convolution (one filter per image), our approach outperforms a 12-core Xeon Broadwell E5 by 81x and a high-end Nvidia P6000 GPU by an order of magnitude for most input sizes, while improving energy by 15.7x. 
For convolutional neural nets (CNNs), we show that although the GPU and Xeon typically outperform existing FPGA systems, projected performances of the window generator running on FPGAs with sufficient bandwidth can outperform high-end GPUs for many common CNN parameters.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115767564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of OpenCL Performance-oriented Optimizations for Streaming Kernels on the FPGA: (Abstract Only)","authors":"Zheming Jin, H. Finkel","doi":"10.1145/3174243.3174967","DOIUrl":"https://doi.org/10.1145/3174243.3174967","url":null,"abstract":"Since FPGAs can run streaming applications efficiently and High-Level Synthesis (HLS) tools allow people without complex hardware design knowledge to evaluate an application on FPGAs, there is an opportunity and a need to understand where OpenCL and FPGAs can play in the streaming domains. To this end, we evaluate the overhead of the OpenCL infrastructure on the Nallatech 385A FPGA board that features an Arria 10 GX1150 FPGA. Then we explore the implementation space and discuss the performance optimization techniques for the streaming kernels using the OpenCL-to-FPGA HLS tool. On the target platform, the infrastructure overhead requires 12% of the FPGA memory and logic resources. The latency of the single work-item kernel execution is 11 us and the maximum frequency of a kernel implementation is around 300 MHz. The experimental results of the streaming kernels show that FPGA resources, such as block RAMs and DSPs, can limit the kernel performance before the constraint of memory bandwidth takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve the kernel performance by a factor of 2 to 10. The combination of the two techniques can achieve the best performance. 
To improve the performance of compute unit duplication, the local work size needs to be tuned and the optimal value can increase the performance by a factor of 3 to 70 compared to the default value.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123574605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}