{"title":"Increasing Network Size and Training Throughput of FPGA Restricted Boltzmann Machines Using Dropout","authors":"Jiang Su, David B. Thomas, P. Cheung","doi":"10.1109/FCCM.2016.23","DOIUrl":"https://doi.org/10.1109/FCCM.2016.23","url":null,"abstract":"Restricted Boltzmann Machines (RBMs) are widely used in modern machine learning tasks. Existing implementations are limited in network size and training throughput by available DSP resources. In this work we propose a new algorithm and architecture for FPGAs called dropout-RBM (dRBM) system. Compared to the state-of-art design methods on the same FPGA, dRBM with a dropout rate 0.5 doubles the maximum affordable network size using only half of DSP and BRAM resources. This is achieved by an application of a technique called dropout, which is a relatively new method used to avoid overfitting of data. Here we instead apply dropout as a technique for reducing the required DSPs and BRAM resources, while also having the side-effect of increasing robustness of training. Also to improve the processing throughput, we propose a multi-mode matrix multiplication module that maximizes the DSP efficiency. For the MNIST classificationbenchmark, a Stratix IV EP4SGX530 FPGA running dRBM is 34x faster than a single-precision Matlab implementation running on Intel i7 2.9GHz CPU.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129241512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication Optimization for the 16-Core Epiphany Floating-Point Processor Array","authors":"Nachiket Kapre, Siddhartha","doi":"10.1109/FCCM.2016.15","DOIUrl":"https://doi.org/10.1109/FCCM.2016.15","url":null,"abstract":"The management and optimization of communication in an NoC-based (network-on-chip) bespoke computing platform such as the Parallella (Zynq 7010 + Epiphany-III SoC) is critical for performance and energy-efficiency of floating-point bulk-synchronous workloads. In this paper, we explore the opportunities and capabilities of the Epiphany-III SoC for communication-intensive workloads. Using our communication support library for the Epiphany, we are able to accelerate single-precision BSP workloads like the Sparse Matrix-Vector multiplication (SpMV) on Matrix Market datasets by up to 6.5× and PageRank algorithm on the BerkStan SNAP dataset by up to 8×, while lowering power usage by 2× over optimized ARM-based implementations. When compared to optimized OpenMP x86 mappings, we observe a ≈10× improvement in energy efficiency (GFLOP/s/W) with Epiphany SoC.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133566279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Bandwidth Reduction Using Dynamic Partial Reconfiguration","authors":"Seyyed Mahdi Najmabadi, Zhe Wang, Y. Baroud, S. Simon","doi":"10.1109/FCCM.2016.49","DOIUrl":"https://doi.org/10.1109/FCCM.2016.49","url":null,"abstract":"Online compression of I/O-data streams in Custom Computing Machines will enhance the effective network band-width of computing systems as well as storage bandwidth and capacity. In this paper a self-adaptive dynamic partial reconfigurable architecture for online compression is proposed. The proposed architecture will bring new possibilities in online compression due to its adaptivity to dynamic environments. In this study, network traffic, and input data distribution are considered as two dynamic behaviors. The degree of improvement provided by the architecture depends on data distribution, bandwidth, and available resources. Our analysis shows an improvement of up to 20% in compression ratios in comparison to non-adaptive approaches.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133801703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelism for High-Performance Tsunami Simulation with FPGA: Spatial or Temporal?","authors":"Kohei Nagasu, K. Sano, Fumiya Kono, N. Nakasato, A. Vazhenin, S. Sedukhin","doi":"10.1109/FCCM.2016.19","DOIUrl":"https://doi.org/10.1109/FCCM.2016.19","url":null,"abstract":"To carry out fast but accurate tsunami simulation after a major earthquake, we have developed an FPGA-based custom computing machine for high-speed but low-power tsunami simulator. We design a stream processing element (SPE) which is hardware based on pipelining and data-flow for tsunami computation. This paper presents design-space exploration for spatial and temporal parallelism of SPEs.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130281048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Classification Accuracy of a Machine Learning Approach for FPGA Timing Closure","authors":"Que Yanghua, Nachiket Kapre, Harnhua Ng, K. Teo","doi":"10.1109/FCCM.2016.28","DOIUrl":"https://doi.org/10.1109/FCCM.2016.28","url":null,"abstract":"We can use Cloud Computing and Machine Learning to help deliver timing closure of FPGA designs using InTime [2], [3]. This approach requires no modification to the input RTL and relies exclusively on manipulating the CAD tool parameters that drive the optimization heuristics. By running multiple combinations of the parameters in parallel, we learn from results and identify which parameters caused an improvement in the final results. By systematically building a classification model and training it with the results of the parallel CAD runs, we can build an accurate estimation flow for helping identify which parameters are more likely to improve the timing. In this paper, we consider strategies for improving the predictive accuracy of our classifier models to help guide the CAD run towards timing convergence. With ensemble learning we are able to increase average AUC score from 0.74 to 0.79, which could also translate into 2.7× savings in machine learning effort.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122875208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The SMEM Seeding Acceleration for DNA Sequence Alignment","authors":"Mau-Chung Frank Chang, Yu-Ting Chen, J. Cong, Po-Tsang Huang, Chun-Liang Kuo, Cody Hao Yu","doi":"10.1109/FCCM.2016.21","DOIUrl":"https://doi.org/10.1109/FCCM.2016.21","url":null,"abstract":"The advance of next-generation sequencing technology has dramatically reduced the cost of genome sequencing. However, processing and analyzing huge amounts of data collected from sequencers introduces significant computation challenges, these have become the bottleneck in many research and clinical applications. For such applications, read alignment is usually one of the most compute-intensive steps. Billions of reads generated from the sequencer need to be aligned to the long reference genome. Recent state-of-the-art software read aligners follow the seed-andextend model. In this paper we focus on accelerating the first seeding stage, which generates the seeds using the supermaximal exact match (SMEM) seeding algorithm. The two main challenges for accelerating this process are 1) how to process a huge number of short reads with high throughput, and 2) how to hide the frequent and long random memory access when we try to fetch the value of the reference genome. In this paper, we propose a scalable array-based architecture, which is composed by many processing engines (PEs) to process large amounts of data simultaneously for the demand of high throughput. Furthermore, we provide a tight software/hardware integration that realizes the proposed architecture on the Intel-Altera HARP system. With a 16-PE accelerator engine, we accelerate the SMEM algorithm by 4x, and the overall SMEM seeding stage by 26% when compared with 16-thread CPU execution. 
We further analyze the performance bottleneck of the design due to extensive DRAM accesses and discuss the possible improvements that are worthwhile to be explored in the future.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121354963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Initiation Interval Aware Resource Sharing for FPGA DSP Blocks","authors":"Bajaj Ronak, Suhaib A. Fahmy","doi":"10.1109/FCCM.2016.40","DOIUrl":"https://doi.org/10.1109/FCCM.2016.40","url":null,"abstract":"Resource sharing attempts to minimise usage of hardware blocks by mapping multiple operations onto same block at the cost of an increase in schedule length and initiation interval (II). Sharing multi-cycle high-throughput DSP blocks using traditional approaches results in significantly high II, determined by structure of dataflow graph of the design, thus limiting achievable throughput. We have developed a resource sharing technique that minimises the number of DSP blocks and schedule length given an II constraint.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115078303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Hardware Merge Sorter","authors":"Wei Song, Dirk Koch, M. Luján, J. Garside","doi":"10.1109/FCCM.2016.34","DOIUrl":"https://doi.org/10.1109/FCCM.2016.34","url":null,"abstract":"Sorting has tremendous usage in the applications that handle massive amount of data. Existing techniques accelerate sorting using multiprocessors or GPGPUs where a data set is partitioned into disjunctive subsets to allow multiple sorting threads working in parallel. Hardware sorters implemented in FPGAs have the potential of providing high-speed and low-energy solutions but the partition algorithms used in software systems are so data dependent that they cannot be easily adopted. The speed of most current sequential sorters still hangs around 1 number/cycle. Recently a new hardware merge sorter broke this speed limit by merging a large number of sorted sequences at a speed proportional to the number of sequences. This paper significantly improves its area and speed scalability by allowing stalls and variable sorting rate. A 32-port parallel merge-tree that merges 32 sequences is implemented in a Virtex-7 FPGA. It merges sequences at an average rate of 31.05 number/cycle and reduces the total sorting time by 160 times compared with traditional sequential sorters.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126046970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Implementation of ECG Encryption and Identification on the Zynq SoC","authors":"Amine Ait Si Ali, X. Zhai, A. Amira, F. Bensaali, N. Ramzan","doi":"10.1109/FCCM.2016.44","DOIUrl":"https://doi.org/10.1109/FCCM.2016.44","url":null,"abstract":"This paper presents an innovative and safe connected health solution for human identification. The system consists of the encryption and decryption of ECG signals using the advanced encryption standard (AES) as well as the recognition of individuals based on ECG biometrics. Heterogeneous and efficient implementation of the proposed system has been performed on a Xilinx ZC702 Zynq based prototyping board. Various IP-cores have been created based on the high level synthesis (HLS) implementation of the AES cipher, AES decipher and ECG identification blocks. The proposed hardware implementation has shown promising results since it met the real-time requirements and outclassed current field programmable gate array (FPGA) based systems in multiple key metrics including power consumption, processing time and hardware resources usage. The implemented system needs 10.71 ms to process one ECG sample and consumes 107mW while using only 30% of all available on-chip resources.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124062128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Apache Spark Big Data Analysis with FPGAs","authors":"Ehsan Ghasemi, P. Chow","doi":"10.1109/FCCM.2016.33","DOIUrl":"https://doi.org/10.1109/FCCM.2016.33","url":null,"abstract":"Summary form only given. Apache Spark has become one of the most popular engines for big data processing. Spark provides a platform-independent, high-abstraction programming paradigm for large-scale data processing by leveraging the Java frame-work. Though it provides software portability across various machines, Java also limits the performance of distributed environments, such as Spark. While it may be unrealistic to rewrite platforms like Spark in a faster language, a more viable approach to mitigate its poor performance is to accelerate the computations while still working within the Java-based framework. This work demonstrates the feasibility of incorporating FPGA acceleration into Spark, and uses a MapReduce implementation of the k-means clustering algorithm to show that acceleration is possible even when using a hardware platform that is not well-optimized for performance. An important feature of our approach is that the use of FPGAs is completely transparent to the user through the use of library functions, which is a common way by which users access functions provided by Spark. 
Power users can further develop other computations using high-level synthesis.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124729316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}