{"title":"Kung Fu Data Energy - Minimizing Communication Energy in FPGA Computations","authors":"E. Kadrić, K. Mahajan, A. DeHon","doi":"10.1109/FCCM.2014.66","DOIUrl":"https://doi.org/10.1109/FCCM.2014.66","url":null,"abstract":"The energy in FPGA computations can be dominated by data communication energy, either in the form of memory references or data movement on interconnect (e.g., over 75% of energy for single processor Gaussian Mixture Modeling, Window Filtering, and FFT). In this paper, we explore how to use data placement and parallelism to reduce communication energy. We further introduce a new architecture for embedded memories, the Continuous Hierarchy Memory (CHM), and show that it increases the opportunities to reduce energy by strategic data placement. For three common FPGA tasks in signal and image processing (Gaussian Mixture Modeling, Window Filters, and FFTs), we show that data movement energy can vary over a factor of 9. The best solutions exploit parallelism and hierarchy and are 1.8-6.0× more energy-efficient than designs that place all data in a large memory bank. With the CHM, we can get an additional 10% improvement for full voltage logic and 30-80% when operating the computation at reduced voltage.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116010305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Cong, Hui Huang, Chiyuan Ma, Bingjun Xiao, Peipei Zhou
{"title":"A Fully Pipelined and Dynamically Composable Architecture of CGRA","authors":"J. Cong, Hui Huang, Chiyuan Ma, Bingjun Xiao, Peipei Zhou","doi":"10.1109/FCCM.2014.12","DOIUrl":"https://doi.org/10.1109/FCCM.2014.12","url":null,"abstract":"Future processor chips will not be limited by the transistor resources, but will be mainly constrained by energy efficiency. Reconfigurable fabrics bring higher energy efficiency than CPUs via customized hardware that adapts to user applications. Among different reconfigurable fabrics, coarse-grained reconfigurable arrays (CGRAs) can be even more efficient than fine-grained FPGAs when bit-level customization is not necessary in target applications. CGRAs were originally developed in the era when transistor resources were more critical than energy efficiency. Previous work shares hardware among different operations via modulo scheduling and time multiplexing of processing elements. In this work, we focus on an emerging scenario where transistor resources are rich. We develop a novel CGRA architecture that enables full pipelining and dynamic composition to improve energy efficiency by taking full advantage of abundant transistors. Several new design challenges are solved. We implement a prototype of the proposed architecture in a commodity FPGA chip for verification. Experiments show that our architecture can fully exploit the energy benefits of customization for user applications in the scenario of rich transistor resources.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125208214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast, Power-Efficient Biophotonic Simulations for Cancer Treatment Using FPGAs","authors":"Jeffrey Cassidy, L. Lilge, Vaughn Betz","doi":"10.1109/FCCM.2014.45","DOIUrl":"https://doi.org/10.1109/FCCM.2014.45","url":null,"abstract":"Biophotonics, the study of light propagation through living tissue, is important for many medical applications ranging from imaging and detection through therapy for conditions such as cancer. Effective medical use of light depends on simulating its propagation through highly-scattering tissue. Monte Carlo simulation of photon migration has been adopted as the “gold standard” for its ability to capture complicated geometries and model all of the relevant problem physics. This accuracy and generality comes at a high computational cost, which limits the technique's utility. Greatly generalizing previous work, we present the first and only hardware-accelerated Monte Carlo biophotonic simulator that can accept complicated geometries described by tetrahedral meshes. Implemented on an Altera Stratix V FPGA, it achieves high performance (4x) and extremely high energy efficiency (67x) compared to a tightly-optimized multi-threaded CPU implementation, with demonstrated potential to expand the performance gains even further to 15-20x, which would enable important clinical and research applications.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124334833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Aguilar-Pelaez, Samuel Bayliss, Alex I. Smith, F. Winterstein, D. Ghica, David B. Thomas, G. Constantinides
{"title":"Compiling Higher Order Functional Programs to Composable Digital Hardware","authors":"E. Aguilar-Pelaez, Samuel Bayliss, Alex I. Smith, F. Winterstein, D. Ghica, David B. Thomas, G. Constantinides","doi":"10.1109/FCCM.2014.69","DOIUrl":"https://doi.org/10.1109/FCCM.2014.69","url":null,"abstract":"This work demonstrates the capabilities of a high-level synthesis tool-chain that allows the compilation of higher order functional programs to gate-level hardware descriptions. Higher order programming allows functions to take functions as parameters. In a hardware context, the latency-insensitive interfaces generated between compiled modules enable late-binding with libraries of pre-existing functions at the place-and-route compilation stage. We demonstrate the completeness and utility of our approach using a case study; a recursive k-means clustering algorithm. The algorithm features complex data-dependent control flow and opportunities to exploit both coarse and fine-grained parallelism.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125089477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Algorithm for Carry-Free Addition of Binary Signed-Digit Numbers","authors":"K. Schneider, Adrian Willenbücher","doi":"10.1109/FCCM.2014.24","DOIUrl":"https://doi.org/10.1109/FCCM.2014.24","url":null,"abstract":"Signed-digit (SD) numbers generalize traditional radix numbers by allowing negative digits within a certain range. Typically, this leads to redundant number representations that can be used to avoid the carry propagation problem of addition of radix numbers. Unfortunately, as proved by Avizienis, the standard algorithm for carry-free addition of SD numbers does not work for the binary case. In this paper, we therefore construct a special algorithm for the carry-free addition and subtraction of binary SD numbers, i.e., addition and subtraction of n-digit numbers are performed with circuits of depth O(1) and size O(n). This is possible by computing in addition to the transfer digits used by the standard algorithm one additional bit that allows us to distinguish relevant cases to avoid propagation of dependencies. The additional bit and the transfer digit used to compute the sum digit at position i depend only on the summands' digits at positions i and i - 1 so that all sum digits can be computed with a hardware circuit of a depth that is independent of the number of digits. We first explain the basics of the standard addition algorithm to derive the additional information needed to fix the algorithm for the binary case. After proving the correctness of our algorithm, we present experimental results that show that our implementation clearly outperforms two's complement addition even for small numbers, and saves 50% of the required chip area compared to other carry-free implementations.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116557388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experiments in Mapping Expressions to DSP Blocks","authors":"Bajaj Ronak, Suhaib A. Fahmy","doi":"10.1109/FCCM.2014.34","DOIUrl":"https://doi.org/10.1109/FCCM.2014.34","url":null,"abstract":"Mapping complex mathematical expressions to DSP blocks by relying on synthesis from pipelined code is inefficient and results in significantly reduced throughput. We have developed a tool to demonstrate the benefit of considering the structure and pipeline arrangement of the DSP block in mapping of functions. Implementations where the structure of the DSP block is considered during pipelining achieve double the throughput of other methods, demonstrating that the structure of the DSP block must be considered when scheduling complex expressions.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132194302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Implementation of EM Algorithm for 3D CT Reconstruction","authors":"Young-kyu Choi, J. Cong, Di Wu","doi":"10.1109/FCCM.2014.48","DOIUrl":"https://doi.org/10.1109/FCCM.2014.48","url":null,"abstract":"Although the expectation maximization (EM)based 3D computed tomography (CT) reconstruction algorithm lowers radiation exposure, its long execution time hinders practical usage. To accelerate this process, we introduce a novel external memory bandwidth reduction strategy by reusing both the sinogram and the voxel intensity. Also, a customized computing engine based on field-programmable gate array (FPGA) is presented to increase the effective memory bandwidth. Experiments on actual patient data show that 85X speedup can be achieved over single-threaded CPU.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130555550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping Tasks to a Dynamically Reconfigurable Coarse Grained Array","authors":"M. S. Moghaddam, K. Paul, M. Balakrishnan","doi":"10.1109/FCCM.2014.20","DOIUrl":"https://doi.org/10.1109/FCCM.2014.20","url":null,"abstract":"Coarse-Grained Reconfigurable Architectures (CGRAs) have become popular in recent times as the increased transistor densities have enabled greater integration of increasingly complex “compute cores”. These devices pack massive compute power and can be effectively used to build efficient solutions for applications which have a significant degree of parallelism. In many cases, these CGRAs are also partially reconfigurable. Clearly to make effective use of these highly “parallel compute platforms”, a good mapping flow is required to map the parallelism that is present in a target application.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122125006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Nurvitadhi, G. Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, J. Hoe, José F. Martínez, Carlos Guestrin
{"title":"GraphGen: An FPGA Framework for Vertex-Centric Graph Computation","authors":"E. Nurvitadhi, G. Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, J. Hoe, José F. Martínez, Carlos Guestrin","doi":"10.1109/FCCM.2014.15","DOIUrl":"https://doi.org/10.1109/FCCM.2014.15","url":null,"abstract":"Vertex-centric graph computations are widely used in many machine learning and data mining applications that operate on graph data structures. This paper presents GraphGen, a vertex-centric framework that targets FPGA for hardware acceleration of graph computations. GraphGen accepts a vertex-centric graph specification and automatically compiles it onto an application-specific synthesized graph processor and memory system for the target FPGA platform. We report design case studies using GraphGen to implement stereo matching and handwriting recognition graph applications on Terasic DE4 and Xilinx ML605 FPGA boards. Results show up to 14.6× and 2.9× speedups over software on Intel Core i7 CPU for the two applications, respectively.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126181955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sven Hager, F. Winkler, B. Scheuermann, Klaus Reinhardt
{"title":"Building Optimized Packet Filters with COFFi","authors":"Sven Hager, F. Winkler, B. Scheuermann, Klaus Reinhardt","doi":"10.1109/FCCM.2014.38","DOIUrl":"https://doi.org/10.1109/FCCM.2014.38","url":null,"abstract":"Many companies and institutions employ packet filter firewalls in order to effectively regulate network traffic. Unfortunately, the constant growth of network bandwidth makes the task of matching packet headers against potentially large rulesets more difficult, and prohibits the sole use of entirely software-based firewalls which cannot cope with such huge amounts of traffic. Instead, high-speed firewalls are often implemented in ASICs which offer a high degree of parallelism, many opportunities for operation pipelining, and low-latency access to network data. However, due to their static nature, ASICs must provide generic filtering circuitry that is hardly able to take full advantage of firewall ruleset properties, thus leading to a waste of hardware resources.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129301929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}