{"title":"An Efficient Architecture for Floating-Point Eigenvalue Decomposition","authors":"Xinying Wang, Joseph Zambreno","doi":"10.1109/FCCM.2014.27","DOIUrl":"https://doi.org/10.1109/FCCM.2014.27","url":null,"abstract":"Eigenvalue decomposition (EVD) is a widely-used factorization tool to perform principal component analysis, and has been employed for dimensionality reduction and pattern recognition in many scientific and engineering applications, such as image processing, text mining and wireless communications. EVD is considered computationally expensive, and as software implementations have not been able to meet the performance requirements of many real-time applications, the use of reconfigurable computing technology has shown promise in accelerating this type of computation. In this paper, we present an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices. Our experimental results using an FPGA-based hybrid acceleration system indicate the efficiency of our novel array architecture, with dimension-dependent speedups over an optimized software implementation that range from 1.5× to 15.45× in terms of computation time.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122333640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Throughput Fixed-Point Object Detection on FPGAs","authors":"Xiaoyin Ma, W. Najjar, A. Roy-Chowdhury","doi":"10.1109/FCCM.2014.40","DOIUrl":"https://doi.org/10.1109/FCCM.2014.40","url":null,"abstract":"Computer vision applications make extensive use of floating-point number representation, both single and double precision. The major advantage of floating-point representation is the very large range of values that can be represented with a limited number of bits. Most CPU and all GPU designs have been extensively optimized for short latency and high-throughput processing of floating-point operations. On an FPGA, the bit-width of operands is a major determinant of its resource utilization, the achievable clock frequency and hence its throughput. By using a fixed-point representation with fewer bits, an application developer could implement more processing units and achieve a higher clock frequency and a dramatically larger throughput. However, smaller bit-widths may lead to inaccurate or incorrect results. Object and human detection are fundamental problems in computer vision and a very active research area. In these applications a high throughput and an economy of resources are highly desirable features allowing the applications to be embedded in mobile or field-deployable equipment. The Histogram of Oriented Gradients (HOG) algorithm [1], developed for human detection and expanded to object detection, is one of the most successful and popular algorithms in its class. In this algorithm, object descriptors are extracted from a detection window with grids of overlapping blocks. Each block is divided into cells in which histograms of intensity gradients are collected as HOG features. Vectors of histograms are normalized and passed to a Support Vector Machine (SVM) classifier to recognize a person or an object.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114568407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Better-Than-DMR Techniques for Yield Improvement","authors":"S. Sanae, Yuko Hara-Azumi, S. Yamashita, Y. Nakashima","doi":"10.1109/FCCM.2014.21","DOIUrl":"https://doi.org/10.1109/FCCM.2014.21","url":null,"abstract":"In this work, we first study LUT optimization in PPCs to increase their area-efficiency for yield improvement. We focus on the fact that although 2^(2^n) configurations are available for an n-input LUT, such full programmability is not needed, i.e., one configuration is enough for bypassing one specific fault. Then, we optimize away the excess programmability of LUTs by exploiting application features in order to reduce the area cost without degrading the fault bypassability relative to the original PPC.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127588310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication","authors":"J. Fowers, Kalin Ovtcharov, K. Strauss, Eric S. Chung, G. Stitt","doi":"10.1109/FCCM.2014.23","DOIUrl":"https://doi.org/10.1109/FCCM.2014.23","url":null,"abstract":"Sparse matrix-vector multiplication (SMVM) is a crucial primitive used in a variety of scientific and commercial applications. Despite having significant parallelism, SMVM is a challenging kernel to optimize due to its irregular memory access characteristics. Numerous studies have proposed the use of FPGAs to accelerate SMVM implementations. However, most prior approaches focus on parallelizing multiply-accumulate operations within a single row of the matrix (which limits parallelism if rows are small) and/or make inefficient use of the memory system when fetching matrix and vector elements. In this paper, we introduce an FPGA-optimized SMVM architecture and a novel sparse matrix encoding that explicitly exposes parallelism across rows, while keeping the hardware complexity and on-chip memory usage low. This system compares favorably with prior FPGA SMVM implementations. For the over 700 University of Florida sparse matrices we evaluated, it also performs within about two thirds of CPU SMVM performance on average, even though it has 2.4x lower DRAM memory bandwidth, and within almost one third of GPU SMVM performance on average, even at 9x lower memory bandwidth. Additionally, it consumes only 25W, for power efficiencies 2.6x and 2.3x higher than CPU and GPU, respectively, based on maximum device power.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133045431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Hard Adders and Carry Chains in FPGAs","authors":"J. Luu, Conor McCullough, Sen Wang, Safeen Huda, Bo Yan, Charles Chiasson, K. Kent, J. Anderson, Jonathan Rose, Vaughn Betz","doi":"10.1109/FCCM.2014.25","DOIUrl":"https://doi.org/10.1109/FCCM.2014.25","url":null,"abstract":"Hardened adder and carry logic is widely used in commercial FPGAs to improve the efficiency of arithmetic functions. There are many design choices and complexities associated with such hardening, including circuit design, FPGA architectural choices, and the CAD flow. There has been very little study, however, on these choices and hence we explore a number of possibilities for hard adder design. We also highlight optimizations during front-end elaboration that help ameliorate the restrictions placed on logic synthesis by hardened arithmetic. We show that hard adders and carry chains, when used for simple adders, increase performance by a factor of four or more, but on larger benchmark designs that contain arithmetic, improve overall performance by roughly 15%. We measure an average area increase of 5% for architectures with carry chains but believe that better logic synthesis should reduce this penalty. Interestingly, we show that adding dedicated inter-logic-block carry links or fast carry look-ahead hardened adders results in only minor delay improvements for complete designs.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124040923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From GPU to FPGA: A Pipelined Hierarchical Approach to Fast and Memory-Efficient NDN Name Lookup","authors":"Yanbiao Li, Dafang Zhang, Xian Yu, Jing Long, W. Liang","doi":"10.1109/FCCM.2014.39","DOIUrl":"https://doi.org/10.1109/FCCM.2014.39","url":null,"abstract":"Summary form only given. Named Data Networking (NDN) is an emerging future Internet architecture with an alternative communication paradigm. For NDN, name lookup, just like IP address lookup for TCP/IP, plays an important role in forwarding. However, performing Longest Prefix Matching (LPM) on NDN names is more challenging. Recently, Graphics Processing Units (GPUs) have been shown to be of value in supporting wire-speed name lookup, but the latency resulting from batching and transferring names is not so encouraging. On the other hand, in the area of IP address lookup, FPGAs are widely used to implement Static Random Access Memory (SRAM)-based pipelines for fast lookup and controllable latency. Thus, in this paper, we study how to accelerate NDN name lookup using an FPGA-based pipeline.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130937704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GROK-INT: Generating Real On-Chip Knowledge for Interconnect Delays Using Timing Extraction","authors":"Benjamin Gojman, A. DeHon","doi":"10.1109/FCCM.2014.31","DOIUrl":"https://doi.org/10.1109/FCCM.2014.31","url":null,"abstract":"With continued scaling, all transistors are no longer created equal. The delay of a length 4 horizontal routing segment at coordinates (23,17) will differ from one at (12,14) in the same FPGA and from the same segment in another FPGA. The vendor tools give conservative values for these delays, but knowing exactly what these delays are can be invaluable. In this paper, we show how to obtain this information, inexpensively, using only components that already exist on the FPGA (configurable PLLs, registers, logic, and interconnect). The techniques we present are general and can be used to measure the delays of any resource on any FPGA with these components. We provide general algorithms for identifying the set of useful delay components, the set of measurements necessary to compute these delay components, and the calculations necessary to perform the computation. We demonstrate our techniques on the interconnect for an Altera Cyclone III (65nm). As a result, we are able to quantify over a 100 ps spread in delays for nominally identical routing segments on a single FPGA.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116078948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Harmonica: An FPGA-Based Data Parallel Soft Core","authors":"C. Kersey, S. Yalamanchili, Hyojong Kim, Nimit Nigania, Hyesoon Kim","doi":"10.1109/FCCM.2014.53","DOIUrl":"https://doi.org/10.1109/FCCM.2014.53","url":null,"abstract":"General-purpose GPUs or GPGPUs have taken their place in the market, being present in 38 of the Top 500 supercomputers [5]. In the same way that the emergence of FPGAs in the 1980s led to a demand for soft cores with instruction sets similar to the CPUs of the day, we anticipate a similar demand in the 2010s for soft cores with GPGPU instruction sets. These architectures are distinguished by their SIMT (single-instruction-multiple-thread) execution model, achieving throughput by running multiple threads of execution simultaneously across multiple functional units, keeping separate register values for each lane of execution.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115953256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Architectural Approach to Characterizing and Eliminating Sources of Inefficiency in a Soft Processor Design","authors":"Kaveh Aasaraai, Andreas Moshovos","doi":"10.1109/FCCM.2014.51","DOIUrl":"https://doi.org/10.1109/FCCM.2014.51","url":null,"abstract":"This work takes an architectural approach to systematically characterize components and mechanisms that are the main sources of low operating clock frequency when implementing a typical pipelined general purpose processor on an FPGA. Several previous works have addressed specific implementation inefficiencies, but mostly on a case-by-case basis. Accordingly, there is a need to systematically characterize the sources of inefficiency in soft processor designs. Such a characterization serves to deepen our understanding of FPGA implementation trade-offs and can serve as the starting point for developing FPGA-friendly designs that achieve higher performance and/or lower area. We start with a typical 5-stage pipelined architecture that is optimized for custom logic implementation and that focuses on correctness, modularity, and speed of development.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"28 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114097031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory Optimized Re-gridding for Non-uniform Fast Fourier Transform on FPGAs","authors":"Umer I. Cheema, G. Nash, R. Ansari, A. Khokhar","doi":"10.1109/FCCM.2014.35","DOIUrl":"https://doi.org/10.1109/FCCM.2014.35","url":null,"abstract":"Summary form only given. The Discrete Fourier Transform (DFT) can be viewed as the Fourier Transform of a periodic and regularly sampled signal as commonly defined in equation 1. The Non-Uniform Discrete Fourier Transform (NuDFT) is a generalization of the DFT for data that may not be regularly sampled in spatial or temporal dimensions. This flexibility allows for benefits in situations where sensor placement cannot be guaranteed to be regular or where prior knowledge of the informational content could allow for better sampling patterns than a regular one. NuDFT is used in applications such as Synthetic Aperture Radar (SAR), Computed Tomography (CT), and Magnetic Resonance Imaging (MRI). The NuDFT definition is shown in equation 2. Here the sample locations are points s_i in the set S. Each point s_i has a complex value consisting of location or frequency components s_ix and s_iy. The location or frequency components are, of course, not restricted to a discrete sampling grid.","PeriodicalId":246162,"journal":{"name":"2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130985734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}