{"title":"Floating-Point DSP Block Architecture for FPGAs","authors":"M. Langhammer, B. Pasca","doi":"10.1145/2684746.2689071","DOIUrl":"https://doi.org/10.1145/2684746.2689071","url":null,"abstract":"This work describes the architecture of a new FPGA DSP block supporting both fixed and floating point arithmetic. Each DSP block can be configured to provide one single precision IEEE-754 floating multiplier and one IEEE-754 floating point adder, or when configured in fixed point mode, the block is completely backwards compatible with current FPGA DSP blocks. The DSP block operating frequency is similar in both modes, in the region of 500MHz, offering up to 2 GMACs fixed point and 1 GFLOPs performance per block. In floating point mode, support for multi-block vector modes is provided, where multiple blocks can be seamlessly assembled into any size real or complex dot products. By efficient reuse of the fixed point arithmetic modules, as well as the fixed point routing, the floating point features have only minimal power and area impact. We show how these blocks are implemented in a modern Arria 10 FPGA family, offering over 1 TFLOPs using only embedded structures, and how scaling to multiple TFLOPs densities is possible for planned devices.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"372 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120877789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Complete Decision Support Queries Through High-Level Synthesis Technology (Abstract Only)","authors":"Gorker Alp Malazgirt, Nehir Sönmez, A. Yurdakul, O. Unsal, A. Cristal","doi":"10.1145/2684746.2689151","DOIUrl":"https://doi.org/10.1145/2684746.2689151","url":null,"abstract":"Recently, with the rise of Internet of Things and Big Data, acceleration of database analytics in order to have faster query processing capabilities has gained significant attention. At the same time, High-Level Synthesis (HLS) technology has matured and is now a promising approach to design such hardware accelerators. In this work, we use a modern HLS tool, Vivado, to design high-performance database accelerators for filtering, aggregation, sorting, merging and join operations. Later, we use these as building blocks to implement an acceleration system for in-memory databases on a Virtex-7 FPGA, detailed enough to run full TPC-H benchmarks completely in hardware. Presenting performance, area and memory requirements, we show up to 140x speedup compared to a software DBMS, and demonstrate that HLS technology is indeed a very appropriate match for database acceleration.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132740103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Level Design Tools for Floating Point FPGAs","authors":"Deshanand P. Singh, B. Pasca, Tomasz S. Czajkowski","doi":"10.1145/2684746.2689079","DOIUrl":"https://doi.org/10.1145/2684746.2689079","url":null,"abstract":"This tutorial describes tools for efficiently implementing floating point applications on FPGAs. We present both the SDK for OpenCL and DSP Builder Advanced Blockset and show that they can be effectively used to implement many floating point applications. The methods for optimizing application performance are also described. In this tutorial we focus on a few applications, including Fast Fourier transform, matrix multiplication, finite impulse response filter and a Cholesky decomposition. In all cases we show what the tools are capable of achieving, and more importantly how a user can take advantage of the various floating-point centric features that are made available. We also discuss how these tools can automatically use FPGA architectural features such as the hardened floating-point DSP available on the Altera Arria 10 family.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127585974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unlocking FPGAs Using High Level Synthesis Compiler Technologies","authors":"F. Martinez-Vallina, Henry Styles","doi":"10.1145/2684746.2721403","DOIUrl":"https://doi.org/10.1145/2684746.2721403","url":null,"abstract":"FPGA devices have long been the standard for massively parallel computing fabrics with a low power footprint. Unfortunately, the complexity associated with an FPGA design has limited the rate of adoption by software application programmers. Recent advances in compiler and FPGA fabric capabilities are reversing this trend and there is a growing adoption of FPGAs for algorithmic workloads such as data analytics, feature detection in images, adaptive beam forming, etc. One of the pillars of this shift is the Vivado HLS compiler, which enables the compilation of algorithms captured in C and C++ into efficient FPGA implementations. This talk focuses on how the HLS compiler creates algorithm specific compute architectures and how these elements are then used in an OpenCL based system level design abstraction. The evolution of these hardware design abstractions into software centric specifications enables application developers to leverage the flexibility of the FPGA fabric without the constraints typically found in fixed parallel architectures such as multi-core CPUs/GPUs.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130510303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Efficient High-Order FIR Filtering through Reconfigurable Stochastic Processing (Abstract Only)","authors":"Mohammed Alawad, Mingjie Lin","doi":"10.1145/2684746.2689129","DOIUrl":"https://doi.org/10.1145/2684746.2689129","url":null,"abstract":"High-order FIR filtering is widely used in many important DSP applications in order to achieve filtering stability and linear-phase property. This paper presents a hardware- and energy-efficient approach to implementing high-order FIR filtering through reconfigurable stochastic processing. We exploit a basic probabilistic principle of summing independent random variables to achieve approximate FIR filtering without costly multiplications. Our new multiplierless approach has two distinctive advantages when compared with the conventional multiplier-based or DA-based FIR filtering methods. First, our new probabilistic architecture is especially effective for high-order FIR filtering because it bypasses costly multiplications and does not rely on a large memory to store pre-computed coefficient products. Second, this new probabilistic convolver is significantly more robust or fault tolerant than the conventional architecture because all signal values will be represented and computed probabilistically, and local signal corruption cannot easily destroy the overall probabilistic patterns, therefore achieving much higher error tolerance. For example, for a standard 128-tap FIR filter, our proposed architecture consumes about 9 times and 4 times less power than the conventional multiplier-based and DA-based designs, respectively. Additionally, when compared with the state-of-the-art systolic DA-based design, our design can achieve about a 3 times reduction in hardware usage.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122949257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Memory Architecture on FPGA Energy Consumption","authors":"E. Kadrić, David Lakata, A. DeHon","doi":"10.1145/2684746.2689062","DOIUrl":"https://doi.org/10.1145/2684746.2689062","url":null,"abstract":"FPGAs have the advantage that a single component can be configured post-fabrication to implement almost any computation. However, designing a one-size-fits-all memory architecture causes an inherent mismatch between the needs of the application and the memory sizes and placement on the architecture. Nonetheless, we show that an energy-balanced design for FPGA memory architecture (memory block size(s), memory banking, and spacing between memory banks) can guarantee that the energy is always within a factor of 2 of the optimally-matched architecture. On a combination of the VTR 7 benchmarks and a set of tunable benchmarks, we show that an architecture with internally-banked 8Kb and 256Kb memory blocks has a 31% worst-case energy overhead (8% geomean). In contrast, monolithic 16Kb memories (comparable to 18Kb and 20Kb memories used in commercial FPGAs) have a 147% worst-case energy overhead (24% geomean). Furthermore, on benchmarks where we can tune the parallelism in the implementation to improve energy (FFT, Matrix-Multiply, GMM, Sort, Window Filter), we show that we can reduce the energy overhead by another 13% (25% for the geomean).","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115856128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design Space Exploration of L1 Data Caches for FPGA-Based Multiprocessor Systems","authors":"Eric Matthews, Nicholas C. Doyle, Lesley Shannon","doi":"10.1145/2684746.2689083","DOIUrl":"https://doi.org/10.1145/2684746.2689083","url":null,"abstract":"Combining multi-processing with the high level of configurability possible with FPGA-based soft-processors, this paper presents a multiprocessing framework based on the MicroBlaze soft-processor that provides multicore support and fully coherent, independently configurable Level 1 Caches with Linux multicore support. This architecture allows for fine-grain configurability of the system, allowing for FPGA resources to be better optimized for a specific embedded application. We use our framework to explore the L1 Data Cache configuration, developing a metric for efficiency based on resource usage and static application runtime. We find that a Pseudo-Random replacement policy is consistently the more efficient choice for FPGA systems.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124835498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software-Driven Hardware Development","authors":"Myron King, Jamey Hicks, J. Ankcorn","doi":"10.1145/2684746.2689064","DOIUrl":"https://doi.org/10.1145/2684746.2689064","url":null,"abstract":"The cost and complexity of hardware-centric systems can often be reduced by using software to perform tasks which don't appear on the critical path. Alternately, the performance of software can sometimes be improved by using special purpose hardware to implement tasks which do appear on the critical path. Whatever the motivation, most modern systems are composed of both hardware and software components. Given the importance of the connection between hardware and software in these systems, it is surprising how little automated and machine-checkable support there is for co-design space exploration. This paper presents the Connectal framework, which enables the development of hardware accelerators for software applications by generating hardware/software interface implementations from abstract Interface Design Language (IDL) specifications. Connectal generates stubs to support asynchronous remote method invocation from software to software, hardware to software, software to hardware, and hardware to hardware. For high-bandwidth communication, the Connectal framework provides comprehensive support for shared memory between hardware and software components, removing the repetitive work of processor bus interfacing from project tasks. This framework is released as open software under an MIT license, making it available for use in any project.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"189 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123813564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Designer's Day Session 2","authors":"Zhiru Zhang","doi":"10.1145/3251647","DOIUrl":"https://doi.org/10.1145/3251647","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"24 3 Suppl 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123961282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging Architecture and Programming for Throughput-Oriented Vision Processing (Abstract Only)","authors":"Amir Momeni, H. Tabkhi, G. Schirner, D. Kaeli","doi":"10.1145/2684746.2689140","DOIUrl":"https://doi.org/10.1145/2684746.2689140","url":null,"abstract":"With the expansion of OpenCL support across many heterogeneous devices (including FPGAs, GPUs and CPUs), the programmability of these systems has been significantly increased. At the same time, new questions arise about which device should be targeted for each OpenCL software kernel. Once we select a device, then we are left to customize the application, selecting the right granularity of parallelism and frequency of host-to-device communication. In this paper, we study the impact of source-level decisions on the overall execution time when developing OpenCL programs across different heterogeneous devices. We focus on two mainstream architecture classes (GPUs and FPGAs), and consider throughput-oriented advanced vision processing. To guide this exploration, we propose a new vertical classification for selecting the grain of parallelism for advanced vision processing applications. To carry out this study we have selected the Mean-shift object tracking algorithm as a representative candidate of advanced vision algorithms. Overall, our evaluation demonstrates that fine-grained parallelism can greatly benefit FPGA execution (up to a 4X speed-up), while a combination of coarse-grained and fine-grained parallelism achieves the best performance on a GPU (up to a 6X speed-up). Also, there can be a large benefit if we can execute both the parallel and serial parts of the program on an FPGA (up to a 21X speed-up).","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124154473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}