{"title":"RapidPath: Accelerating Constrained Shortest Path Finding in Graphs on FPGA (Abstract Only)","authors":"Chao Wang, Xi Li, Qi Guo, Xuehai Zhou","doi":"10.1145/2684746.2689135","DOIUrl":"https://doi.org/10.1145/2684746.2689135","url":null,"abstract":"Emerging applications, such as Software Defined Network (SDN), Social Media, and Location Based System (LBS), are typical big graph based applications. Due to the explosive network flood, it is essential to speed up the computation process in big graph applications, of which the Constrained Shortest Path Finding (CSPF) algorithm is one of the most challenging parts. Meanwhile, FPGA has been an effective and efficient platform in novel big data architectures and systems, due to its computing power and low power consumption. It enables the researchers to deploy massive accelerators within one single chip. In this paper, we present RapidPath, an acceleration method for the CSPF algorithm in software defined networks, which decomposes a large and complex system of programs into small single-purpose source code libraries that perform specialized tasks in parallel. Only the CSPF step is implemented in hardware and the remaining steps run on the processor. We have built a prototyping system on Zynq with CSPF case studies. The ARM processor uses a shared memory with the FPGA based accelerator using DMA based channels. Control signals are transferred via AXI bus interfaces. Experimental results show that RapidPath is able to achieve up to 43.75X speedup at 128 nodes, compared to the software execution (without cache) on Xilinx Zynq board. 
Furthermore, hardware cost and overheads reveal that the RapidPath architecture can achieve high speedup with insignificant cost.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132966477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote Speech","authors":"S. Neuendorffer","doi":"10.1145/3251649","DOIUrl":"https://doi.org/10.1145/3251649","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"461 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122163945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource-Aware Throughput Optimization for High-Level Synthesis","authors":"Peng Li, Peng Zhang, L. Pouchet, J. Cong","doi":"10.1145/2684746.2689065","DOIUrl":"https://doi.org/10.1145/2684746.2689065","url":null,"abstract":"With the emergence of robust high-level synthesis tools to automatically transform codes written in high-level languages into RTL implementations, the programming productivity when synthesising accelerators improves significantly. However, although the state-of-the-art high-level synthesis tools can offer high-quality designs for simple nested loop kernels, there is still a significant performance gap between the synthesized and the optimal design for real world complex applications with multiple loops. In this work we first demonstrate that maximizing the throughput of each individual loop is not always the most efficient approach to achieving the maximum system-level throughput. More area efficient non-fully pipelined design variants may outperform the fully-pipelined version by enabling larger degrees of parallelism. We develop an algorithm to determine the optimal resource usage and initiation intervals for each loop in the applications to achieve maximum throughput within a given area budget. 
We report experimental results on eight applications, showing an average of 31% performance speedup over state-of-the-art HLS solutions.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125941388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Customizable and High Performance Matrix Multiplication Kernel on FPGA (Abstract Only)","authors":"Jie Wang, J. Cong","doi":"10.1145/2684746.2689147","DOIUrl":"https://doi.org/10.1145/2684746.2689147","url":null,"abstract":"Matrix multiplication (MM) is an important kernel in many application domains, including scientific computing, image processing, machine learning, etc. Numerous accelerator designs have been proposed for higher throughput and energy efficiency. In this paper we present a customizable FPGA accelerator of matrix multiplication. We also develop a design automation flow to generate the optimal design configuration with the highest throughput given the matrix size and target FPGA platform. It can be integrated with HLS tools as a basic parameterizable library component. Experiments show that for 512×512 single precision MM, we can achieve as high as 358 GFLOPs on the Xilinx Virtex-7 XC7VX485T-2, which outperforms any published state-of-the-art FPGA accelerator design by at least 28.3%.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127447705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"200 MS/s ADC implemented in a FPGA employing TDCs","authors":"H. Homulle, F. Regazzoni, E. Charbon","doi":"10.1145/2684746.2689070","DOIUrl":"https://doi.org/10.1145/2684746.2689070","url":null,"abstract":"Analog signals are used in many applications and systems, such as cyber physical systems, sensor networks and automotive applications. These are also applications where the use of FPGAs is continuously growing. To date, however there is no direct integration between FPGAs, which are digital, and the analog world (except for the newest generation of FPGAs). Currently, an external analog-to-digital converter (ADC) has to be added to the system, thus limiting its overall compactness and flexibility. To address this issue we propose a novel architecture implementing a high speed ADC in reconfigurable devices. The system exploits picosecond resolution time-to-digital converters (TDCs) to reach a conversion as fast as its clock speed. The resulting analog-through-time-to-digital converter (ATDC) can achieve a sampling rate of 200 MS/s with a 7 bit resolution for signals ranging from 0 to 2.5 V. Except for the external resistor needed for the analog reference ramp, the system is fully integrated inside the target FPGA. 
Moreover, our design can be easily scaled for multichannel ADCs, proving the suitability of reconfigurable devices for applications requiring a deep integration between analog and digital world.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122990766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Coefficient Address Generation Algorithm for Split-Radix FFT (Abstract Only)","authors":"Z. Qian, M. Margala","doi":"10.1145/2684746.2689134","DOIUrl":"https://doi.org/10.1145/2684746.2689134","url":null,"abstract":"Split-Radix Fast Fourier Transform (SRFFT) has the lowest number of arithmetic operations among all the FFT algorithms. Since arithmetic operations dramatically contribute to the dynamic power consumption, SRFFT is an ideal candidate for the implementation of a low power FFT processor. In the design of such processors, an efficient addressing scheme for FFT data as well as coefficients is required. The signal flow graph of split-radix algorithm is the same as radix-2 FFT except for the location and value of coefficients, therefore conventional radix-2 FFT data address generation scheme could also be applied to SRFFT. However, the mixed radix property of SRFFT algorithm leads to irregular locations of coefficients and forbids any conventional address generation algorithm. This paper presents a novel coefficient address generation algorithm for shared-memory based SRFFT processor. The core part of the proposed algorithm is to use two control variables to track trivial and non-trivial multiplications. We found the relationship between the value of the control variables and the butterfly and pass counter. The corresponding hardware implementation is simple consisting of a shift register and a dual port RAM bank. 
Compared to look-up table approach, which pre-computes the addresses of all coefficients and stores the addresses in memory units, the proposed algorithm is scalable and only requires small amount of memory to find the correct addresses of coefficients.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131712523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expanding OpenFlow Capabilities with Virtualized Reconfigurable Hardware","authors":"Stuart Byma, Naif Tarafdar, T. Xu, H. Bannazadeh, A. Leon-Garcia, P. Chow","doi":"10.1145/2684746.2689086","DOIUrl":"https://doi.org/10.1145/2684746.2689086","url":null,"abstract":"We present a novel method of using cloud-based virtualized reconfigurable hardware to enhance the functionality of OpenFlow Software-Defined Networks. OpenFlow is a capable and popular SDN implementation, but when users require new or unsupported packet-processing, software processing in the OpenFlow controller cannot provide multi-gigabit rates. Our method sees packet flows redirected through virtualized hardware with custom-designed packet-processing engines that can add new capabilities to an OpenFlow network, while retaining line-rate processing. A case study shows this can be achieved with virtually no loss in throughput and minimal latency overheads.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131730136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 1: Computer-aided Design","authors":"H. Schmit","doi":"10.1145/3251650","DOIUrl":"https://doi.org/10.1145/3251650","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114820664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Silicon Verification using High-Level Design Tools (Abstract Only)","authors":"Tomasz S. Czajkowski","doi":"10.1145/2684746.2689131","DOIUrl":"https://doi.org/10.1145/2684746.2689131","url":null,"abstract":"Modern FPGAs comprise ever more complex blocks to enable a wide variety of customer applications. Verification of the complex blocks can be a time consuming process, especially at the late stages of the release cycle. A key challenge is the time it takes to create circuits that can run on a target device to test a given block. This paper demonstrates how High-Level Design tools, such as Altera SDK for OpenCL, can be utilized to aid in this work to verify the operation of complex hardened blocks. As a proof of concept, we present the methodology used to verify the correctness of hardened single-precision floating point adder, subtractor and multiplier units on Altera Arria 10 FPGA in a single day. Each design comprised an instance of a hardened floating point unit, either an adder, subtractor or a multiplier, and a functional equivalent thereof implemented purely using Lookup Tables (LUTs). Both the hardened module instance and the LUT implementation were generated from OpenCL description using Altera SDK for OpenCL. The results for each computation were compared between the two implementations and any single discrepancy constituted a test failure. To simplify the test, the I/O for each design comprised LEDs (for pass/fail/running/done status) and two switches -- start and reset. The test designs for the adder, subtractor and multiplier were all written in OpenCL; the compilation of each test design took approximately 30 minutes. Each design tested 4 billion test vectors, generated on-chip using a Mersenne Twister, and each test completed within 30 seconds. 
All tests passed verification in hardware.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"110 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123335735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}