{"title":"Accelerator compiler for the VENICE vector processor","authors":"Zhiduo Liu, Aaron Severance, Satnam Singh, G. Lemieux","doi":"10.1145/2145694.2145732","DOIUrl":"https://doi.org/10.1145/2145694.2145732","url":null,"abstract":"This paper describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for Microsoft Accelerator, a high-level data parallel library for C++ and C#. This allows us to automatically compile high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to hand-written VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 3.5GHz Intel Xeon W3690 processor, beating it in performance on four of six benchmarks by up to 3.2x.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"79 1","pages":"229-232"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84071093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiling high throughput network processors","authors":"M. Lavasani, Larry R. Dennison, Derek Chiou","doi":"10.1145/2145694.2145709","DOIUrl":"https://doi.org/10.1145/2145694.2145709","url":null,"abstract":"Gorilla is a methodology for generating FPGA-based solutions especially well suited for data parallel applications with fine grain irregularity. Irregularity simultaneously destroys performance and increases power consumption on many data parallel processors such as General Purpose Graphical Processor Units (GPGPUs). Gorilla achieves high performance and low power through the use of FPGA-tailored parallelization techniques and application-specific hardwired accelerators, processing engines, and communication mechanisms. Automatic compilation from a stylized C language and templates that define the hardware structure coupled with the intrinsic flexibility of FPGAs provide high performance, low power, and programmability. Gorilla's capabilities are demonstrated through the generation of a family of core-router network processors processing up to 100Gbps (200MPPS for 64B packets) supporting any mix of IPv4, IPv6, and Multi-Protocol Label Switching (MPLS) packets on a single FPGA with off-chip IP lookup tables. A 40Gbps version of that network processor was run with an embedded test rig on a Xilinx Virtex-6 FPGA, verifying for performance and correctness. Its measured power consumption is comparable to full custom, commercial network processors. In addition, it is demonstrated how Gorilla can be used to generate merged virtual routers, saving FPGA resources.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"13 1","pages":"87-96"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81407894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconfigurable architecture and automated design flow for rapid FPGA-based LDPC code emulation","authors":"Haoran Li, Youn Sung Park, Zhengya Zhang","doi":"10.1145/2145694.2145722","DOIUrl":"https://doi.org/10.1145/2145694.2145722","url":null,"abstract":"Multitude of design freedoms of LDPC codes and practical decoders require fast simulations. FPGA emulation is attractive but inaccessible due to its design complexity. We propose a library and script based approach to automate the construction of FPGA emulations. Code parameters and design parameters are programmed either during run time or by script in design time. We demonstrate the architecture and design flow using the LDPC codes for the latest wireless communication standards: each emulation model was auto-constructed within one minute and the peak emulation throughput reached 3.8 Gb/s on a BEE3 platform.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"167-170"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81020244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-RR: an enhanced FPGA architecture with RRAM-based reconfigurable interconnects (abstract only)","authors":"J. Cong, Bingjun Xiao","doi":"10.1145/2145694.2145751","DOIUrl":"https://doi.org/10.1145/2145694.2145751","url":null,"abstract":"In this study, we explore the use of Resistive RAMs (RRAMs) as candidates for programmable interconnects in FPGAs. An RRAM cell can be programmed between high resistance state and low resistance state, with an on/off ratio close to MOSFET. It provides an opportunity to use an RRAM as a routing switch at a much smaller area cost than its CMOS counterpart. RRAMs can be fabricated over CMOS circuits using CMOS-compatible processes to have a more compact gate array. Our recent work (presented in NanoArch'2011) demonstrated significant potential of area, delay, and power reduction from using RRAMs in FPGAs. But some design problems remain open. The programming of RRAM switches integrated in interconnects is one important problem. We show that the high-level architecture of programming circuits for RRAM switches should be modified to avoid potential logic hazard. Also the programming cells used in previous works have an area overhead even larger than RRAM itself. We manage to reduce this overhead significantly with utilization of the non-arbitrary pattern of RRAM integration in FPGA interconnects. In addition we suggest a novel buffering solution for FPGA interconnects in light of the low area cost of RRAM-based routing switch. We propose on-demand buffer insertion, where buffers can be connected to interconnects via RRAMs to dynamically reflect the demand of the netlist to map onto FPGA. Compared to conventional buffering solution which are pre-determined during fabrication and can only be optimized for general case, our solution shows further area savings and performance improvement. The resulting FPGA architecture using RRAM for programmable interconnects is named FPGA-RR. We provide a complete CAD flow for FPGA-RR.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"57 1","pages":"268"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76523478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation","authors":"S. Asaad, Ralph Bellofatto, B. Brezzo, C. Haymes, M. Kapur, B. Parker, Thomas Roewer, P. Saha, T. Takken, J. Tierno","doi":"10.1145/2145694.2145720","DOIUrl":"https://doi.org/10.1145/2145694.2145720","url":null,"abstract":"Software based tools for simulation are not keeping up with the demands for increased chip and system design complexity. In this paper, we describe a cycle-accurate and cycle-reproducible large-scale FPGA platform that is designed from the ground up to accelerate logic verification of the Bluegene/Q compute node ASIC, a multi-processor SOC implemented in IBM's 45 nm SOI CMOS technology. This paper discusses the challenges for constructing such large-scale FPGA platforms, including design partitioning, clocking & synchronization, and debugging support, as well as our approach for addressing these challenges without sacrificing cycle accuracy and cycle reproducibility. The resulting fullchip simulation of the Bluegene/Q compute node ASIC runs at a simulated processor clock speed of 4 MHz, over 100,000 times faster than the logic level software simulation of the same design. The vast increase in simulation speed provides a new capability in the design cycle that proved to be instrumental in logic verification as well as early software development and performance validation for Bluegene/Q.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"153-162"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90388928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Limit study of energy & delay benefits of component-specific routing","authors":"Nikil Mehta, Raphael Rubin, A. DeHon","doi":"10.1145/2145694.2145710","DOIUrl":"https://doi.org/10.1145/2145694.2145710","url":null,"abstract":"As feature sizes scale toward atomic limits, parameter variation continues to increase, leading to increased margins in both delay and energy. The possibility of very slow devices on critical paths forces designers to increase transistor sizes, reduce clock speed and operate at higher voltages than desired in order to meet timing. With post-fabrication configurability, FPGAs have the opportunity to use slow devices on non-critical paths while selecting fast devices for critical paths. To understand the potential benefit we might gain from component-specific mapping, we quantify the margins associated with parameter variation in FPGAs over a wide range of predictive technologies (45nm-12nm) and gate sizes and show how these margins can be significantly reduced by delay-aware, component-specific routing. For the Toronto 20 benchmark set, we show that component-specific routing can eliminate delay margins induced by variation and reduce energy for energy minimal designs by 1.42-1.98×. We further show that these benefits increase as technology scales.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"34 1","pages":"97-106"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73327494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thermal-aware logic block placement for 3D FPGAs considering lateral heat dissipation (abstract only)","authors":"Juinn-Dar Huang, Ya-Shih Huang, Mi-Yu Hsu, Han-Yuan Chang","doi":"10.1145/2145694.2145749","DOIUrl":"https://doi.org/10.1145/2145694.2145749","url":null,"abstract":"Three-dimensional (3D) integration is an attractive and promising technology to keep Moore's Law alive, whereas the thermal issue also presents a critical challenge for 3D integrated circuits. Meanwhile, accurate thermal analysis is very time-consuming and thus can hardly be incorporated into most of placement algorithms generally performing numerous iterative refinement steps. As a consequence, in this paper, we first present a fine-grained grid-based thermal model for the 3D regular FPGA architecture and also highlight that lateral heat dissipation paths can no longer be assumed negligible. Then we propose two fast thermal-aware placement algorithms for 3D FPGAs, Standard Deviation (SD) and MineSweeper (MS), in which rapid thermal evaluation instead of slow detailed analysis is utilized. Moreover, both take the lateral heat dissipation into consideration and focus on distributing heat sources more evenly within a layer in a 3D FPGA to avoid creating hotspots. Experimental results show that SD and MS achieve 12.1%/7.6% reduction in maximum temperature and 82%/56% improvement in temperature deviation compared with a classical thermal-unaware placement method only at the cost of minor increase in wirelength and delay. Moreover, MS merely consumes 4% more runtime for producing thermal-aware placement solutions.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"40 1","pages":"268"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73325653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A real-time stereo vision system using a tree-structured dynamic programming on FPGA","authors":"Minxi Jin, T. Maruyama","doi":"10.1145/2145694.2145698","DOIUrl":"https://doi.org/10.1145/2145694.2145698","url":null,"abstract":"Many hardware systems for stereo vision have been proposed. Their processing speed is very fast, but the algorithms used in them are limited in order to achieve the high processing speed by simplifying the sequences of the memory accesses and operations. The error rates by them can not compete with those by software programs. In this paper, we describe an FPGA implementation of a tree-structured dynamic programming algorithm. The computational complexity of this algorithm is higher than those by previous hardware systems, but the processing speed of our system is still fast enough for real-time applications, and its error rate is competitive with software algorithms.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"21-24"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73468018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL memory infrastructure for FPGAs (abstract only)","authors":"S. Chin, P. Chow","doi":"10.1145/2145694.2145756","DOIUrl":"https://doi.org/10.1145/2145694.2145756","url":null,"abstract":"Programming models assist developers in creating high performance computing systems by forming a higher level abstraction of the target platform. OpenCL has emerged as a standard programming model for heterogeneous systems and there has been recent activity combining OpenCL and FPGAs. This work introduces memory infrastructure for FPGAs and is designed for OpenCL style computation, complementing previous work. An Aggregating Memory Controller is implemented in hardware and aims to maximize bandwidth to external, large, high-latency, high-bandwidth memories by finding the minimal number of external memory burst requests from a vector of requests. A template processing array with soft-processor and hand-coded hardware elements was also designed to drive the memory controller. The Aggregating Memory Controller is described in terms of operation and future scalability and the created processing array is described as a flexible structure that can support many types of processing solutions. A hardware prototype of the memory controller and processing array was implemented on a Virtex-5 LX110T FPGA. Two micro-benchmarks were run on both the soft-processor elements and the hand-coded hardware cores to exercise the memory controller. Results for effective memory bandwidth within the system show that the high-latency can be hidden using the Aggregating Memory Controller by increasing the number of threads within the processing array.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"269-270"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77722925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental clustering applied to radar deinterleaving: a parameterized FPGA implementation","authors":"Scott Bailie, M. Leeser","doi":"10.1145/2145694.2145699","DOIUrl":"https://doi.org/10.1145/2145694.2145699","url":null,"abstract":"ICED (Incremental Clustering of Evolving Data) is a novel incremental clustering algorithm designed for data whose characteristics change over time. ICED is an unsupervised clustering technique that assumes no prior knowledge of the incoming data, and supports removing clusters that contain stale data. The user controls the FPGA implementation through a combination of compile time parameters (number of clusters) and run time parameters (distance threshold, fade cycle length). ICED has been applied to a radar application: pulse deinterleaving. ICED is the first implementation of incremental clustering on an FPGA of which we are aware. The implementation runs 39 times faster than an equivalent C implementation on a 3GHz Intel Xeon processor, and is capable of processing radar data in real time.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"346 1","pages":"25-28"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79667431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}