{"title":"Accelerator compiler for the VENICE vector processor","authors":"Zhiduo Liu, Aaron Severance, Satnam Singh, G. Lemieux","doi":"10.1145/2145694.2145732","DOIUrl":"https://doi.org/10.1145/2145694.2145732","url":null,"abstract":"This paper describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for Microsoft Accelerator, a high-level data parallel library for C++ and C#. This allows us to automatically compile high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to hand-written VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 3.5GHz Intel Xeon W3690 processor, beating it in performance on four of six benchmarks by up to 3.2x.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"79 1","pages":"229-232"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84071093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiling high throughput network processors","authors":"M. Lavasani, Larry R. Dennison, Derek Chiou","doi":"10.1145/2145694.2145709","DOIUrl":"https://doi.org/10.1145/2145694.2145709","url":null,"abstract":"Gorilla is a methodology for generating FPGA-based solutions especially well suited for data parallel applications with fine grain irregularity. Irregularity simultaneously destroys performance and increases power consumption on many data parallel processors such as General Purpose Graphical Processor Units (GPGPUs). Gorilla achieves high performance and low power through the use of FPGA-tailored parallelization techniques and application-specific hardwired accelerators, processing engines, and communication mechanisms. Automatic compilation from a stylized C language and templates that define the hardware structure coupled with the intrinsic flexibility of FPGAs provide high performance, low power, and programmability. Gorilla's capabilities are demonstrated through the generation of a family of core-router network processors processing up to 100Gbps (200MPPS for 64B packets) supporting any mix of IPv4, IPv6, and Multi-Protocol Label Switching (MPLS) packets on a single FPGA with off-chip IP lookup tables. A 40Gbps version of that network processor was run with an embedded test rig on a Xilinx Virtex-6 FPGA, verifying for performance and correctness. Its measured power consumption is comparable to full custom, commercial network processors. In addition, it is demonstrated how Gorilla can be used to generate merged virtual routers, saving FPGA resources.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"13 1","pages":"87-96"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81407894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconfigurable architecture and automated design flow for rapid FPGA-based LDPC code emulation","authors":"Haoran Li, Youn Sung Park, Zhengya Zhang","doi":"10.1145/2145694.2145722","DOIUrl":"https://doi.org/10.1145/2145694.2145722","url":null,"abstract":"Multitude of design freedoms of LDPC codes and practical decoders require fast simulations. FPGA emulation is attractive but inaccessible due to its design complexity. We propose a library and script based approach to automate the construction of FPGA emulations. Code parameters and design parameters are programmed either during run time or by script in design time. We demonstrate the architecture and design flow using the LDPC codes for the latest wireless communication standards: each emulation model was auto-constructed within one minute and the peak emulation throughput reached 3.8 Gb/s on a BEE3 platform.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"167-170"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81020244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-RR: an enhanced FPGA architecture with RRAM-based reconfigurable interconnects (abstract only)","authors":"J. Cong, Bingjun Xiao","doi":"10.1145/2145694.2145751","DOIUrl":"https://doi.org/10.1145/2145694.2145751","url":null,"abstract":"In this study, we explore the use of Resistive RAMs (RRAMs) as candidates for programmable interconnects in FPGAs. An RRAM cell can be programmed between high resistance state and low resistance state, with an on/off ratio close to MOSFET. It provides an opportunity to use an RRAM as a routing switch at a much smaller area cost than its CMOS counterpart. RRAMs can be fabricated over CMOS circuits using CMOS-compatible processes to have a more compact gate array. Our recent work (presented in NanoArch'2011) demonstrated significant potential of area, delay, and power reduction from using RRAMs in FPGAs. But some design problems remain open. The programming of RRAM switches integrated in interconnects is one important problem. We show that the high-level architecture of programming circuits for RRAM switches should be modified to avoid potential logic hazard. Also the programming cells used in previous works have an area overhead even larger than RRAM itself. We manage to reduce this overhead significantly with utilization of the non-arbitrary pattern of RRAM integration in FPGA interconnects. In addition we suggest a novel buffering solution for FPGA interconnects in light of the low area cost of RRAM-based routing switch. We propose on-demand buffer insertion, where buffers can be connected to interconnects via RRAMs to dynamically reflect the demand of the netlist to map onto FPGA. Compared to conventional buffering solution which are pre-determined during fabrication and can only be optimized for general case, our solution shows further area savings and performance improvement. The resulting FPGA architecture using RRAM for programmable interconnects is named FPGA-RR. We provide a complete CAD flow for FPGA-RR.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"57 1","pages":"268"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76523478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cycle-accurate, cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation","authors":"S. Asaad, Ralph Bellofatto, B. Brezzo, C. Haymes, M. Kapur, B. Parker, Thomas Roewer, P. Saha, T. Takken, J. Tierno","doi":"10.1145/2145694.2145720","DOIUrl":"https://doi.org/10.1145/2145694.2145720","url":null,"abstract":"Software based tools for simulation are not keeping up with the demands for increased chip and system design complexity. In this paper, we describe a cycle-accurate and cycle-reproducible large-scale FPGA platform that is designed from the ground up to accelerate logic verification of the Bluegene/Q compute node ASIC, a multi-processor SOC implemented in IBM's 45 nm SOI CMOS technology. This paper discusses the challenges for constructing such large-scale FPGA platforms, including design partitioning, clocking & synchronization, and debugging support, as well as our approach for addressing these challenges without sacrificing cycle accuracy and cycle reproducibility. The resulting fullchip simulation of the Bluegene/Q compute node ASIC runs at a simulated processor clock speed of 4 MHz, over 100,000 times faster than the logic level software simulation of the same design. The vast increase in simulation speed provides a new capability in the design cycle that proved to be instrumental in logic verification as well as early software development and performance validation for Bluegene/Q.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"153-162"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90388928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A real-time stereo vision system using a tree-structured dynamic programming on FPGA","authors":"Minxi Jin, T. Maruyama","doi":"10.1145/2145694.2145698","DOIUrl":"https://doi.org/10.1145/2145694.2145698","url":null,"abstract":"Many hardware systems for stereo vision have been proposed. Their processing speed is very fast, but the algorithms used in them are limited in order to achieve the high processing speed by simplifying the sequences of the memory accesses and operations. The error rates by them can not compete with those by software programs. In this paper, we describe an FPGA implementation of a tree-structured dynamic programming algorithm. The computational complexity of this algorithm is higher than those by previous hardware systems, but the processing speed of our system is still fast enough for real-time applications, and its error rate is competitive with software algorithms.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"21-24"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73468018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL memory infrastructure for FPGAs (abstract only)","authors":"S. Chin, P. Chow","doi":"10.1145/2145694.2145756","DOIUrl":"https://doi.org/10.1145/2145694.2145756","url":null,"abstract":"Programming models assist developers in creating high performance computing systems by forming a higher level abstraction of the target platform. OpenCL has emerged as a standard programming model for heterogeneous systems and there has been recent activity combining OpenCL and FPGAs. This work introduces memory infrastructure for FPGAs and is designed for OpenCL style computation, complementing previous work. An Aggregating Memory Controller is implemented in hardware and aims to maximize bandwidth to external, large, high-latency, high-bandwidth memories by finding the minimal number of external memory burst requests from a vector of requests. A template processing array with soft-processor and hand-coded hardware elements was also designed to drive the memory controller. The Aggregating Memory Controller is described in terms of operation and future scalability and the created processing array is described as a flexible structure that can support many types of processing solutions. A hardware prototype of the memory controller and processing array was implemented on a Virtex-5 LX110T FPGA. Two micro-benchmarks were run on both the soft-processor elements and the hand-coded hardware cores to exercise the memory controller. Results for effective memory bandwidth within the system show that the high-latency can be hidden using the Aggregating Memory Controller by increasing the number of threads within the processing array.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"269-270"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77722925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A configurable architecture to limit wakeup current in dynamically-controlled power-gated FPGAs","authors":"A. Bsoul, S. Wilton","doi":"10.1145/2145694.2145737","DOIUrl":"https://doi.org/10.1145/2145694.2145737","url":null,"abstract":"A dynamically-controlled power-gated (DCPG) FPGA architecture has recently been proposed to reduce static energy dissipation during idle periods. During a power mode transition from an off state to on state, the wakeup current drawn from power supplies causes a voltage droop on the power distribution network of a device. If not handled appropriately, this current and the associated voltage droop could cause malfunction of the design and/or the device. In DCPG FPGAs, the amount of wakeup current is not known beforehand as the structures of power-gated modules are application dependent; thus, a configurable solution is required to handle wakeup current. In this paper we propose a programmable wakeup architecture for DCPG FPGAs. The proposed solution has two levels: a fixed intra-region level and a configurable inter-region level. The architecture ensures that a power-gated module can be turned on such that the wakeup current constraints are not violated. We study the area and power overheads of the proposed solution. Our results show that the area overhead of the proposed inrush current limiting architecture is less than 2% for a power gating region of size 3x3 or 4x4 tiles, and the leakage power saved is more than 85% in a region of size 4x4 tiles.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"99 1","pages":"245-254"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80961522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CONNECT: re-examining conventional wisdom for designing nocs in the context of FPGAs","authors":"Michael Papamichael, J. Hoe","doi":"10.1145/2145694.2145703","DOIUrl":"https://doi.org/10.1145/2145694.2145703","url":null,"abstract":"An FPGA is a peculiar hardware realization substrate in terms of the relative speed and cost of logic vs. wires vs. memory. In this paper, we present a Network-on-Chip (NoC) design study from the mindset of NoC as a synthesizable infrastructural element to support emerging System-on-Chip (SoC) applications on FPGAs. To support our study, we developed CONNECT, an NoC generator that can produce synthesizable RTL designs of FPGA-tuned multi-node NoCs of arbitrary topology. The CONNECT NoC architecture embodies a set of FPGA-motivated design principles that uniquely influence key NoC design decisions, such as topology, link width, router pipeline depth, network buffer sizing, and flow control. We evaluate CONNECT against a high-quality publicly available synthesizable RTL-level NoC design intended for ASICs. Our evaluation shows a significant gain in specializing NoC design decisions to FPGAs' unique mapping and operating characteristics. For example, in the case of a 4x4 mesh configuration evaluated using a set of synthetic traffic patterns, we obtain comparable or better performance than the state-of-the-art NoC while reducing logic resource cost by 58%, or alternatively, achieve 3-4x better performance for approximately the same logic resource usage. Finally, to demonstrate CONNECT's flexibility and extensive design space coverage, we also report synthesis and network performance results for several router configurations and for entire CONNECT networks.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"37-46"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90178568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Securing netlist-level FPGA design through exploiting process variation and degradation","authors":"J. Zheng, M. Potkonjak","doi":"10.1145/2145694.2145716","DOIUrl":"https://doi.org/10.1145/2145694.2145716","url":null,"abstract":"The continuously widening gap between the Non-Recurring Engineering (NRE) and Recurring Engineering (RE) costs of producing Integrated Circuit (IC) products in the past few decades gives high incentives to unauthorized cloning and reverse-engineering of ICs. Existing IC Digital Rights Management (DRM) schemes often demand high overhead in area, power, and performance, or require non-volatile storage. Our goal is to develop a novel Intellectual Property (IP) protection technique that offers universal protection to both Application-Specific Integrated Circuits (ASICs) and Field-Programmable Gate-Arrays (FPGAs) from unauthorized manufacturing and reverse engineering. In this paper we show a proof-of-concept implementation of the basic elements of the technique, as well as a case study of applying the anti-cloning technique to a nontrivial FPGA design.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"10 1","pages":"129-138"},"PeriodicalIF":0.0,"publicationDate":"2012-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90306345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}