{"title":"Session details: Technical Session 7: Circuit Design","authors":"Vaughn Betz","doi":"10.1145/3251657","DOIUrl":"https://doi.org/10.1145/3251657","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127848490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Designer's Day Session 3","authors":"P. Lysaght","doi":"10.1145/3251648","DOIUrl":"https://doi.org/10.1145/3251648","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133749480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Effective Memory Architecture to Achieve Flexible Configuration and Efficient Data Transmission for Coarse-Grained Reconfigurable Array (Abstract Only)","authors":"Chen Yang, Leibo Liu, S. Yin, Shaojun Wei","doi":"10.1145/2684746.2689103","DOIUrl":"https://doi.org/10.1145/2684746.2689103","url":null,"abstract":"The memory architecture has a significant effect on the flexibility and performance of a coarse-grained reconfigurable array (CGRA), which can be restrained due to configuration overhead and large latency of data transmission. Multi-context structure and data preloading method are widely used in popular CGRAs as a solution to bandwidth bottlenecks of context and data. However, these two schemes cannot balance the computing performance, area overhead, and flexibility. This paper proposed group-based context cache and multi-level data memory architectures to alleviate the bottleneck problems. The group-based context cache was designed to dynamically transfer and buffer context inside CGRA in order to relieve the off-chip memory access for contexts at runtime. The multi-level data memory was designed to add data memories to different CGRA hierarchies, which were used as data buffers for reused input data and intermediate data. The proposed memory architectures are efficient and cost-effective so that performance improvement can be achieved at the cost of minor area overhead. Experiments of H.264 video decoding program and scale invariant feature transform algorithm achieved performance improvements of 19% and 23%, respectively. Further, the complexity of the applications running on CGRA is no longer restricted by the capacity of the on-chip context memory, thereby achieving flexible configuration for CGRA. The memory architectures proposed in this paper were based on a generic CGRA architecture derived from the characteristics found in the majority of existing popular CGRAs. As such, they can be applied to universal CGRAs.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132956640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Efficiency of Ring Oscillator-Based Temperature Sensor Networks on FPGAs (Abstract Only)","authors":"Navid Rahmanikia, A. Amiri, Hamid Noori, Farhad Mehdipour","doi":"10.1145/2684746.2689104","DOIUrl":"https://doi.org/10.1145/2684746.2689104","url":null,"abstract":"Due to technology advances and complexity of designs, thermal issue is a bottleneck in electronics designs. Various dynamic thermal management techniques have been proposed to address this issue. To effectively apply thermal management techniques, providing an accurate thermal map of chips is highly required. For this goal, a network of temperature sensors ought to be provided. There are various implementations for temperature sensors and network of sensors on Field Programmable Gate Arrays (FPGAs). This work defines and formulates four metrics and criteria, in terms of area, thermal, and power overheads and thermal map accuracy for exploring and evaluating efficiency of different implementations of Ring Oscillator-based Temperature Sensor (ROTS) networks on FPGAs and reports the comparison results for 12 networks with various sensor configurations. According to our metrics and experiments, the sensor that it is composed of NOT gates with open latches and RNS ring counter has lower thermal and power overheads compared to other configurations. Moreover, in this work, a new ROTS is presented that occupies 25% less resources than the most compact temperature sensor. Also, it provides 1.72 times higher sensitivity than the best sensitive ROTS design.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125050809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 3: Architecture 1","authors":"Jonathan Rose","doi":"10.1145/3251652","DOIUrl":"https://doi.org/10.1145/3251652","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126184236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Wave Digital Filter based Analog Circuit Emulation on FPGA (Abstract Only)","authors":"Wei Wu, P. Gu, Yen-Lung Chen, C. Liu, S. Pamarti, Chang Wu, Lei He","doi":"10.1145/2684746.2689143","DOIUrl":"https://doi.org/10.1145/2684746.2689143","url":null,"abstract":"Software simulation of analog and mixed-signal circuits often takes a long computing time. Unlike digital circuits that can be validated by FPGA emulation, there is no winning emulation solution for analog circuits. As the first step to applying wave digital filter (WDF) to emulate post-layout analog circuits, we present how to map linear and nonlinear components in an original circuit to WDFs with exactly same behaviors. To validate, we implement the emulation circuit (i.e., WDFs) in FPGA. To be more specific, each emulation time step is executed as a finite state machine, while all the computing resource, e.g. floating point units (FPU), are shared as a resource pool and used only when it is necessary, which result in a very small resource consumption on FPGA. Virtually perfect match is obtained between the Verilog and SPICE simulations for a number of primitive analog circuits, indicating the high accuracy of the proposed emulation. In terms of runtime, the WDF implementation is about 3-4x faster than HSPICE on a small two-stage differential amplifier circuit. And better speedup can be anticipated when it scales to larger circuits because of the underlying binary tree structure of the WDF implementation.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126737803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping-Aware Constrained Scheduling for LUT-Based FPGAs","authors":"Mingxing Tan, Steve Dai, Udit Gupta, Zhiru Zhang","doi":"10.1145/2684746.2689063","DOIUrl":"https://doi.org/10.1145/2684746.2689063","url":null,"abstract":"Scheduling plays a central role in high-level synthesis, as it inserts clock boundaries into the untimed behavioral model and greatly impacts the performance, power, and area of the synthesized circuits. While current scheduling techniques can make use of pre-characterized delay values of individual operations, it is difficult to obtain accurate timing estimation on a cluster of operations without considering technology mapping. This limitation is particularly pronounced for FPGAs where a large logic network can be mapped to only a few levels of look-up tables (LUT). In this paper, we propose MAPS, a mapping-aware constrained scheduling algorithm for LUT-based FPGAs. Instead of simply summing up the estimated delay values of individual operations, MAPS jointly performs technology mapping and scheduling, creating the opportunity for more aggressive operation chaining to minimize latency and reduce area. We show that MAPS can produce a latency-optimal solution, while supporting a variety of design timing requirements expressed in a system of difference constraints. We also present an efficient incremental scheduling technique for MAPS to effectively handle resource constraints. Experimental results with real-life benchmarks demonstrate that our proposed algorithm achieves very promising improvements in performance and resource usage when compared to a state-of-the-art commercial high-level synthesis tool targeting Xilinx FPGAs.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128676522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel And Scalable Multi-FPGA based Architecture for High Performance Applications (Abstract Only)","authors":"V. Viswanathan, R. B. Atitallah, J. Dekeyser, Benjamin Nakache, M. Nakache","doi":"10.1145/2684746.2689115","DOIUrl":"https://doi.org/10.1145/2684746.2689115","url":null,"abstract":"Several industrial applications are becoming highly sophisticated and distributed as they capture and process real-time data from several sources at the same time. Furthermore, availability of acquisition channels such as I/O interfaces per FPGA, also dictates how applications are partitioned over several devices. Thus computationally intensive, resource consuming functions are implemented on multiple hardware accelerators, making low-latency communication to be a crucial factor. In such applications, communication between multiple devices means using high-speed point-to-point protocols with little flexibility in terms of communication scalability. The problem with the current systems is that, they are usually built to meet the needs of a specific application, i.e., lacks flexibility to change the communication topology or upgrade hardware resources. This leads to obsolescence, hardware redesign cost, and also wastes computing power. Taking this into consideration, we propose a scalable, modular and customizable computing platform, with a parallel full-duplex communication network, that redefines the computation and communication paradigm in such applications. We have implemented a scalable distributed secure H.264 encoding application with 3 channels over 3 customizable FPGA modules. In a distributed architecture, the inter-FPGA communication time is almost completely overshadowed by the overall execution time for bigger data-sets, and is comparable to the overall execution time of a non-distributed architecture, for the same implementation scaled down to 1 channel for 1 FPGA. This makes our architecture highly scalable and suitable for high-performance streaming applications. With 3 detachable FPGA modules, each sending and receive data simultaneously at 3 GB/s each, we measured the total net unidirectional traffic at any given time in the system is 9 GB/s, making the total net bidirectional bandwidth for 6 modules to be 36 GB/s.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128541791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA Implementation of Multi-stream Tracking Hardware using 2D SIMD Array (Abstract Only)","authors":"R. Takasu, Yoichi Tomioka, Takashi Aoki, H. Kitazawa","doi":"10.1145/2684746.2689119","DOIUrl":"https://doi.org/10.1145/2684746.2689119","url":null,"abstract":"Worldwide, many surveillance systems are in operation for crime deterrence purposes. An effective system should be characterized by requiring low-power consumption, a small storage capacity, and little human effort. Multi-stream tracking on field programmable gate array (FPGA) is important for such surveillance systems. In this paper, we propose multi-stream tracking hardware that can extract moving objects and their motion vectors from a multi-stream received from 64 cameras in real time. The key technology for multi-stream processing is as follows. (1) In order to avoid maintaining the background, we apply a frame difference method. Moreover, the flows of object are calculated by block matching. The flows are effective for analyzing human motion. (2) In order to avoid a bus bottleneck and memory contention in the communication between processing elements (PEs), synchronous shift data transfer (SSDT), which transfers data in the same direction for all PEs, is applied. In this paper, an extended SSDT is proposed for communication between PEs when multi-blocks are processed in one PE. (3) C++ based integrated control code development tool is shown. Control code written in C++ language can easily be assembled and verified by the tool. We implemented the proposed hardware on a Stratix V 5SGXEA7K2F40C2N device. The operating frequency is 50 MHz and the average number of clocks for processing a set of four frames of QVGA images is 394k clocks. The proposed hardware achieved 520 fps, and can process multi-stream video from 64 cameras. The execution time on 3.4 GHz Core i7-3770 CPU was 8.4 fps. Therefore, the proposed hardware was about 62 times faster than that CPU.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129995471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Designer's Day Session 1","authors":"Satwant Singh","doi":"10.1145/3251646","DOIUrl":"https://doi.org/10.1145/3251646","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124319664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}