{"title":"Low-Resource Bluespec Design of a Modular Acquisition and Stimulation System for Neuroscience (Abstract Only)","authors":"Paulo Matias, R. T. Guariento, L. Almeida, J. Slaets","doi":"10.1145/2684746.2689137","DOIUrl":"https://doi.org/10.1145/2684746.2689137","url":null,"abstract":"We have compared two different resource arbitration architectures in our developed data acquisition and stimuli generator system for neuroscience research, entirely specified in a high-level Hardware Description Language (HDL). One of them was designed with a decoupled and latency insensitive modular approach, allowing for easier code reuse, while the other adopted a centralized scheme, constructed specifically for our application. The usage of a high-level HDL allowed straightforward and stepwise code modifications to transform one architecture into the other. Despite the logic complexity penalty of synthesizing our hardware from a highly abstract language, both architectures were implemented in a very small programmable logic device without even consuming all the hardware resources. While the decoupled design has shown more resilience to input activity bursts, the centralized one gave an economy of about 10-15% in the device logic element usage. 
This system is not only useful for neuroscience protocols that require timing determinism and synchronous stimuli generation, but has also demonstrated that high-level languages can be effectively used for synthesizing hardware in small programmable devices.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125201047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RapidSmith 2: A Framework for BEL-level CAD Exploration on Xilinx FPGAs","authors":"Travis Haroldsen, B. Nelson, B. Hutchings","doi":"10.1145/2684746.2689085","DOIUrl":"https://doi.org/10.1145/2684746.2689085","url":null,"abstract":"RapidSmith is an open-source framework that allows for the exploration of novel approaches to the FPGA CAD flow for Xilinx devices. However, RapidSmith has poor support for manipulating designs below the slice level. In this paper, we highlight many of the projects RapidSmith enables and present extensions incorporated into \"RapidSmith 2\" that expose LUTs and flip-flops for direct manipulation in custom-built CAD tools. To demonstrate the utility of RapidSmith 2 we present the results of work to identify BELs in a design which must be clustered together and a tool that does pre-packing clustering accordingly.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116985478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Technology Mapping into General Programmable Cells","authors":"A. Mishchenko, R. Brayton, Wenyi Feng, J. Greene","doi":"10.1145/2684746.2689082","DOIUrl":"https://doi.org/10.1145/2684746.2689082","url":null,"abstract":"Field-Programmable Gate Arrays (FPGA) implement logic functions using programmable cells, such as K-input lookup-tables (K-LUTs). A K-LUT can implement any Boolean function with K inputs and one output. Methods for mapping into K-LUTs are extensively researched and widely used. Recently, cells other than K-LUTs have been explored, for example, those composed of several LUTs and those combining LUTs with several gates. Known methods for mapping into these cells are specialized and complicated, requiring a substantial effort to evaluate custom cell architectures. This paper presents a general approach to efficiently map into single-output K-input cells containing LUTs, MUXes, and other elementary gates. Cells with up to 16 inputs can be handled. The mapper is fully automated and takes a logic network and a symbolic description of a programmable cell, and produces an optimized network composed of instances of the given cell. Past work on delay/area optimization during mapping is applicable and leads to good quality of results.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127335370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architecture of Reconfigurable-Logic Cell Array with Atom Switch: Cluster Size & Routing Fabrics (Abstract Only)","authors":"X. Bai, Y. Tsuji, A. Morioka, M. Miyamura, T. Sakamoto, M. Tada, N. Banno, K. Okamoto, N. Iguchi, H. Hada","doi":"10.1145/2684746.2689122","DOIUrl":"https://doi.org/10.1145/2684746.2689122","url":null,"abstract":"Emerging nonvolatile memories (NVMs) have a potential to overcome the issues in the conventional static random-access memory (SRAM) based reconfigurable logic cell arrays (RLCAs). Replacing a CMOS switch element composed of a SRAM and a pass transistor by a NVM reduces chip size. And non-volatility reduces the stand-by power. More importantly, the compactness of NVM allows fine-grain logic cells (small cluster size), which advantageously enables a highly efficient cell usage, resulting in compact circuit for applications. In this paper, we investigate the fine-grain cell architecture using atom switch which is one of the NVMs. We evaluate the effect of the cluster size and the segment length on the atom-switch-based RLCA to confirm the optimal point considering area-delay product. Cluster size is optimized to be 4, which is smaller than that in the conventional SRAM- and multiplexer-based RLCA. This optimization is originated from the fact that the inter-delay among clusters is only twice of the intra-delay in cluster for atom-switch-based RLCA with routing block formed by crossbar switches because of very small capacitance and resistance of atom switches. 
On the other hand, the segment length is optimized to be 4, which is the same as that in the conventional SRAM- and multiplexer-based RLCA.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125024538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MATCHUP: Memory Abstractions for Heap Manipulating Programs","authors":"F. Winterstein, Kermin Fleming, Hsin-Jung Yang, Samuel Bayliss, G. Constantinides","doi":"10.1145/2684746.2689073","DOIUrl":"https://doi.org/10.1145/2684746.2689073","url":null,"abstract":"Memory-intensive implementations often require access to an external, off-chip memory which can substantially slow down an FPGA accelerator due to memory bandwidth limitations. Buffering frequently reused data on chip is a common approach to address this problem and the optimization of the cache architecture introduces yet another complex design space. This paper presents a high-level synthesis (HLS) design aid that generates parallel application-specific multi-scratchpad architectures including on-chip caches. Our program analysis identifies non-overlapping memory regions, supported by private scratchpads, and regions which are shared by parallel units after parallelization and which are supported by coherent scratchpads and synchronization primitives. It also decides whether the parallelization is legal with respect to data dependencies. The novelty of this work is the focus on programs using dynamic, pointer-based data structures and dynamic memory allocation which, while common in software engineering, remain difficult to analyze and are beyond the scope of the overwhelming majority of HLS techniques to date. We demonstrate our technique with three case studies of applications using dynamically allocated data structures and use Xilinx Vivado HLS as an exemplary HLS tool. 
We show up to 10x speed-up after parallelization of the HLS implementations and the insertion of the application-specific distributed hybrid scratchpad architecture.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125190392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Area Optimization of Arithmetic Units by Component Sharing for FPGAs (Abstract Only)","authors":"S. Tang, G. Lemieux","doi":"10.1145/2684746.2689146","DOIUrl":"https://doi.org/10.1145/2684746.2689146","url":null,"abstract":"Floating-point implementation has been a hot topic in recent FPGA research. This paper describes a method to optimize the area of a combined floating-point and integer arithmetic unit on an FPGA by sharing the largest component in each operation. Specifically, the operations included are: addition, subtraction, multiplication, division, shift left/right, rotate left/right, as well as integer-to-floating-point and floating-point-to-integer conversion. The resource usage of the fused unit is compared with that of segregated units that are multiplexed. The results show a significant area reduction with minimal performance penalty.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126118446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks","authors":"Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, J. Cong","doi":"10.1145/2684746.2689060","DOIUrl":"https://doi.org/10.1145/2684746.2689060","url":null,"abstract":"Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating behavior of optic nerves in living creatures. Recently, rapid growth of modern applications based on deep learning algorithms has further improved research and implementations. Especially, various accelerators for deep CNN have been proposed based on FPGA platform because it has advantages of high performance, reconfigurability, and fast development round, etc. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve best performance due to under-utilization of either logic resource or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. In order to overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with best performance and lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. 
Our implementation achieves a peak performance of 61.62 GFLOPS under 100MHz working frequency, which outperforms previous approaches significantly.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"836 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116423197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ramethy: Reconfigurable Acceleration of Bisulfite Sequence Alignment","authors":"James Arram, W. Luk, P. Jiang","doi":"10.1145/2684746.2689066","DOIUrl":"https://doi.org/10.1145/2684746.2689066","url":null,"abstract":"This paper proposes a novel reconfigurable architecture for accelerating DNA sequence alignment. This architecture is applied to bisulfite sequence alignment, a stage in recently developed bioinformatics pipelines for cancer and non-invasive prenatal diagnosis. Alignment is currently the bottleneck in such pipelines, accounting for over 50% of the total analysis time. Our design, Ramethy (Reconfigurable Acceleration of METHYlation data analysis), performs alignment of short reads with up to two mismatches. Ramethy is based on the FM-index, which we optimise to reduce the number of search steps and improve approximate matching performance. We implement Ramethy on a 1U Maxeler MPC-X1000 data flow node consisting of 8 Altera Stratix-V FPGAs. Measured results show a 14.9 times speedup compared to soap2 running with 16 threads on dual Intel Xeon E5-2650 CPUs, and 3.8 times speedup compared to soap3-dp running on an NVIDIA GTX 580 GPU. Upper-bound performance estimates for the MPC-X1000 indicate a maximum speedup of 88.4 times and 22.6 times compared to soap2 and soap3-dp respectively. 
In addition to runtime, Ramethy consumes over an order of magnitude lower energy while having accuracy identical to soap2 and soap3-dp, making it a strong candidate for integration into bioinformatics pipelines.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122638311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rapid Prototyping of Wireless Physical Layer Modules Using Flexible Software/Hardware Design Flow","authors":"James Chacko, Cem Sahin, Douglas Pfiel, Nagarajan Kandasamy, K. Dandekar","doi":"10.1145/2684746.2689084","DOIUrl":"https://doi.org/10.1145/2684746.2689084","url":null,"abstract":"This paper describes a step-by-step approach to designing wireless physical layer modules, starting from a software implementation in MATLAB and ending with a hardware implementation using Xilinx SysGen and ModelSim. The described design flow promotes baseband physical layer research by providing high flexibility and speed to the process of module creation, verification, and deployment. The novelty introduced into our system lies within the flexible components created using this design flow, which enables on-the-fly modification of multiple parameters to suit various wireless protocols.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121971142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Implementation of Trained Coarse Carrier Frequency Offset Estimation and Correction for OFDM Signals (Abstract Only)","authors":"Marko Jacovic, James Chacko, Doug Pfeil, Nagarajan Kandasamy, K. Dandekar","doi":"10.1145/2684746.2689128","DOIUrl":"https://doi.org/10.1145/2684746.2689128","url":null,"abstract":"This paper develops an FPGA implementation of a trained coarse Carrier Frequency Offset estimation and correction scheme using MATLAB System Generator. The designed system is capable of supporting variable FFT sizes for Orthogonal Frequency Division Multiplexing signals and different pilot symbol structures making it compatible with a large number of wireless communication standards, unlike other work that is protocol specific. This design stands out from its more common implementations as it requires only one pilot symbol to be considered for synchronization by using a data-aided modified correlation scheme, allowing for an increase in throughput. The Bit Error Rate of the corrected signal received over an Additive White Gaussian Noise channel is compared to the case without correction. This scheme demonstrated increased performance throughput since only a single pilot symbol was used.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128236952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}