{"title":"Session details: Technical Session 2: Cooling and Clocking","authors":"P. Cheung","doi":"10.1145/3250860","DOIUrl":"https://doi.org/10.1145/3250860","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133752657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPRESSO: Enabling Express Transistor-Level Exploration of FPGA Architectures","authors":"Grace Zgheib, M. Lortkipanidze, Muhsen Owaida, D. Novo, P. Ienne","doi":"10.1145/2847263.2847280","DOIUrl":"https://doi.org/10.1145/2847263.2847280","url":null,"abstract":"In theory, tools like VTR---a retargetable toolchain mapping circuits onto easily-described hypothetical FPGA architectures---could play a key role in the development of wildly innovative FPGA architectures. In practice, however, the experiments that one can conduct with these tools are severely limited by the ability of FPGA architects to produce reliable delay and area models---these depend on transistor-level design techniques which require a different set of skills. In this paper, we introduce a novel approach, which we call Fpresso, to model the delay and area of a wide range of largely different FPGA architectures quickly and with reasonable accuracy. We take inspiration from the way a standard-cell flow performs large scale transistor-size optimization and apply the same concepts to FPGAs, only at a coarser granularity. Skilled users prepare for fpresso locally optimized libraries of basic components with a variety of driving strengths. Then, ordinary users specify arbitrary FPGA architectures as interconnects of basic components. This is globally optimized within minutes through an ordinary logic synthesis tool which chooses the most fitting version of each cell and adds buffers wherever appropriate. The resulting delay and area characteristics can be automatically used for VTR. Our results show that fpresso provides models that are on average within some 10-20% of those by a state-of-the-art FPGA optimization tool and is orders of magnitude faster. Although the modelling error may appear relatively high,we show that it seldom results in misranking a set of architectures, thus indicating a reasonable modeling faithfulness.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128604322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-throughput Architecture for Lossless Decompression on FPGA Designed Using HLS (Abstract Only)","authors":"Jie Lei, Yu-Ting Chen, Yunsong Li, J. Cong","doi":"10.1145/2847263.2847305","DOIUrl":"https://doi.org/10.1145/2847263.2847305","url":null,"abstract":"In the field of big data applications, lossless data compression and decompression can play an important role in improving the data center's efficiency in storage and distribution of data. To avoid becoming a performance bottleneck, they must be accelerated to have a capability of high speed data processing. As FPGAs begin to be deployed as compute accelerators in the data centers for its advantages of massive parallel customized processing capability, power efficiency and hardware reconfiguration. It is promising and interesting to use FPGAs for acceleration of data compression and decompression. The conventional development of FPGA accelerators using hardware description language costs much more design efforts than that of CPUs or GPUs. High level synthesis (HLS) can be used to greatly improve the design productivity. In this paper, we present a solution for accelerating lossless data decompression on FPGA by using HLS. With a pipelined data-flow structure, the proposed decompression accelerator can perform static Huffman decoding and LZ77 decompression at a very high throughput rate. According to the experimental results conducted on FPGA with the Calgary Corpus data benchmark, the average data throughput of the proposed decompression core achieves to 4.6 Gbps while running at 200 MHz.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"56 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132159070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ENFIRE: An Energy-efficient Fine-grained Spatio-temporal Reconfigurable Computing Fabric (Abstact Only)","authors":"Wenchao Qian, Christopher Babecki, Robert Karam, S. Bhunia","doi":"10.1145/2847263.2847325","DOIUrl":"https://doi.org/10.1145/2847263.2847325","url":null,"abstract":"Field Programmable Gate Arrays (FPGAs) are well-established as fine-grained hardware reconfigurable computing platforms. However, FPGA energy usage is dominated by programmable interconnects, which have poor scalability across different technology generations. In this work, we propose ENFIRE, a novel, energy-efficient, fine-grained, spatio-temporal, memory-based reconfigurable computing framework that provides the flexibility of bit-level information processing, which is not available in conventional coarse-grain reconfigurable architectures (CGRAs). A dense two-dimensional memory array is the main computing element in the proposed framework, which stores not only the data to be processed, but also the functional behavior of a mapped application in the form of lookup tables (LUTs) of various input/output sizes. Spatially distributed configurable computing elements (CEs) communicate with each other based on data dependencies using a mesh network, while execution inside each CE occurs in a temporal manner. A custom software framework has also been co-developed which enables application mapping to a set of CEs. By finding the right balance between spatial and temporal computing, it can achieve a highly energy-efficient mapping, significantly reducing the programmable interconnect overhead when compared with FPGA. Simulation results show an improvement of 7.6X in overall energy, 1.6X in energy efficiency, 1.1X in leakage energy, and 5.3X in Unified Energy-Efficiency, a metric that considers energy and area together, compared with comparable FPGA implementations for a set of random logic benchmarks.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133934497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Database Query Processing on OpenCL-based FPGAs (Abstract Only)","authors":"Ze-ke Wang, Hui Yan Cheah, Johns Paul, Bingsheng He, Wei Zhang","doi":"10.1145/2847263.2847295","DOIUrl":"https://doi.org/10.1145/2847263.2847295","url":null,"abstract":"The release of OpenCL support for FPGAs represents a significant improvement in extending database applications to the reconfigurable domain. Taking advantage of the programmability offered by the OpenCL HLS tool, an OpenCL database can be easily ported and re-designed for FPGAs. A single SQL query in these database systems usually consists of multiple operators, and each one of these operators in turn consists of multiple OpenCL kernels. Due to the specific properties of FPGAs, each OpenCL kernel can have different optimization combinations (in terms of CU and SIMD) which is critical to the overall performance of query processing. In this paper, we propose an efficient method to implement database operators on OpenCL-based FPGAs. We use a cost model to determine the optimum query plan for an input query. Our cost model has two components: unit cost and query plan generation. The unit cost component generates multiple (unit cost, resource utilization) pairs for each kernel. The query plan generation component employs a dynamic programming approach to generate the optimum query plan which consider the possibilities to use multiple FPGA images. The experiments show that 1) our cost model can accurately predict the performance of each feasible query plan for the input query, and is able to guide the generation of the optimum query plan, 2) our optimized query plan achieves a performance speedup 1.5X-4X over the state-of-the-art query processing on OpenCL-based FPGAs.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124588470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatically Optimizing the Latency, Area, and Accuracy of C Programs for High-Level Synthesis","authors":"Xitong Gao, John Wickerson, G. Constantinides","doi":"10.1145/2847263.2847282","DOIUrl":"https://doi.org/10.1145/2847263.2847282","url":null,"abstract":"Loops are pervasive in numerical programs, so high-level synthesis (HLS) tools use state-of-the-art scheduling techniques to pipeline them efficiently. Still, the run time performance of the resultant FPGA implementation is limited by data dependences between loop iterations. Some of these dependence constraints can be alleviated by rewriting the program according to arithmetic identities (e.g. associativity and distributivity), memory access reductions, and control flow optimisations (e.g. partial loop unrolling). HLS tools cannot safely enable such rewrites by default because they may impact the accuracy of floating-point computations and increase area usage. In this paper, we introduce the first open-source program optimizer for automatically rewriting a given program to optimize latency while controlling for accuracy and area. Our tool, SOAP3, reports a multi-dimensional Pareto frontier that the programmer can use to resolve the trade-off according to their needs. When applied to a suite of PolyBench and Livermore Loops benchmarks, our tool has generated programs that enjoy up to a 12x speedup, with a simultaneous 7x increase in accuracy, at a cost of up to 4x more LUTs.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128489015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DCPUF","authors":"Jing Ye, Yu Hu, Xiaowei Li","doi":"10.1145/2847263.2847312","DOIUrl":"https://doi.org/10.1145/2847263.2847312","url":null,"abstract":"With the development of Integrated Circuit (IC), it is a growing trend that the CPU and the FPGA are integrated into one chip. To improve the security of CPU+FPGA IC, we explore the reconfigurable feature of FPGA to implement a novel Dynamically Configured Physical Unclonable Function (DCPUF). PUF is a hardware security primitive that utilizes unpredictable process variations to produce particular challenge-response pairs, so even the chips with the same design would produce different responses for the same challenge. In the DCPUF, the FPGA configuration bits, which are specifically designed with dedicated placement and routing constraint, constitute the challenge. When a challenge is input to a CPU+FPGA IC, the CPU uses it to configure or partially configure the FPGA, and then waits for the FPGA to reply a response. In comparison with existing PUFs, the DCPUF has three major advantages: (1) different from existing PUFs with fixed designs, the logic of DCPUF is dynamically configured for each challenge, i.e. the circuits for producing different responses are different, leading to higher security; (2) much more electronic parameters affected by process variation are leveraged to make DCPUF more robust against attacks; (3) for CPU+FPGA IC, no extra hardware is needed. The experiments on real CPU+FPGA ICs show the proposed DCPUF keeps good randomness and stability.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114177461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CASK: Open-Source Custom Architectures for Sparse Kernels","authors":"Paul Grigoras, P. Burovskiy, W. Luk","doi":"10.1145/2847263.2847338","DOIUrl":"https://doi.org/10.1145/2847263.2847338","url":null,"abstract":"Sparse matrix vector multiplication (SpMV) is an important kernel in many scientific applications. To improve the performance and applicability of FPGA based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures. The architectures generated by our approach are between 3.8 to 48 times faster than the worst case architectures for each matrix, showing the benefits of instance specific design for SpMV.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115179939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Stratix™ 10 Highly Pipelined FPGA Architecture","authors":"D. Lewis, Gordon R. Chiu, J. Chromczak, David R. Galloway, Benjamin Gamsa, Valavan Manohararajah, Ian Milton, Tim Vanderhoek, John Van Dyken","doi":"10.1145/2847263.2847267","DOIUrl":"https://doi.org/10.1145/2847263.2847267","url":null,"abstract":"This paper describes architectural enhancements in the Altera Stratix? 10 HyperFlex? FPGA architecture, fabricated in the Intel 14nm FinFET process. Stratix 10 includes ubiquitous flip-flops in the routing to enable a high degree of pipelining. In contrast to the earlier architectural exploration of pipelining in pass-transistor based architectures, the direct drive routing fabric in Stratix-style FPGAs enables an extremely low-cost pipeline register. The presence of ubiquitous flip-flops simplifies circuit retiming and improves performance. The availability of predictable retiming affects all stages of the cluster, place and route flow. Ubiquitous flip-flops require a low-cost clock network with sufficient flexibility to enable pipelining of dozens of clock domains. Different cost/performance tradeoffs in a pipelined fabric and use of a 14nm process, lead to other modifications to the routing fabric and the logic element. User modification of the design enables even higher performance, averaging 2.3X faster in a small set of designs.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128171168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 4: Applications and System-level Tools","authors":"J. Hoe","doi":"10.1145/3250862","DOIUrl":"https://doi.org/10.1145/3250862","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121942249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}