Timothy Linscott, Benjamin Gojman, Raphael Rubin, A. DeHon
{"title":"Pitfalls and Tradeoffs in Simultaneous, On-Chip FPGA Delay Measurement","authors":"Timothy Linscott, Benjamin Gojman, Raphael Rubin, A. DeHon","doi":"10.1145/2847263.2847334","DOIUrl":"https://doi.org/10.1145/2847263.2847334","url":null,"abstract":"Recent work shows how to use on-chip structures to measure the fabricated delays of fine-grained resources on modern FPGAs. We show that simultaneous measurement of multiple, disjoint paths will result in different measured delays from isolated configurations that measure a single path. On the Cyclone III, we show differences as large as +/-33ps on 2ns-long paths, even if the simultaneously configured logic is not active. This is over 20x the measurement precision used on these devices and over 50% of the observed delay spread in prior work. We characterize the magnitude of the impact of simultaneous measurements and identify strategies and cases that can reduce the difference. Furthermore, we provide a potential explanation for our observations in terms of self-heating and the configurable clock network architecture. These experiments point to phenomena that must be characterized to better formulate on-chip FPGA delay measurements and to properly interpret their results.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130728511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nadesh Ramanathan, John Wickerson, F. Winterstein, G. Constantinides
{"title":"A Case for Work-stealing on FPGAs with OpenCL Atomics","authors":"Nadesh Ramanathan, John Wickerson, F. Winterstein, G. Constantinides","doi":"10.1145/2847263.2847343","DOIUrl":"https://doi.org/10.1145/2847263.2847343","url":null,"abstract":"We provide a case study of work-stealing, a popular method for run-time load balancing, on FPGAs. Following the Cederman-Tsigas implementation for GPUs, we synchronize work-items not with locks, mutexes or critical sections, but instead with the atomic operations provided by Altera's OpenCL SDK. We evaluate work-stealing for FPGAs by synthesizing a K-means clustering algorithm on an Altera P385 D5 board, both with work-stealing and with a statically-partitioned load. When block RAM utilization is maximised in both cases, we find that work-stealing leads to a 1.5x speedup. This demonstrates that the ability to do load balancing at run-time can outweigh the drawback of using 'expensive' atomics on FPGAs. We hope that our case study will stimulate further research into the high-level synthesis of fine-grained, lock-free, concurrent programs.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124778208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial Debug & Debug Without Re-programming in FPGAs: On-Chip debugging in FPGAs","authors":"P. Shanker","doi":"10.1145/2847263.2847286","DOIUrl":"https://doi.org/10.1145/2847263.2847286","url":null,"abstract":"The SmartFusion2 family of FPGAs from Microsemi introduces novel silicon technology that enables minimally intrusive, spatial debug capabilities. Spatial debug concerns itself with observing and controlling sequential elements in the user's Design Under Test (DUT) at an instant of time, i.e. in a specific clock cycle. This capability is made possible by the in-situ, always available probe network running at 50MHz in SmartFusion2. Observing and controlling the DUT is less intrusive than conventional methods. Furthermore, no instrumentation and no re-programming of the FPGA device is required. This reduces the number of debug iterations (test re-runs) and accelerates design bring-up in the lab. This session showcases a technique to debug pseudo-static signals, i.e. sequential elements that remain static over a duration of time spanning many clock cycles of the probe network (50MHz). Part or all of the sequential logic in the DUT can be read out via the JTAG or the SPI interface, while the DUT is running. This technique of observation is non-intrusive. A method to debug the DUT using clock halting is presented. In such a method, the clock of the DUT is halted based on a trigger signal that is external or internal to the DUT. The trigger signal can be dynamically chosen without re-programming the device. Once the trigger fires and the clock is halted using a glitchless clock gate, any portion of the sequential logic in the DUT can be written to (altered) and then, if required, the user clock can be gated ON to resume normal operation. Though somewhat intrusive, this technique of controlling hard to reach DUT states is invaluable in certain debug situations.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"191 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122056138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shlomi Alkalay, Hari Angepat, Adrian M. Caulfield, Eric S. Chung, Oren Firestein, M. Haselman, S. Heil, K. Holohan, M. Humphrey, Tamás Juhász, P. Kaur, S. Lanka, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Andrew Putnam, Raja Seera, Rimon Tadros, J. Thong, Lisa Woods, Derek Chiou, D. Burger
{"title":"Agile Co-Design for a Reconfigurable Datacenter","authors":"Shlomi Alkalay, Hari Angepat, Adrian M. Caulfield, Eric S. Chung, Oren Firestein, M. Haselman, S. Heil, K. Holohan, M. Humphrey, Tamás Juhász, P. Kaur, S. Lanka, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Andrew Putnam, Raja Seera, Rimon Tadros, J. Thong, Lisa Woods, Derek Chiou, D. Burger","doi":"10.1145/2847263.2847287","DOIUrl":"https://doi.org/10.1145/2847263.2847287","url":null,"abstract":"In 2015, a team of software and hardware developers at Microsoft shipped the world's first commercial search engine accelerated using FPGAs in the datacenter. During the sprint to production, new algorithms in the Bing ranking service were ported into FPGAs and deployed to a production bed within several weeks of conception, leading to significant gains in latency and throughput. The fast turnaround time of new features demanded by an agile software culture would not have been possible without a disciplined and effective approach to co-design in the datacenter. This talk will describe some of the learnings and best practices developed from this unique experience.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125225129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic-Based Spin-Programmable Gate Array with Emerging MTJ Device Technology (Abstract Only)","authors":"Yu Bai, Mingjie Lin","doi":"10.1145/2847263.2847317","DOIUrl":"https://doi.org/10.1145/2847263.2847317","url":null,"abstract":"This paper describes the stochastic-based Spin-Programmable Gate Array (SPGA), an innovative architecture attempting to exploit the stochastic switching behavior newly found in emerging spintronic devices for reconfigurable computing. While many recent studies have investigated using Spin Transfer Torque Memory (STTM) devices to replace configuration memory in FPGAs, our study, for the first time, attempts to use the quantum-induced stochastic property exhibited by spintronic devices directly for reconfiguration and logic computation. Specifically, the SPGA was designed from scratch for high performance, routability, and ease-of-use. It supports variable granularity multiple-input-multiple-output (MIMO) logic blocks and variable-length bypassing interconnects with a symmetrical structure. Due to its unconventional architectural features, the SPGA requires several major modifications to be made in the standard VPR placement/routing CAD flow, which include a new technology mapping algorithm based on computing (k, l)-cut, a new placement algorithm, and a modified delay-based routing procedure. Our mixed mode simulation results have shown that, with FPGA architecture innovations, on average, an SPGA can further achieve more than 10x improvement in logic density, about 5x improvement in average net delay, and about 5x improvement in the critical path delay for the largest 12 MCNC benchmark circuits over an island-style baseline FPGA with spintronic configuration bits.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122638367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Testing FPGA Local Interconnects Based on Repeatable Configuration Modules (Abstract Only)","authors":"Zhendi Yang, Jian Wang, Meng Yang, Jinmei Lai","doi":"10.1145/2847263.2847309","DOIUrl":"https://doi.org/10.1145/2847263.2847309","url":null,"abstract":"This paper provides a novel technique for testing FPGA local interconnects based on repeatable configuration modules (RCMs). In order to fully detect all the possible faults, local interconnects together with the adjacent logic blocks in an FPGA are programmed to form a set of RCMs that are repeatable all over the FPGA array. After the RCMs for configurable logic blocks (CLBs) and other types of embedded cores (such as digital signal processor, block random access memory) are constructed, test configurations are generated by connecting the RCMs one by one throughout the whole FPGA array. The number of test configurations depends on the structure of the FPGA and the exact types of hard cores inside the FPGA. Experimental results show that a total of 47 test configurations are sufficient to achieve 96.2% fault coverage for Xilinx XC4VLX200 FPGA local interconnects. This project is supported by the State Key Laboratory of ASIC and System, Fudan University, No. 2015MS007.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"185 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125780284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Wirthlin, Andrew M. Keller, Chase McCloskey, Parker Ridd, David Lee, J. Draper
{"title":"SEU Mitigation and Validation of the LEON3 Soft Processor Using Triple Modular Redundancy for Space Processing","authors":"M. Wirthlin, Andrew M. Keller, Chase McCloskey, Parker Ridd, David Lee, J. Draper","doi":"10.1145/2847263.2847278","DOIUrl":"https://doi.org/10.1145/2847263.2847278","url":null,"abstract":"Processors are an essential component in most satellite payload electronics and handle a variety of functions including command handling and data processing. There is growing interest in implementing soft processors on commercial FPGAs within satellites. Commercial FPGAs offer reconfigurability, large logic density, and I/O bandwidth; however, they are sensitive to ionizing radiation and systems developed for space must implement single-event upset mitigation to operate reliably. This paper investigates the improvements in reliability of a LEON3 soft processor operating on an SRAM-based FPGA when using triple-modular redundancy and other processor-specific mitigation techniques. The improvements in reliability provided by these techniques are validated with both fault injection and heavy ion radiation tests. The fault injection experiments indicate an improvement of 51× and the radiation testing results demonstrate an average improvement of 10×. Orbit failure rate estimations were computed and suggest that the TMR LEON3 processor has a mean time to failure of over 76 years in a geosynchronous orbit.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131058374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Physical Design of 3D FPGAs Embedded with Micro-channel-based Fluidic Cooling","authors":"Zhiyuan Yang, Ankur Srivastava","doi":"10.1145/2847263.2847275","DOIUrl":"https://doi.org/10.1145/2847263.2847275","url":null,"abstract":"Through Silicon Via (TSV) based 3D integration technology is a promising technology to increase the performance of FPGAs by achieving shorter global wire-length and higher logic density. However, 3D FPGAs also suffer from severe thermal problems due to the increase in power density and thermal resistance. Moreover, past work has shown that leakage power can account for 40% of the total power at current technology nodes and leakage power increases non-linearly with temperature. This intensifies the thermal problem in 3D FPGAs and more aggressive cooling methods such as micro-channel based fluidic cooling are required to fully exploit their benefits. The interaction between micro-channel heat sink design and the performance of a 3D FPGA is very complicated and a comprehensive approach is required to identify the optimal design of 3D FPGAs subject to thermo-electrical constraints. In this work, we propose an analysis framework for 3D FPGAs embedded with micro-channel-based fluidic cooling to study the impact of channel density on cooling and performance. According to our simulation results, we provide guidelines for designing 3D FPGAs embedded with micro-channel cooling and identify the optimal design for each benchmark. Compared to naive 3D FPGA designs which use a fixed thermal heat sink, the optimal design identified using our framework can improve the operating frequency and energy efficiency by up to 80.3% and 124.0%, respectively.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130980672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
James J. Davis, Eddie Hung, Joshua M. Levine, Edward A. Stott, P. Cheung, G. Constantinides
{"title":"Knowledge is Power: Module-level Sensing for Runtime Optimisation (Abstact Only)","authors":"James J. Davis, Eddie Hung, Joshua M. Levine, Edward A. Stott, P. Cheung, G. Constantinides","doi":"10.1145/2847263.2847316","DOIUrl":"https://doi.org/10.1145/2847263.2847316","url":null,"abstract":"We propose the compile-time instrumentation of coexisting modules (IP blocks, accelerators, etc.) implemented in FPGAs. The efficient mapping of tasks to execution units can then be achieved, for power and/or timing performance, by tracking dynamic power consumption and/or timing slack online at module-level granularity. Our proposed instrumentation is transparent, thereby not affecting circuit functionality. Power and timing overheads have proven to be small and tend to be outweighed by the exposed runtime benefits. Dynamic power consumption can be inferred through the measurement of switching activity on indicative, frequently toggling nets. Online analysis is able to derive a live power breakdown by building and updating a model fed with per-module activity counts and system-wide power consumption. Such a model can be continuously refined and its use allows the tracking of unpredictable phenomena, including degradation. Online measurement of slack in critical (and near-critical) paths facilitates the safe erosion of static timing analysis-derived guardbands. This then enables the co-optimisation of power and timing performance under given external operating constraints, including those which change over time. Assuming functional compatibility, high-priority tasks would suit execution within modules with excess slack. This slack could then be reduced via dynamic frequency scaling, thereby increasing throughput.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132648878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Activity Aware Placement Approach For 3D FPGAs (Abstract Only)","authors":"Girish Deshpande, D. Bhatia","doi":"10.1145/2847263.2847322","DOIUrl":"https://doi.org/10.1145/2847263.2847322","url":null,"abstract":"In order to cope with increasing demand for higher logic densities and shrinking feature sizes, there has been a concerted effort by academia and industry towards the design of three dimensional integrated circuits (3D ICs). Various architectural approaches have been investigated over the past few years in order to realize functional 3D ICs. A majority of such research has been focused on devices such as memories, caches and other application specific circuits. Not much work has been done in the FPGA community on the exploration of 3D FPGAs both at the architectural and EDA levels. This work aims to look at placement methodologies and metrics for island style 3D FPGAs from a thermal perspective. The novelty of our approach lies in the fact that, unlike previous related works on 3D FPGA placement which rely solely on wirelength and TSV (Through Silicon Via)-count minimization to evaluate placement, we propose a 3D placer that also takes into consideration the transition density of each net to ensure a more thermally balanced spatial distribution of nets on the chip. This placement methodology tries to place nets which exhibit higher transition densities on the lowermost layer of the FPGA. The lowest layer is typically closest to the heat sink, and placing nets with higher switching activity on this layer will aid heat dissipation in a more effective manner and reduce hot spots on the chip. This placer was tested on a four layer 3D FPGA model using MCNC benchmarks and on average, around 40% of high activity nets were placed on the lowest layer as compared to a placer that did not employ transition density based cost scaling during placement.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133647432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}