{"title":"Accelerating FPGA Routing Through Algorithmic Enhancements and Connection-aware Parallelization","authors":"Yun Zhou, Dries Vercruyce, D. Stroobandt","doi":"10.1145/3406959","DOIUrl":"https://doi.org/10.1145/3406959","url":null,"abstract":"Routing is a crucial step in Field Programmable Gate Array (FPGA) physical design, as it determines the routes of signals in the circuit, which impacts the design implementation quality significantly. It can be very time-consuming to successfully route all the signals of large circuits that utilize many FPGA resources. Attempts have been made to shorten the routing runtime for efficient design exploration while expecting high-quality implementations. In this work, we elaborate on the connection-based routing strategy and algorithmic enhancements to improve the serial FPGA routing. We also explore a recursive partitioning-based parallelization technique to further accelerate the routing process. To exploit more parallelism by a finer granularity in both spatial partitioning and routing, a connection-aware routing bounding box model is proposed for the source-sink connections of the nets. It is built upon the location information of each connection’s source, sink, and the geometric center of the net that the connection belongs to, different from the existing net-based routing bounding box that covers all the pins of the entire net. We present that the proposed connection-aware routing bounding box is more beneficial for parallel routing than the existing net-based routing bounding box. The quality and runtime of the serial and multi-threaded routers are compared to the router in VPR 7.0.7. The large heterogeneous Titan23 designs that are targeted to a detailed representation of the Stratix IV FPGA are used for benchmarking. With eight threads, the parallel router using the connection-aware routing bounding box model reaches a speedup of 6.1× over the serial router in VPR 7.0.7, which is 1.24× faster than the one using the existing net-based routing bounding box model, while reducing the total wire-length by 10% and the critical path delay by 7%.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134067967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partitioning and Scheduling with Module Merging on Dynamic Partial Reconfigurable FPGAs","authors":"Qi Tang, Zhe Wang, Biao Guo, Li-Hua Zhu, Jibo Wei","doi":"10.1145/3403702","DOIUrl":"https://doi.org/10.1145/3403702","url":null,"abstract":"Field programmable gate array (FPGA) is ubiquitous nowadays and is applied to many areas. Dynamic partial reconfiguration (DPR) is introduced to most modern FPGAs, enabling changing the function of a part of the FPGA by dynamically loading new bitstreams to the logic regions without affecting the function of other parts of the FPGA. However, delivering the powerful capacity of the DPR FPGA to the user depends on the efficient partitioning and scheduling technology. This article proposes the module merging technique for the partitioning and scheduling problem to reduce the reconfiguration overhead and improve the schedule performance. An exact approach based on the integer linear programming (ILP) for the partitioning and scheduling problem with module merging is proposed. The ILP-based approach is capable of solving the problem optimally, and can be used to further improve the performance of schedules produced by other non-optimal algorithms; however, it is time-consuming to solve large-scale problems. Therefore, a K-sliced-ILP algorithm based on the methodology of divide-and-conquer is proposed, which is able to reduce the time complexity significantly with the solution quality being degraded marginally. Experiments are carried out with a set of real-life applications, and the result demonstrates the effectiveness of the proposed methods.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"71 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124385970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuzhi Zhang, Xiaozhe Shao, George Provelengios, Naveen Kumar Dumpala, Lixin Gao, R. Tessier
{"title":"CoNFV","authors":"Xuzhi Zhang, Xiaozhe Shao, George Provelengios, Naveen Kumar Dumpala, Lixin Gao, R. Tessier","doi":"10.1145/3409113","DOIUrl":"https://doi.org/10.1145/3409113","url":null,"abstract":"Network function virtualization (NFV) is a powerful networking approach that leverages computing resources to perform a time-varying set of network processing functions. Although microprocessors can be used for this purpose, their performance limitations and lack of specialization present implementation challenges. In this article, we describe a new heterogeneous hardware-software NFV platform called CoNFV that provides scalability and programmability while supporting significant hardware-level parallelism and reconfiguration. Our computing platform takes advantage of both field-programmable gate arrays (FPGAs) and microprocessors to implement numerous virtual network functions (VNF) that can be dynamically customized to specific network flow needs. The most distinctive feature of our system is the use of global network state to coordinate NFV operations. Traffic management and hardware reconfiguration functions are performed by a global coordinator that allows for the rapid sharing of network function states and continuous evaluation of network function needs. With the help of state sharing mechanism offered by the coordinator, customer-defined VNF instances can be easily migrated between heterogeneous middleboxes as the network environment changes. A resource allocation and scheduling algorithm dynamically assesses resource deployments as network flows and conditions are updated. We show that our deployment algorithm can successfully reallocate FPGA and microprocessor resources in a fraction of a second in response to changes in network flow capacity and network security threats including intrusion.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129842734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sebastian Sabogal, A. George, Christopher M. Wilson
{"title":"Reconfigurable Framework for Environmentally Adaptive Resilience in Hybrid Space Systems","authors":"Sebastian Sabogal, A. George, Christopher M. Wilson","doi":"10.1145/3398380","DOIUrl":"https://doi.org/10.1145/3398380","url":null,"abstract":"Due to ongoing innovations in both sensor technology and spacecraft autonomy, onboard space processing continues to be outpaced by the escalating computational demands required for next-generation missions. Commercial-off-the-shelf, hybrid system-on-chips, combining fixed-logic CPUs with reconfigurable-logic FPGAs, present numerous architectural advantages that address onboard computing challenges. However, commercial devices are highly susceptible to space radiation and require dependable computing strategies to mitigate radiation-induced single-event effects. Depending upon the mission, the dynamics of the near-Earth space-radiation environment expose spacecraft to radiation fluxes that can vary by several orders of magnitude. By adopting an adaptive approach to dependable computing, spacecraft computers can reconfigure system resources to efficiently accommodate changing environmental conditions to maximize system performance while satisfying availability constraints throughout the mission. In this article, we propose Hybrid, Adaptive, Reconfigurable Fault Tolerance (HARFT), a reconfigurable framework for environmentally adaptive resilience in hybrid space systems. Furthermore, we describe a methodology to model adaptive systems, represented as phased-mission systems using Markov chains, subject to the near-Earth space-radiation environment, using a combination of orbital perturbation, geomagnetic field, and single-event effect rate prediction tools. We apply this methodology to evaluate the HARFT architecture using various static and adaptive strategies for several orbital case studies and demonstrate the achievable performability gains.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"161 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126128100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing OpenCL-Based CNN Design on FPGA with Comprehensive Design Space Exploration and Collaborative Performance Modeling","authors":"Jiandong Mu, Wei Zhang, Hao Liang, Sharad Sinha","doi":"10.1145/3397514","DOIUrl":"https://doi.org/10.1145/3397514","url":null,"abstract":"Recent success in applying convolutional neural networks (CNNs) to object detection and classification has sparked great interest in accelerating CNNs using hardware-like field-programmable gate arrays (FPGAs). However, finding an efficient FPGA design for a given CNN model and FPGA board is not trivial since a strong background in hardware design and detailed knowledge of the target board are required. In this work, we try to solve this problem by design space exploration with a collaborative framework. Our framework consists of three main parts: FPGA design generation, coarse-grained modeling, and fine-grained modeling. In the FPGA design generation, we propose a novel data structure, LoopTree, to capture the details of the FPGA design for CNN applications without writing down the source code. Different LoopTrees, which indicate different FPGA designs, are automatically generated in this process. A coarse-grained model will evaluate LoopTrees at the operation level, e.g., add, mult, and so on, so that the most efficient LoopTrees can be selected. A fine-grained model, which is based on the source code, will then refine the selected design in a cycle-accurate manner. A set of comprehensive OpenCL-based designs have been implemented on board to verify our framework. An average estimation error of 8.87% and 4.8% has been observed for our coarse-grained model and fine-grained model, respectively. This is much lower than the prevalent operation-statistics-based estimation, which is obtained according to a predefined formula for specific loop schedules.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133408113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhiyuan Shao, Chenhao Liu, Ruoshi Li, Xiaofei Liao, Hai Jin
{"title":"Processing Grid-format Real-world Graphs on DRAM-based FPGA Accelerators with Application-specific Caching Mechanisms","authors":"Zhiyuan Shao, Chenhao Liu, Ruoshi Li, Xiaofei Liao, Hai Jin","doi":"10.1145/3391920","DOIUrl":"https://doi.org/10.1145/3391920","url":null,"abstract":"Graph processing is one of the important research topics in the big-data era. To build a general framework for graph processing by using a DRAM-based FPGA board with deep memory hierarchy, one of the reasonable methods is to partition a given big graph into multiple small subgraphs, represent the graph with a two-dimensional grid, and then process the subgraphs one after another to divide and conquer the whole problem. Such a method (grid-graph processing) stores the graph data in the off-chip memory devices (e.g., on-board or host DRAM) that have large storage capacities but relatively small bandwidths, and processes individual small subgraphs one after another by using the on-chip memory devices (e.g., FFs, BRAM, and URAM) that have small storage capacities but superior random access performances. However, directly exchanging graph (vertex and edge) data between the processing units in FPGA chip with slow off-chip DRAMs during grid-graph processing leads to limited performances and excessive data transmission amounts between the FPGA chip and off-chip memory devices. In this article, we show that it is effective in improving the performance of grid-graph processing on DRAM-based FPGA hardware accelerators by leveraging the flexibility and programmability of FPGAs to build application-specific caching mechanisms, which bridge the performance gaps between on-chip and off-chip memory devices, and reduce the data transmission amounts by exploiting the localities on data accessing. We design two application-specific caching mechanisms (i.e., vertex caching and edge caching) to exploit two types of localities (i.e., vertex locality and subgraph locality) that exist in grid-graph processing, respectively. Experimental results show that with the vertex caching mechanism, our system (named as FabGraph) achieves up to 3.1× and 2.5× speedups for BFS and PageRank, respectively, over ForeGraph when processing medium graphs stored in the on-board DRAM. With the edge caching mechanism, the extension of FabGraph (named as FabGraph+) achieves up to 9.96× speedups for BFS over FPGP when processing large graphs stored in the host DRAM.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128331933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohamed Eldafrawy, Andrew Boutros, S. Yazdanshenas, Vaughn Betz
{"title":"FPGA Logic Block Architectures for Efficient Deep Learning Inference","authors":"Mohamed Eldafrawy, Andrew Boutros, S. Yazdanshenas, Vaughn Betz","doi":"10.1145/3393668","DOIUrl":"https://doi.org/10.1145/3393668","url":null,"abstract":"Reducing the precision of deep neural network (DNN) inference accelerators can yield large efficiency gains with little or no accuracy degradation compared to half or single precision floating-point by enabling more multiplication operations per unit area. A wide range of precisions fall on the pareto-optimal curve of hardware efficiency vs. accuracy with no single precision dominating, making the variable precision capabilities of FPGAs very valuable. We propose three types of logic block architectural enhancements and fully evaluate a total of six architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the LUT fracturability and adding two adders to the ALM (4-bit Adder Double Chain architecture) leads to a 1.5× area reduction for arithmetic heavy machine learning (ML) kernels, while increasing their speed. In addition, this architecture also reduces the logic area of general applications by 6%, while increasing the critical path delay by only 1%. However, our highest impact option, which adds a 9-bit shadow multiplier to the logic clusters, reduces the area and critical path delay of ML kernels by 2.4× and 1.2×, respectively. These large gains come at a cost of 15% logic area increase for general applications.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121657033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin Edward Murray, Oleg Petelin, Sheng Zhong, Jia Min Wang, Mohamed Eldafrawy, Jean-Philippe Legault, Eugene Sha, Aaron G. Graham, Jean Wu, Matthew James Peter Walker, Hanqing Zeng, Panagiotis Patros, Jason Luu, Kenneth Blair Kent, Vaughn Betz
{"title":"VTR 8","authors":"Kevin Edward Murray, Oleg Petelin, Sheng Zhong, Jia Min Wang, Mohamed Eldafrawy, Jean-Philippe Legault, Eugene Sha, Aaron G. Graham, Jean Wu, Matthew James Peter Walker, Hanqing Zeng, Panagiotis Patros, Jason Luu, Kenneth Blair Kent, Vaughn Betz","doi":"10.1145/3388617","DOIUrl":"https://doi.org/10.1145/3388617","url":null,"abstract":"Developing Field-programmable Gate Array (FPGA) architectures is challenging due to the competing requirements of various application domains and changing manufacturing process technology. This is compounded by the difficulty of fairly evaluating FPGA architectural choices, which requires sophisticated high-quality Computer Aided Design (CAD) tools to target each potential architecture. This article describes version 8.0 of the open source Verilog to Routing (VTR) project, which provides such a design flow. VTR 8 expands the scope of FPGA architectures that can be modelled, allowing VTR to target and model many details of both commercial and proposed FPGA architectures. The VTR design flow also serves as a baseline for evaluating new CAD algorithms. It is therefore important, for both CAD algorithm comparisons and the validity of architectural conclusions, that VTR produce high-quality circuit implementations. VTR 8 significantly improves optimization quality (reductions of 15% minimum routable channel width, 41% wirelength, and 12% critical path delay), run-time (5.3× faster) and memory footprint (3.3× lower). Finally, we demonstrate VTR is run-time and memory footprint efficient, while producing circuit implementations of reasonable quality compared to highly-tuned architecture-specific industrial tools—showing that architecture generality, good implementation quality, and run-time efficiency are not mutually exclusive goals.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128081810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-based Design of Hardware SC Polar Decoders for FPGAs","authors":"Yann Delomier, B. Gal, J. Crenne, C. Jégo","doi":"10.1145/3391431","DOIUrl":"https://doi.org/10.1145/3391431","url":null,"abstract":"Polar codes are a new error correction code family that should be benchmarked and evaluated in comparison to LDPC and turbo-codes. Indeed, recent advances in the 5G digital communication standard recommended the use of polar codes in EMBB control channels. However, in many cases, the implementation of efficient FEC hardware decoders is challenging. Specialised knowledge is required to enable and facilitate testing, rapid design iterations, and fast prototyping. In this article, a model-based design methodology to generate efficient hardware SC polar code decoders is presented. With HLS design process and tools, we demonstrate how FPGA system designers can quickly develop complex hardware systems with good performances. The favourable impact of design space exploration is underlined on achievable performances when a relevant computation model is used. The flexibility of the abstraction layers is evaluated. Hardware decoder generation efficiency is assessed and compared to competing approaches. It is shown that the fine-tuning of computation parallelism, bit length, pruning level, and working frequency help to design high-throughput decoders with moderate hardware complexities. Decoding throughputs higher than 300 Mbps are achieved on an Xilinx Virtex-7 device and on an Altera Stratix IV device.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131655320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maciej Besta, Marc Fischer, Tal Ben-Nun, D. Stanojevic, J. D. F. Licht, T. Hoefler
{"title":"Substream-Centric Maximum Matchings on FPGA","authors":"Maciej Besta, Marc Fischer, Tal Ben-Nun, D. Stanojevic, J. D. F. Licht, T. Hoefler","doi":"10.1145/3377871","DOIUrl":"https://doi.org/10.1145/3377871","url":null,"abstract":"Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and others. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forego popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams processed independently to enable more parallelism while lowering communication costs. We base our work on the theory of streaming graph algorithms and analyze 14 models and 28 algorithms. We use this analysis to provide theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4× speedup over tuned parallel CPU variants), low memory, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115651176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}