{"title":"FPGA '22: The 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Virtual Event, USA, 27 February 2022 - 1 March 2022","authors":"","doi":"10.1145/3490422","DOIUrl":"https://doi.org/10.1145/3490422","url":null,"abstract":"","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78063150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Young-Kyu Choi, Yuze Chi, Weikang Qiao, Nikola Samardzic, Jason Cong
{"title":"HBM Connect: High-Performance HLS Interconnect for FPGA HBM.","authors":"Young-Kyu Choi, Yuze Chi, Weikang Qiao, Nikola Samardzic, Jason Cong","doi":"10.1145/3431920.3439301","DOIUrl":"https://doi.org/10.1145/3431920.3439301","url":null,"abstract":"<p><p>With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bounded applications to benefit from FPGA acceleration. However, fully utilizing the available bandwidth may not be an easy task. If an application requires multiple processing elements to access multiple HBM channels, we observed a significant drop in the effective bandwidth. The existing high-level synthesis (HLS) programming environment had limitation in producing an efficient communication architecture. In order to solve this problem, we propose HBM Connect, a high-performance customized interconnect for FPGA HBM board. Novel HLS-based optimization techniques are introduced to increase the throughput of AXI bus masters and switching elements. We also present a high-performance customized crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect is demonstrated using Xilinx's Alveo U280 HBM board. Based on bucket sort and merge sort case studies, we explore several design spaces and find the design point with the best resource-performance trade-off. The result shows that HBM Connect improves the resource-performance metrics by 6.5X-211X.</p>","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"2021 ","pages":"116-126"},"PeriodicalIF":0.0,"publicationDate":"2021-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3431920.3439301","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25576106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, Jason Cong
{"title":"AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs.","authors":"Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, Jason Cong","doi":"10.1145/3431920.3439289","DOIUrl":"10.1145/3431920.3439289","url":null,"abstract":"<p><p>Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable frequency between an HLS design and a handcrafted RTL one. A key factor that limits the timing quality of the HLS outputs is the difficulty in accurately estimating the interconnect delay at the HLS level. This problem becomes even worse when large HLS designs are implemented on the latest multi-die FPGAs. To tackle this challenge, we propose AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation. First, our approach provides HLS with a view on the global physical layout of the design, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining, the floorplanner is able to distribute the design logic across multiple dies on the FPGA device without degrading clock frequency. This prevents the placer from aggressively packing the logic on a single die which often results in local routing congestion that eventually degrades timing. Since pipelining may introduce additional latency, we further present analysis and algorithms to ensure the added latency will not compromise the overall throughput. AutoBridge can be integrated into the existing CAD toolflow for Xilinx FPGAs. In our experiments with a total of 43 design configurations, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The tool is available at https://github.com/Licheng-Guo/AutoBridge.</p>","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"2021 ","pages":"81-92"},"PeriodicalIF":0.0,"publicationDate":"2021-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8041363/pdf/nihms-1684764.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25606543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA '21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Virtual Event, USA, February 28 - March 2, 2021","authors":"","doi":"10.1145/3431920","DOIUrl":"https://doi.org/10.1145/3431920","url":null,"abstract":"","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76913622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, February 23-25, 2020","authors":"","doi":"10.1145/3373087","DOIUrl":"https://doi.org/10.1145/3373087","url":null,"abstract":"","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79643385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. R. Chalamalasetti, Kevin T. Lim, M. Wright, Alvin AuYoung, Parthasarathy Ranganathan, M. Margala
{"title":"An FPGA memcached appliance","authors":"S. R. Chalamalasetti, Kevin T. Lim, M. Wright, Alvin AuYoung, Parthasarathy Ranganathan, M. Margala","doi":"10.1145/2435264.2435306","DOIUrl":"https://doi.org/10.1145/2435264.2435306","url":null,"abstract":"Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, these systems are not well matched for commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core Memcached functions do not benefit from the high performance of standard server CPUs. In this paper, we demonstrate the design of an FPGA-based Memcached appliance. We take Memcached, a complex software system, and implement its core functionality on an FPGA. By leveraging the FPGA's design and utilizing its customizable logic to create a specialized appliance we are able to tightly integrate networking, compute, and memory. This integration allows us to overcome many of the bottlenecks found in standard servers. Our design provides performance on-par with baseline servers, but consumes only 9% of the power of the baseline. Scaled out, we see benefits at the data center level, substantially improving the performance-per-dollar while improving energy efficiency by 3.2X to 10.9X.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"32 1","pages":"245-254"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77762303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel run-time auto-reconfigurable FPGA architecture for fast fault recovery with backward compatibility (abstract only)","authors":"H. Baig, Jeong-A Lee","doi":"10.1145/2435264.2435325","DOIUrl":"https://doi.org/10.1145/2435264.2435325","url":null,"abstract":"A self-repairing fault-tolerant FPGA architecture is developed which is also compatible with existing island-style routing network. Due to this backward compatibility, the proposed architecture can not only be implemented easily in the existing FPGA devices but a new fault-tolerant FPGA device can also be fabricated utilizing the existing island-style routing architecture. A generic fault-tolerant Computation Cell is developed which can be incorporated in existing FPGA CLB (Configurable Logic Block) having 8 LUTs at least. The proposed fault-tolerant FPGA architecture is comprised of Computation Tiles each of which consists of computation cells which are able to heal themselves from transient errors. Computation Tile also contains stem cells which help computation cells to recover from permanent errors all at once. This architecture is centrally controlled by an on-chip fault-tolerant core whose main responsibility is to define the healing priority when an error occurs in more than one of the computation tile at the same time. It also communicates with the external PC software which identifies the faulty tile and reconfigures it through dynamic partial reconfiguration. The robust operation of a proposed architecture is implemented and verified on XILINX Virtex-5 FPGA device. From our proposed fault-tolerant scheme of utilizing the existing routing strategies together with partial reconfiguration of stem cells we achieved a number of benefits, including a fast fault recovery and avoidance of using complicated routing strategies, as compared to recently developed fault-tolerant FPGA architectures.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"274 1","pages":"270"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76743428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yunru Bai, Abigail Fuentes-Rivera, Mingjie Lin, Mike Riera
{"title":"Exploiting algorithmic-level memory parallelism in distributed logic-memory architecture through hardware-assisted dynamic graph (abstract only)","authors":"Yunru Bai, Abigail Fuentes-Rivera, Mingjie Lin, Mike Riera","doi":"10.1145/2435264.2435333","DOIUrl":"https://doi.org/10.1145/2435264.2435333","url":null,"abstract":"Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures. Being \"memory-centric\", the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to (A2). All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"3 1","pages":"273"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82726811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Lewis, David Cashman, M. Chan, J. Chromczak, G. Lai, Andy Lee, Tim Vanderhoek, Haiming Yu
{"title":"Architectural enhancements in Stratix V™","authors":"D. Lewis, David Cashman, M. Chan, J. Chromczak, G. Lai, Andy Lee, Tim Vanderhoek, Haiming Yu","doi":"10.1145/2435264.2435292","DOIUrl":"https://doi.org/10.1145/2435264.2435292","url":null,"abstract":"This paper describes architectural enhancements in the Altera Stratix-V\" FPGA architecture, built on a 28nm TSMC process, together with the data supporting those choices. Among the key features are time borrowing flip-flops, a doubling of the number of flip-flops per LUT compared to previous Stratix architectures, a simplified embedded 20kb dual-port RAM block, and error correction that can correct up to 8 adjacent errors. Arithmetic performance is significantly improved using a fast adder with two levels of multi-bit skip. We also describe how the routing architecture and layout is optimized for the 28nm process to take advantage of a wider range of wire thicknesses offered on the different layers, and improvements in performance and routability are obtained without dramatic changes to the repeated floorplan of the logic plus routing fabric.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"21 3 1","pages":"147-156"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82913781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Zuo, Yun Liang, Peng Li, K. Rupnow, Deming Chen, J. Cong
{"title":"Improving high level synthesis optimization opportunity through polyhedral transformations","authors":"Wei Zuo, Yun Liang, Peng Li, K. Rupnow, Deming Chen, J. Cong","doi":"10.1145/2435264.2435271","DOIUrl":"https://doi.org/10.1145/2435264.2435271","url":null,"abstract":"High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual designs including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. We may generally classify the optimizations as those that optimize hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). However, both optimizations are in practice difficult to apply. Real-world applications contain data-dependent blocks of code and communicate through complex data access patterns. Existing high level synthesis tools cannot apply these powerful optimizations unless the code is inherently compatible, severely limiting the optimization opportunity. In this paper we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication. Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations to enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. Furthermore, our framework can automatically generate the optimized communication blocks for fine-grained communication between hardware blocks. Experimental evaluation demonstrates that we can achieve an average of 6.04X speedup over the high level synthesis solution without our transformations to enable intra- and inter-block optimizations.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"27 1","pages":"9-18"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88951574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}