{"title":"Triggered Scheduling: Efficient Detection of Dataflow Network Idleness on Heterogeneous Systems","authors":"Mahyar Emami, E. Bezati, J. Janneck, J. Larus","doi":"10.1145/3431920.3439473","DOIUrl":"https://doi.org/10.1145/3431920.3439473","url":null,"abstract":"Hardware-software codesign for FPGAs requires flexible and changeable boundaries between hardware and software. Design space exploration is facilitated by expressing programs in a language that can be compiled for both CPU and FPGA execution. Such an approach requires efficient and general communication mechanisms between hardware and software. We present a practical solution to this problem for heterogeneous programs expressed in CAL, an actor-based language, running on a PCIe-based FPGA system where communication between a processor and FPGA is relatively expensive. We show how a network of continuously executing software and hardware actors with fine-grained communication can be expressed as a coprocessor model that executes the network in discrete steps with efficient coarse-grained transfers across the PCIe bus. To this end, we present the Triggered Scheduling (TS) algorithm to detect idleness (i.e., lack of forward progress) of a dynamic actor network with unpredictable consumption/production rates. With TS, it is possible to treat a network of actors running on hardware as a coprocessor that can be called by software. We show how TS can be used to build a truly heterogeneous system on an HLS platform. 
Using 4 large benchmarks, we analyze the performance and resource utilization of the Triggered Scheduling algorithm.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"257 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120886727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fuzzing High-Level Synthesis Tools","authors":"Zewei Du, Yann Herklotz, Nadesh Ramanathan, John Wickerson","doi":"10.1145/3431920.3439466","DOIUrl":"https://doi.org/10.1145/3431920.3439466","url":null,"abstract":"High-level synthesis (HLS) is becoming an increasingly important part of the computing landscape, even in safety-critical domains where correctness is key. As such, HLS tools are increasingly relied upon. But are they trustworthy? We have subjected three widely used HLS tools - LegUp, Xilinx Vivado HLS, and the Intel HLS Compiler - to a rigorous fuzzing campaign using thousands of random, valid C programs that we generated using a modified version of the Csmith tool. For each C program, we compiled it to a hardware design using the HLS tool under test and checked whether that hardware design generates the same output as an executable generated by the GCC compiler. When discrepancies arose between GCC and the HLS tool under test, we reduced the C program to a minimal example in order to zero in on the potential bug. Our testing campaign has revealed that all three HLS tools can be made either to crash or to generate wrong code when given valid C programs, and thereby underlines the need for these increasingly trusted tools to be more rigorously engineered. 
Out of 6700 test cases, we found 272 programs that failed in at least one tool, out of which we were able to discern at least 6 unique bugs.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132621960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global Is the New Local: FPGA Architecture at 5nm and Beyond","authors":"Stefan Nikolic, F. Catthoor, Z. Tokei, P. Ienne","doi":"10.1145/3431920.3439300","DOIUrl":"https://doi.org/10.1145/3431920.3439300","url":null,"abstract":"It takes only high-school physics to appreciate that the resistance of a wire grows with a diminishing cross section, and a quick look at any plot about Moore's law immediately suggests that this cross section must decrease over time. Clearly, everyone can easily imagine that this trend must have a deep influence on FPGA architectures. What is difficult to predict is whether and when well-established architectural ideas will break---and what can replace them. Unfortunately, in architectural research, we often use fairly simplistic models of the underlying technology nodes, which limits our ability to visualize the detailed impact of technology evolution. In this paper, we develop, from the available industrial disclosures, a consistent electrical model of the metal stacks of recent and current technologies, as well as future trends. We combine it with a plausible layout strategy to obtain an accurate idea of how wire characteristics now play into architectural decisions. To demonstrate our models, necessarily speculative due to the paucity of reliable industrial information, we use them to explore the evolution of a typical architectural family across technology nodes and to reevaluate one of the most basic design parameters---namely, cluster size. We notice effects which may in fact explain some recent changes in commercial architectures. We also observe how conventional architectures may fail to take advantage of the performance improvements of future nodes. 
Although conceptually straightforward, this study signals how profoundly our understanding of FPGAs will be affected by technology while moving towards the 3 nm node.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116688724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA-based 7-ENOB 600 MSample/s ADC without any External Components","authors":"Lukas Leuenberger, D. Amiet, Tao Wei, P. Zbinden","doi":"10.1145/3431920.3439287","DOIUrl":"https://doi.org/10.1145/3431920.3439287","url":null,"abstract":"Analog to digital converters (ADCs) are indispensable nowadays. Analog signals are digitized earlier and earlier in the processing chain to reduce the need for complex analog signal processing. For this reason, ADCs are often integrated directly into field-programmable gate arrays (FPGA) or microprocessors. However, such ADCs are designed for a specific set of requirements with limited flexibility. In this paper, a new structure of an FPGA-based ADC is proposed. The ADC is based on the slope ADC, where a time-to-digital converter (TDC) measures the time from the beginning of a reference slope until the slope reaches the voltage-to-be-measured. Only FPGA-internal elements are used to build the ADC. It is fully reconfigurable and does not require any external components. This innovation offers the flexibility to convert almost any digital input/output (I/O) into an ADC. Considering the very high number of digital I/O ports available in today's FPGA systems, this enables the construction of a massive and powerful ADC array directly on a standard FPGA. The proposed ADC has a resolution of 9.3 bit and achieves an effective number of bits (ENOB) of 7 at a sample rate of 600 MSample/s. The differential nonlinearity (DNL) ranges from -0.9 to 0.9 bit, and the integral nonlinearity (INL) is in the range between -1.1 and 0.9 bit. 
An alternative version of the ADC operates at 1.2 GSample/s and achieves an ENOB of 5.3.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133231578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NASCENT: Near-Storage Acceleration of Database Sort on SmartSSD","authors":"Sahand Salamat, Armin Haj Aboutalebi, Behnam Khaleghi, Joo Hwan Lee, Y. Ki, T. Simunic","doi":"10.1145/3431920.3439298","DOIUrl":"https://doi.org/10.1145/3431920.3439298","url":null,"abstract":"As the size of data generated every day grows dramatically, the computational bottleneck of computer systems has shifted toward the storage devices. Thanks to recent developments in storage devices, the interface between the storage and the computational platforms has become the main limitation as it provides limited bandwidth which does not scale when the number of storage devices increases. Interconnect networks limit the performance of the system when independent operations are executing on different storage devices since they do not provide simultaneous accesses to all the storage devices. Offloading the computations to the storage devices eliminates the burden of data transfer from the interconnects. Emerging as a nascent computing trend, near-storage computing offloads a portion of computation to the storage devices to accelerate big data applications. In this paper, we propose a near-storage accelerator for database sort, NASCENT, which utilizes Samsung SmartSSD, an NVMe flash drive with an on-board FPGA chip that processes data in-situ. We propose, to the best of our knowledge, the first near-storage database sort based on bitonic sort which considers the specifications of the storage devices to increase the scalability of computer systems as the number of storage devices increases. NASCENT improves both performance and energy efficiency as the number of storage devices increases. 
With 12 SmartSSDs, NASCENT is 7.6x (147.2x) faster and 5.6x (131.4x) more energy efficient than the FPGA (CPU) baseline.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116650526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FABulous: An Embedded FPGA Framework","authors":"Dirk Koch, N. Dao, Bea Healy, Jing Yu, Andrew Attwood","doi":"10.1145/3431920.3439302","DOIUrl":"https://doi.org/10.1145/3431920.3439302","url":null,"abstract":"At the end of CMOS scaling, the role of architecture design is increasingly gaining importance. Supporting this trend, customizable embedded FPGAs are an ingredient in ASIC architectures to provide the advantages of reconfigurable hardware exactly where and how it is most beneficial. To enable this, we introduce the FABulous embedded open-source FPGA framework. FABulous is designed to fulfill the objectives of ease of use, maximum portability to different process nodes, good control for customization, and good area, power, and performance characteristics for the generated FPGA fabrics. The framework provides templates for logic, arithmetic, memory, and I/O blocks that can be easily stitched together, whilst enabling users to add their own fully customized blocks and primitives. The FABulous ecosystem generates the embedded FPGA fabric for chip fabrication, integrates Yosys, ABC, VPR and nextpnr as FPGA CAD tools, and handles bitstream generation and post-fabrication tests. Additionally, we provide an emulation path for system development. 
FABulous was demonstrated for an ASIC integrating a RISC-V core with an embedded FPGA fabric for custom instruction set extensions using a TSMC 180nm process and an open-source 45nm process node.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122143132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MOCHA","authors":"Peipei Zhou, Jiayi Sheng, Cody Hao Yu, Peng Wei, Jie Wang, Di Wu, J. Cong","doi":"10.1093/acref/9780192803511.013.0797","DOIUrl":"https://doi.org/10.1093/acref/9780192803511.013.0797","url":null,"abstract":"","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123981062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring PGAS Communication for Heterogeneous Clusters with FPGAs","authors":"Varun Sharma, P. Chow","doi":"10.1145/3431920.3439469","DOIUrl":"https://doi.org/10.1145/3431920.3439469","url":null,"abstract":"This work presents a heterogeneous communication library for generic clusters of processors and FPGAs. This library, Shoal, supports the Partitioned Global Address Space (PGAS) memory model for applications. PGAS is a shared memory model for clusters that creates a distinction between local and remote memory access. Through Shoal and its common application programming interface for hardware and software, applications can be more freely migrated to the optimal platform and deployed onto dynamic cluster topologies. The library is tested using a thorough suite of microbenchmarks to establish latency and throughput performance. We also show an implementation of the Jacobi iterative method that demonstrates the ease with which applications can be moved between platforms to yield faster run times.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115374846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRGA: An Open-Source FPGA Research and Prototyping Framework","authors":"Ang Li, D. Wentzlaff","doi":"10.1145/3431920.3439294","DOIUrl":"https://doi.org/10.1145/3431920.3439294","url":null,"abstract":"Field Programmable Gate Arrays (FPGAs) are being used in a fast-growing range of scenarios, and heterogeneous CPU-FPGA systems are being tapped as a possible way to mitigate the challenges posed by the end of Moore's Law. This growth in diverse use cases has fueled the need to customize FPGA architectures for particular applications or application domains. While high-level FPGA models can help explore the FPGA architecture space, as FPGAs move to more advanced design nodes, there is an increased need for low-level FPGA research and prototyping platforms that can be brought all the way to fabrication. This paper presents Princeton Reconfigurable Gate Array (PRGA), a highly customizable, scalable, and complete open-source framework for building custom FPGAs. The framework's core functions include generating synthesizable Verilog from user-specified FPGA architectures, and providing a complete, auto-generated, open-source CAD toolchain for the custom FPGAs. Developed in Python, PRGA provides a user-friendly API and supports use both as a standalone FPGA and as an embedded FPGA. PRGA is a great platform for FPGA architecture research, FPGA configuration memory research, FPGA CAD tool research, and heterogeneous systems research. It is also a completely open-source framework for designers who need a free and customizable FPGA IP core. An FPGA designed with PRGA is placed and routed using standard cell libraries. 
The design is evaluated and compared to prior works, providing comparable performance and increased configurability.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131704628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"APCNN: Explore Multi-Layer Cooperation for CNN Optimization and Acceleration on FPGA","authors":"Beilei Jiang, Xianwei Cheng, Sihai Tang, Xu Ma, Zhaochen Gu, Hui Zhao, Song Fu","doi":"10.1145/3431920.3439461","DOIUrl":"https://doi.org/10.1145/3431920.3439461","url":null,"abstract":"In this paper, we introduce APCNN, which explores algorithm-hardware co-design and provides a CNN acceleration framework with multi-layer cooperative optimization and customized design on FPGA. In terms of the algorithm design, the pooling layer is moved before the non-linear activation function and normalization in APCNN, which we prove causes negligible accuracy loss; the pooling layer is then co-optimized with the convolutional layer by means of redundant multiplication elimination, local addition reuse, and global addition reuse. We further design a dedicated accelerator to take full advantage of convolutional-pooling cross-layer optimization to not only accelerate computation but also reduce on-off chip data communication on FPGA. We demonstrate that our novel APCNN can achieve 75% multiplication and 75% addition reduction in the best case. For on-off chip data communication, a max{Row,Col} /(Row x Col) percent of memory footprint can be eliminated, where Row and Col are the number of rows and columns in the activation feature map respectively. We have implemented a prototype of APCNN and evaluated its performance on LeNet-5 and VGG16 using both an accelerator-level cycle and energy model and an RTL implementation. Our experimental results show that APCNN achieves a 2.5× speedup and 4.7× energy efficiency compared with the dense CNN. 
(This research was supported in part by NSF grants CCF-1563750, OAC-2017564, and CNS-2037982.)","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126097107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}