Gagandeep Singh, D. Diamantopoulos, Juan G'omez-Luna, C. Hagleitner, S. Stuijk, H. Corporaal, O. Mutlu
{"title":"Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric","authors":"Gagandeep Singh, D. Diamantopoulos, Juan G'omez-Luna, C. Hagleitner, S. Stuijk, H. Corporaal, O. Mutlu","doi":"10.1145/3501804","DOIUrl":"https://doi.org/10.1145/3501804","url":null,"abstract":"Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an field-programmable gate array+HBM-based accelerator connected through Open Coherent Accelerator Processor Interface to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by ( 5.3times ) and ( 12.7times ) when running two different compound stencil kernels. NERO reduces the energy consumption by ( 12times ) and ( 35times ) for the same two kernels over the POWER9 system with an energy efficiency of 1.61 GFLOPS/W and 21.01 GFLOPS/W. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133801540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shenghsun Cho, Mrunal Patel, M. Ferdman, Peter Milder
{"title":"Practical Model Checking on FPGAs","authors":"Shenghsun Cho, Mrunal Patel, M. Ferdman, Peter Milder","doi":"10.1145/3448272","DOIUrl":"https://doi.org/10.1145/3448272","url":null,"abstract":"Software verification is an important stage of the software development process, particularly for mission-critical systems. As the traditional methodology of using unit tests falls short of verifying complex software, developers are increasingly relying on formal verification methods, such as explicit state model checking, to automatically verify that the software functions properly. However, due to the ever-increasing complexity of software designs, model checking cannot be performed in a reasonable amount of time when running on general-purpose cores, leading to the exploration of hardware-accelerated model checking. FPGAs have been demonstrated to be promising verification accelerators, exhibiting nearly three orders of magnitude speedup over software. Unfortunately, the “FPGA programmability wall,” particularly the long synthesis and place-and-route times, block the general adoption of FPGAs for model checking. To address this problem, we designed a runtime-programmable pipeline specifically for model checkers on FPGAs to minimize the “preparation time” before a model can be checked. Our design of the successor state generator and the state validator modules enables FPGA-acceleration of model checking without incurring the time-consuming FPGA implementation stages, reducing the preparation time before checking a model from hours to less than a minute, while incurring only a 26% execution time overhead compared to model-specific implementations.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126710393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of Distributed Reconfigurable Robotics Systems with ReconROS","authors":"Christian Lienen, M. Platzner","doi":"10.1145/3494571","DOIUrl":"https://doi.org/10.1145/3494571","url":null,"abstract":"Robotics applications process large amounts of data in real time and require compute platforms that provide high performance and energy efficiency. FPGAs are well suited for many of these applications, but there is a reluctance in the robotics community to use hardware acceleration due to increased design complexity and a lack of consistent programming models across the software/hardware boundary. In this article, we present ReconROS, a framework that integrates the widely used robot operating system (ROS) with ReconOS, which features multithreaded programming of hardware and software threads for reconfigurable computers. This unique combination gives ROS 2 developers the flexibility to transparently accelerate parts of their robotics applications in hardware. We elaborate on the architecture and the design flow for ReconROS and report on a set of experiments that underline the feasibility and flexibility of our approach.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115461741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Software/Hardware Co-Design of Crystals-Dilithium Signature Scheme","authors":"Zhende Zhou, D. He, Zhe Liu, Min Luo, K. Choo","doi":"10.1145/3447812","DOIUrl":"https://doi.org/10.1145/3447812","url":null,"abstract":"As quantum computers become more affordable and commonplace, existing security systems that are based on classical cryptographic primitives, such as RSA and Elliptic Curve Cryptography (ECC), will no longer be secure. Hence, there has been interest in designing post-quantum cryptographic (PQC) schemes, such as those based on lattice-based cryptography (LBC). The potential of LBC schemes is evidenced by the number of such schemes passing the selection of NIST PQC Standardization Process Round-3. One such scheme is the Crystals-Dilithium signature scheme, which is based on the hard module-lattice problem. However, there is no efficient implementation of the Crystals-Dilithium signature scheme. Hence, in this article, we present a compact hardware architecture containing elaborate modular multiplication units using the Karatsuba algorithm along with smart generators of address sequence and twiddle factors for NTT, which can complete polynomial addition/multiplication with the parameter setting of Dilithium in a short clock period. Also, we propose a fast software/hardware co-design implementation on Field Programmable Gate Array (FPGA) for the Dilithium scheme with a tradeoff between speed and resource utilization. Our co-design implementation outperforms a pure C implementation on a Nios-II processor of the platform Altera DE2-115, in the sense that our implementation is 11.2 and 7.4 times faster for signature and verification, respectively. In addition, we also achieve approximately 51% and 31% speed improvement for signature and verification, in comparison to the pure C implementation on processor ARM Cortex-A9 of ZYNQ-7020 platform.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124462392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adriaan Peetermans, Vladimir Rožić, I. Verbauwhede
{"title":"Design and Analysis of Configurable Ring Oscillators for True Random Number Generation Based on Coherent Sampling","authors":"Adriaan Peetermans, Vladimir Rožić, I. Verbauwhede","doi":"10.1145/3433166","DOIUrl":"https://doi.org/10.1145/3433166","url":null,"abstract":"True Random Number Generators (TRNGs) are indispensable in modern cryptosystems. Unfortunately, to guarantee high entropy of the generated numbers, many TRNG designs require a complex implementation procedure, often involving manual placement and routing. In this work, we introduce, analyse, and compare three dynamic calibration mechanisms for the COherent Sampling ring Oscillator based TRNG: GateVar, WireVar, and LUTVar, enabling easy integration of the entropy source into complex systems. The TRNG setup procedure automatically selects a configuration that guarantees the security requirements. In the experiments, we show that two out of the three proposed mechanisms are capable of assuring correct TRNG operation even when an automatic placement is carried out and when the design is ported to another Field-Programmable Gate Array (FPGA) family. We generated random bits on both a Xilinx Spartan 7 and a Microsemi SmartFusion2 implementation that, without post processing, passed the AIS-31 statistical tests at a throughput of 4.65 Mbit/s and 1.47 Mbit/s, respectively.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"369 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126709337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Hogervorst, R. Nane, G. Marchiori, T. Qiu, Markus Blatt, A. Rustad
{"title":"Hardware Acceleration of High-Performance Computational Flow Dynamics Using High-Bandwidth Memory-Enabled Field-Programmable Gate Arrays","authors":"T. Hogervorst, R. Nane, G. Marchiori, T. Qiu, Markus Blatt, A. Rustad","doi":"10.1145/3476229","DOIUrl":"https://doi.org/10.1145/3476229","url":null,"abstract":"Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because of the utmost importance to simulate increasingly larger computational models, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. Field-Programmable Gate Arrays could accelerate scientific computing because of the possibility to fully customize the memory hierarchy important in irregular applications such as iterative linear solvers. In this article, we study the potential of using Field-Programmable Gate Arrays in High-Performance Computing because of the rapid advances in reconfigurable hardware, such as the increase in on-chip memory size, increasing number of logic cells, and the integration of High-Bandwidth Memories on board. To perform this study, we propose a novel Sparse Matrix-Vector multiplication unit and an ILU0 preconditioner tightly integrated with a BiCGStab solver kernel. We integrate the developed preconditioned iterative solver in Flow from the Open Porous Media project, a state-of-the-art open source reservoir simulator. Finally, we perform a thorough evaluation of the FPGA solver kernel in both stand-alone mode and integrated in the reservoir simulator, using the NORNE field, a real-world case reservoir model using a grid with more than 105 cells and using three unknowns per cell.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121024971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-scale Cellular Automata on FPGAs","authors":"Nikolaos Kyparissas, A. Dollas","doi":"10.1145/3423185","DOIUrl":"https://doi.org/10.1145/3423185","url":null,"abstract":"Cellular automata (CA) are discrete mathematical models discovered in the 1940s by John von Neumann and Stanislaw Ulam and have been used extensively in many scientific disciplines ever since. The present work evolved from a Field Programmable Gate Array– (FPGA) based design to simulate urban growth into a generic architecture that is automatically generated by a framework to efficiently compute complex cellular automata with large 29 × 29 neighborhoods in Cartesian or toroidal grids, with 16 or 256 states per cell. The new architecture and the framework are presented in detail, including results in terms of modeling capabilities and performance. Large neighborhoods greatly enhance CA modeling capabilities, such as the implementation of anisotropic rules. Performance-wise, the proposed architecture runs on a medium-size FPGA up to 51 times faster vs. a CPU running highly optimized C code. Compared to GPUs the speedup is harder to quantify, because CA results have been reported on GPU implementations with neighborhoods up to 11 × 11, in which case FPGA performance is roughly on par with GPU; however, based on published GPU trends, for 29 × 29 neighborhoods the proposed architecture is expected to have better performance vs. a GPU, at one-10th the energy requirements. The architecture and sample designs are open source available under the creative commons license.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"287 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125880281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PipeArch","authors":"Kaan Kara, G. Alonso","doi":"10.1145/3418465","DOIUrl":"https://doi.org/10.1145/3418465","url":null,"abstract":"Data processing systems based on FPGAs offer high performance and energy efficiency for a variety of applications. However, these advantages are achieved through highly specialized designs. The high degree of specialization leads to accelerators with narrow functionality and designs adhering to a rigid execution flow. For multi-tenant systems this limits the scope of applicability of FPGA-based accelerators, because, first, supporting a single operation is unlikely to have any significant impact on the overall performance of the system, and, second, serving multiple users satisfactorily is difficult due to simplistic scheduling policies enforced when using the accelerator. Standard operating system and database management system features that would help address these limitations, such as context-switching, preemptive scheduling, and thread migration are practically non-existent in current FPGA accelerator efforts. In this work, we propose PipeArch, an open-source project1 for developing FPGA-based accelerators that combine the high efficiency of specialized hardware designs with the generality and functionality known from conventional CPU threads. PipeArch provides programmability and extensibility in the accelerator without losing the advantages of SIMD-parallelism and deep pipelining. PipeArch supports context-switching and thread migration, thereby enabling for the first time new capabilities such as preemptive scheduling in FPGA accelerators within a high-performance data processing setting. We have used PipeArch to implement a variety of machine learning methods for generalized linear model training and recommender systems showing empirically their advantages over a high-end CPU and even over fully specialized FPGA designs.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117280362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An OpenGL Compliant Hardware Implementation of a Graphic Processing Unit Using Field Programmable Gate Array–System on Chip Technology","authors":"Alexander E. Beasley, C. Clarke, R. Watson","doi":"10.1145/3410357","DOIUrl":"https://doi.org/10.1145/3410357","url":null,"abstract":"FPGA-SoC technology provides a heterogeneous platform for advanced, high-performance systems. The System on Chip (SoC) architecture combines traditional single and multiple core processor topologies with flexible FPGA fabric. Dynamic reconfiguration allows the hardware accelerators to be changed at run-time. This article presents a novel OpenGL compliant GPU design implemented on an FPGA. The design uses an FPGA-SoC environment allowing the embedded processor to offload graphics operation onto a more suitable architecture. To the authors’ knowledge, this is a first. The graphics processor consists of GLSL compliant shaders, an efficient Barycentric Rasterizer, and a draw mode manager. Performance analysis shows the throughput of the shaders to be hundreds of millions of vertices per second. The design uses both pipelining and resource reuse to optimise throughput and resource use, allowing implementation on a low-cost, FPGA device. Pixel processing rates from this implementation are almost 80% higher than other FPGA implementations. Power consumption compared with comparative embedded devices shows the FPGA consuming as little as 2% of the power of a Mali device, and an up to 11.9-fold increase in efficiency compared to an Nvidia RTX 2060 - Turing architecture device.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"37 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123658993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to Special Section on FCCM 2019","authors":"A. DeHon","doi":"10.1145/3410373","DOIUrl":"https://doi.org/10.1145/3410373","url":null,"abstract":"The International Symposium on Field-Programmable Custom Computing Machines (FCCM) is the original and premier forum for presenting and discussing new research related to computing that exploits the unique features and capabilities of field-programmable gate arrays (FPGAs) and other reconfigurable hardware. Since its first meeting in 1993, FCCM has been the place to present the latest work in architectures, tools, and programming models for field-programmable custom computing machines as well as applications that use such systems. The 27th FCCM met April 28 to May 1, 2019 on the campus of UCSD in San Diego, California. In this special ACM TRETS section, it is our pleasure to present extended versions of a selected set of 2 of the 31 full and 7 short papers accepted and presented at FCCM 2019. With the end of processor frequency scaling and the end of Dennard Scaling, the broader architecture community is starting to acknowledge the role of “accelerators” and applicationcustomized parallel compute engines. FCCM has highlighted and nurtured the body-of-knowledge and tools for these application-customized accelerators throughout its history. In this special section, we present two article that provide tools to address the growing challenges in large-scale FPGAs and computing systems. Both come with open-source releases for their code so they also contribute to the technical infrastructure and building blocks that allow our community to collaboratively advance the field. One of the common complaints holding back FPGA computing is the slow runtime for the basic mapping tools. Compared to processors and GPUs, the hour-long runtimes for FPGA mapping serve to discourage potential adopters. As FPGAs grow in capacity with Moore’s Law, FPGA mapping times have continued to grow. Routing is often one of the large components of FPGA","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125428085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}