Chanaka Ganewattha, Z. Khan, Janne J. Lehtomäki, M. Latva-aho
{"title":"Hardware-accelerated Real-time Drift-awareness for Robust Deep Learning on Wireless RF Data","authors":"Chanaka Ganewattha, Z. Khan, Janne J. Lehtomäki, M. Latva-aho","doi":"10.1145/3563394","DOIUrl":"https://doi.org/10.1145/3563394","url":null,"abstract":"Proactive and intelligent management of network resource utilization (RU) using deep learning (DL) can significantly improve the efficiency and performance of the next generation of wireless networks. However, variations in wireless RU are often affected by uncertain events and change points due to the deviations of real data distribution from that of the original training data. Such deviations, which are known as dataset drifts, can subsequently lead to a shift in the corresponding decision boundary degrading the DL model prediction performance. To address these challenges, we present hardware-accelerated real-time radio frequency (RF) analytics and drift-awareness modules for robust DL predictions. We have prototyped the proposed design on a Zynq-7000 System-on-Chip that contains an FPGA and an embedded ARM processor. We have used Xilinx Vivado design suite for synthesis and analysis of the HDL design for the proposed solution. To detect dataset drifts, the proposed solution adopts a distance-based technique on FPGA to quantify in real-time the change between the prediction distribution obtained from DL predictions and data distribution of input streaming samples. Using various performance metrics, we have extensively evaluated the performance of the proposed solution and shown that it can significantly improve the DL model robustness in the presence of dataset drifts.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 29"},"PeriodicalIF":2.3,"publicationDate":"2022-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42918939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Young-kyu Choi, Carlos Santillana, Yujia Shen, Adnan Darwiche, J. Cong
{"title":"FPGA Acceleration of Probabilistic Sentential Decision Diagrams with High-level Synthesis","authors":"Young-kyu Choi, Carlos Santillana, Yujia Shen, Adnan Darwiche, J. Cong","doi":"10.1145/3561514","DOIUrl":"https://doi.org/10.1145/3561514","url":null,"abstract":"Probabilistic Sentential Decision Diagrams (PSDDs) provide efficient methods for modeling and reasoning with probability distributions in the presence of massive logical constraints. PSDDs can also be synthesized from graphical models such as Bayesian networks (BNs) therefore offering a new set of tools for performing inference on these models (in time linear in the PSDD size). Despite these favorable characteristics of PSDDs, we have found multiple challenges in PSDD’s FPGA acceleration. Problems include limited parallelism, data dependency, and small pipeline iterations. In this article, we propose several optimization techniques to solve these issues with novel pipeline scheduling and parallelization schemes. We designed the PSDD kernel with a high-level synthesis (HLS) tool for ease of implementation and verified it on the Xilinx Alveo U250 board. Experimental results show that our methods improve the baseline FPGA HLS implementation performance by 2,200X and the multicore CPU implementation by 20X. The proposed design also outperforms state-of-the-art BN and Sum Product Network (SPN) accelerators that store the graph information in memory.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 22"},"PeriodicalIF":2.3,"publicationDate":"2022-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44311674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiang Li, Peter Stanwicks, George Provelengios, R. Tessier, Daniel E. Holcomb
{"title":"Jitter-based Adaptive True Random Number Generation Circuits for FPGAs in the Cloud","authors":"Xiang Li, Peter Stanwicks, George Provelengios, R. Tessier, Daniel E. Holcomb","doi":"10.1145/3487554","DOIUrl":"https://doi.org/10.1145/3487554","url":null,"abstract":"In this article, we present and evaluate a true random number generator (TRNG) design that is compatible with the restrictions imposed by cloud-based Field Programmable Gate Array (FPGA) providers such as Amazon Web Services (AWS) EC2 F1. Because cloud FPGA providers disallow the ring oscillator circuits that conventionally generate TRNG entropy, our design is oscillator-free and uses clock jitter as its entropy source. The clock jitter is harvested with a time-to-digital converter (TDC) and a controllable delay line that is continuously tuned to compensate for process, voltage, and temperature variations. After describing the design, we present and validate a stochastic model that conservatively quantifies its worst-case entropy. We deploy and model the design in the cloud on 60 EC2 F1 FPGA instances to ensure sufficient randomness is captured. TRNG entropy is further validated using NIST test suites, and experiments are performed to understand how the TRNG responds to on-die power attacks that disturb the FPGA supply voltage in the vicinity of the TRNG. After introducing and validating our basic TRNG design, we introduce and validate a new variant that uses four instances of a linkable sampling module to increase the entropy per sample and improve throughput. The new variant improves throughput by 250% at a modest 17% increase in CLB count.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 20"},"PeriodicalIF":2.3,"publicationDate":"2022-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49120995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ankita Nayak, Kecheng Zhang, Rajsekhar Setaluri, Alex Carsello, Makai Mann, Christopher Torng, S. Richardson, Rick Bahr, P. Hanrahan, M. Horowitz, Priyanka Raina
{"title":"Improving Energy Efficiency of CGRAs with Low-Overhead Fine-Grained Power Domains","authors":"Ankita Nayak, Kecheng Zhang, Rajsekhar Setaluri, Alex Carsello, Makai Mann, Christopher Torng, S. Richardson, Rick Bahr, P. Hanrahan, M. Horowitz, Priyanka Raina","doi":"10.1145/3558394","DOIUrl":"https://doi.org/10.1145/3558394","url":null,"abstract":"To effectively minimize static power for a wide range of applications, power domains for coarse-grained reconfigurable array (CGRA) architectures need to be more fine-grained than those found in a typical application-specific integrated circuit. However, the special isolation logic needed to ensure electrical protection between off and on domains makes fine-grained power domains area- and timing-inefficient. We propose a novel design of the CGRA routing fabric that reduces the area overhead of power domain boundary protection from around 9% to less than 1% without incurring any extra timing delay from the isolation cells. Conventional Unified Power Format based flow for power domain boundary protection does not support this design choice. Therefore, we create our own compiler-like passes that iteratively introduce the needed design changes, and formally verify the transformations using methods based on satisfiability modulo theories. These passes also let us optimize how we handle test and debug signals through the off tiles in the CGRA. Using our framework, we add power domains to a CGRA that we designed and taped out. The CGRA has 32 × 16 processing element and memory tiles and 4-MB secondary memory. We address the implementation challenges encountered due to the introduction of fine-grained power domains, including the addressing of the CGRA tiles, the power grid design, well substrate connections, and distribution of global signals. Our CGRA achieves up to 83% reduction in leakage power and 26% reduction in total power versus an identical CGRA without multiple power domains, for a range of image processing and machine learning applications.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 28"},"PeriodicalIF":2.3,"publicationDate":"2022-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48447378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xingyu Tian, Z. Ye, Alec Lu, Licheng Guo, Yuze Chi, Zhenman Fang Simon Fraser University, University of Electronic Science, Technology of China, U. California, Los Angeles
{"title":"SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs","authors":"Xingyu Tian, Z. Ye, Alec Lu, Licheng Guo, Yuze Chi, Zhenman Fang Simon Fraser University, University of Electronic Science, Technology of China, U. California, Los Angeles","doi":"10.1145/3572547","DOIUrl":"https://doi.org/10.1145/3572547","url":null,"abstract":"Stencil computation is one of the fundamental computing patterns in many application domains such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, there lacks an automated acceleration framework to systematically explore both spatial and temporal parallelisms for iterative stencils that could be either computation-bound or memory-bound. In this article, we present SASA, a scalable and automatic stencil acceleration framework on modern HBM-based FPGAs. SASA takes the high-level stencil DSL and FPGA platform as inputs, automatically exploits the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates the optimized FPGA design with the best parallelism configuration in TAPA high-level synthesis C++ as well as its corresponding host code. Compared to state-of-the-art automatic stencil acceleration framework SODA that only exploits temporal parallelism, SASA achieves an average speedup of 3.41× and up to 15.73× speedup on the HBM-based Xilinx Alveo U280 FPGA board for a wide range of stencil kernels.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 33"},"PeriodicalIF":2.3,"publicationDate":"2022-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42262376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shayan Moini, Aleksa Deric, Xiang Li, George Provelengios, W. Burleson, R. Tessier, Daniel E. Holcomb
{"title":"Voltage Sensor Implementations for Remote Power Attacks on FPGAs","authors":"Shayan Moini, Aleksa Deric, Xiang Li, George Provelengios, W. Burleson, R. Tessier, Daniel E. Holcomb","doi":"10.1145/3555048","DOIUrl":"https://doi.org/10.1145/3555048","url":null,"abstract":"This article presents a study of two types of on-chip FPGA voltage sensors based on ring oscillators (ROs) and time-to-digital converter (TDCs), respectively. It has previously been shown that these sensors are often used to extract side-channel information from FPGAs without physical access. The performance of the sensors is evaluated in the presence of circuits that deliberately waste power, resulting in localized voltage drops. The effects of FPGA power supply features and sensor sensitivity in detecting voltage drops in an FPGA power distribution network (PDN) are evaluated for Xilinx Artix-7, Zynq 7000, and Zynq UltraScale+ FPGAs. We show that both sensor types are able to detect supply voltage drops, and that their measurements are consistent with each other. Our findings show that TDC-based sensors are more sensitive and can detect voltage drops that are shorter in duration, while RO sensors are easier to implement because calibration is not required. Furthermore, we present a new time-interleaved TDC design that sweeps the sensor phase. The new sensor generates data that can reconstruct voltage transients on the order of tens of picoseconds.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 21"},"PeriodicalIF":2.3,"publicationDate":"2022-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43857322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Software-like Debugging for FPGAs via Checkpointing and Transaction-based Co-Simulation","authors":"Sameh Attia, V. Betz","doi":"10.1145/3552521","DOIUrl":"https://doi.org/10.1145/3552521","url":null,"abstract":"Checkpoint-based debugging flows have recently been developed that allow the user to move the design state back and forth between an FPGA and a simulator. They provide a softwarelike debugging experience by combining the speed of hardware execution and the full visibility of simulation. However, they assume the entire system state can be moved to a simulator, limiting them to self-contained systems. In this article, we present StateLink, a transaction-based co-simulation framework that allows part of the system (the task) to run in a simulator and still interact with other system components that reside in hardware. StateLink allows tasks to remain connected to and active in the overall hardware system after their state is moved to a simulator. This extends the functionality of checkpoint-based debugging frameworks to designs with external I/Os and significantly speeds up the simulation of tasks that are part of a large system. StateLink typically adds no timing overhead and a modest hardware area overhead. The total area overhead of using the proposed flow on a Memcached system is only 13%. This flow allows the user to benefit from both the hardware speedup of ∼1M× and the StateLink speedup of up to 44× versus full system simulation.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 24"},"PeriodicalIF":2.3,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45709582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soheil Nazar Shahsavani, A. Fayyazi, M. Nazemi, M. Pedram
{"title":"Efficient Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors Targeting Neural Network Inference and Utilizing High-level Synthesis","authors":"Soheil Nazar Shahsavani, A. Fayyazi, M. Nazemi, M. Pedram","doi":"10.1145/3559543","DOIUrl":"https://doi.org/10.1145/3559543","url":null,"abstract":"Recent efforts for improving the performance of neural network (NN) accelerators that meet today’s application requirements have given rise to a new trend of logic-based NN inference relying on fixed function combinational logic. Mapping such large Boolean functions with many input variables and product terms to digital signal processors (DSPs) on Field-programmable gate arrays (FPGAs) needs a novel framework considering the structure and reconfigurability of DSP blocks during this process. The proposed methodology in this article maps the fixed function combinational logic blocks to a set of Boolean functions where Boolean operations corresponding to each function are mapped to DSP devices rather than look-up tables on the FPGAs to take advantage of the high performance, low latency, and parallelism of DSP blocks. This article also presents an innovative design and optimization methodology for compilation and mapping of NNs, utilizing fixed function combinational logic to DSPs on FPGAs employing high-level synthesis flow. Our experimental evaluations across several datasets and selected NNs demonstrate the comparable performance of our framework in terms of the inference latency and output accuracy compared to prior art FPGA-based NN accelerators employing DSPs.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 25"},"PeriodicalIF":2.3,"publicationDate":"2022-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48600753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Many-core Overlay Architecture on an HBM2-enabled Multi-Die FPGA","authors":"Riadh Ben Abdelhamid, Y. Yamaguchi, T. Boku","doi":"10.1145/3547657","DOIUrl":"https://doi.org/10.1145/3547657","url":null,"abstract":"The overlay architecture enables to raise the abstraction level of hardware design and enhances hardware-accelerated applications’ portability. In FPGAs, there is a growing awareness of the overlay structure as typified by many-core architecture. It works in theory; however, it is difficult in practice, because it is beset with serious design issues. For example, the size of FPGAs is bigger than before. It is exacerbating the issue of the place-and-route. Besides, a single FPGA is actually the sum of small-to-middle FPGAs by advancing packaging technology like silicon interposers. Thus, the tightly coupled many-core designs will face this covert issue that the wires among the regions are extremely restricted. This article proposes efficient essential processing elements, micro-architecture design, and the interconnect architecture toward a scalable many-core overlay design. In particular, our work proposes a novel compact buffering technique to reduce memory resource utilization in tightly connected overlays while preserving computational efficiency. This technique reduces the utilization of BlockRAM to nearly 50% while achieving a best-case computational efficiency of 91.93% in a three-dimensional Jacobi benchmark. Besides, the proposed enhancements led to around 2× and 3× improvement in performance and power efficiency, respectively. Moreover, the improved scalability allowed increasing compute resources and delivering around 4× better performance and power efficiency, as compared to the baseline Dynamically Re-programmable Architecture of Gather-scatter Overlay Nodes overlay.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 33"},"PeriodicalIF":2.3,"publicationDate":"2022-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46212356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Near-memory Computing on FPGAs with 3D-stacked Memories: Applications, Architectures, and Optimizations","authors":"Veronia Iskandar, M. A. E. Ghany, Diana Göhringer","doi":"10.1145/3547658","DOIUrl":"https://doi.org/10.1145/3547658","url":null,"abstract":"The near-memory computing (NMC) paradigm has transpired as a promising method for overcoming the memory wall challenges of future computing architectures. Modern systems integrating 3D-stacked DRAM memory can be leveraged to prevent unnecessary data movement between the main memory and the CPU. FPGA vendors have started introducing 3D memories to their products in an effort to remain competitive on bandwidth requirements of modern memory-intensive applications. Recent NMC proposals target various types of data processing workloads such as graph processing, MapReduce, sorting, machine learning, and database analytics. In this article, we conduct a literature survey on previous proposals of NMC systems on FPGAs integrated with 3D memories. By leveraging the high bandwidth offered from such memories together with specifically designed hardware, FPGA architectures have become a competitor to GPU solutions in terms of speed and energy efficiency. Various FPGA-based NMC designs have been proposed with software and hardware optimization methods to achieve high performance and energy efficiency. Our review investigates various aspects of NMC designs such as platforms, architectures, workloads, and tools. We identify the key challenges and open issues with future research directions.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 32"},"PeriodicalIF":2.3,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45842961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}