{"title":"Design Exploration of RISC-V Soft-Cores through Speculative High-Level Synthesis","authors":"Jean-Michel Gorius, Simon Rokicki, Steven Derrien","doi":"10.1109/ICFPT56656.2022.9974478","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974478","url":null,"abstract":"The RISC- V ecosystem is quickly growing and has gained a lot of traction in the FPGA community, as it permits free customization of both ISA and micro- architectural features. However, the design of the cor- responding micro-architecture is costly and error-prone. We address this issue by providing a flow capable of automatically synthesizing pipelined micro-architectures directly from an Instruction Set Simulator in C/C++. Our flow is based on HLS technology and bridges part of the gap between Instruction Set Processor design flows and High- Level Synthesis tools by taking advantage of speculative loop pipelining. Our results show that our flow is general enough to support a variety of ISA and micro-architectural extensions, and is capable of producing circuits that are competitive with manually designed cores.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124475850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-efficient RMT Matching Optimization Based on MBitTree","authors":"Zhongpei Liu, Gaofeng Lv, Jichang Wang, Xiangrui Yang","doi":"10.1109/ICFPT56656.2022.9974307","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974307","url":null,"abstract":"Reconfigurable match tables (RMT) is a pro-grammable pipeline architecture for packet processing. The ar-chitecture searches for action instructions by matching keywords in the packet header vector to modify the packet header. Among them, exact matching uses hash matching, while mask matching is currently more widely implemented using the Ternary Content Addressable Memory (TCAM). TCAM has high classification performance, but its high cost and power consumption make it difficult to scale to large-scale rule sets. MBitTree, a decision tree based on multi-bit cutting implemented on FPGA, is considered to be one of the most scalable packet classification algorithms due to its fast classification speed and low memory footprint. Therefore, MBitTree is applied in the matching action stage of RMT to improve the mask matching and reduce the memory overhead of RMT. According to the characteristics of RMT pipeline, MBitTree is mapped and optimized to improve pipeline efficiency and make full use of hardware resources. In addition, for the first time, we propose to move the key extractor in each stage of RMT to the action engine of the previous stage to save the memory overhead and processing time caused by the key extractor in each stage. We implement a prototype RMT based on MBitTree matching on FPGA, and the implementation results show that our method can achieve a throughput of over 200 Gbps for 10K rule sets and greatly reduce the memory overhead.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121695309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NetPU: Prototyping a Generic Reconfigurable Neural Network Accelerator Architecture","authors":"Yuhao Liu, Shubham Rai, Salim Ullah, Akash Kumar","doi":"10.1109/ICFPT56656.2022.9974206","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974206","url":null,"abstract":"FPGA-based Neural Network (NN) accelerator is a rapidly advancing subject in recent research. Related works can be classified as two hardware architectures: i) Heterogeneous Streaming Dataflow (HSD) architecture and ii) Processing Element Matrix (PEM) architecture. HSD architecture explores the reconfigurability of FPGAs to support the customization and optimization of hardware design to implement a complete network on FPGA for one given trained model. PEM architecture achieves relatively generic support for different network models, essentially implementing the neuron processing modules on the FPGA scheduled by the runtime software environment. In summary, the HSD architecture requires more resources with simplified runtime software control. The PEM architecture consumes fewer resources than the HSD architecture. However, the runtime software environment can be a heavy payload for lightweight systems, such as the low-power microcontroller of IoT or edge devices.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122041440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training","authors":"Mariko Tatsumi, Silviu-Ioan Filip, Caroline White, O. Sentieys, G. Lemieux","doi":"10.1109/ICFPT56656.2022.9974324","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974324","url":null,"abstract":"The most compute-intensive stage of deep neural network (DNN) training is matrix multiplication where the multiply-accumulate (MAC) operator is key. To reduce training costs, we consider using low-precision arithmetic for MAC operations. While low-precision training has been investigated in prior work, the focus has been on reducing the number of bits in weights or activations without compromising accuracy. In contrast, the focus in this paper is on implementation details beyond weight or activation width that affect area and accuracy. In particular, we investigate the impact of fixed- versus floating-point representations, multiplier rounding, and floating-point exceptional value support. Results suggest that (1) low-precision floating-point is more area-effective than fixed-point for multiplication, (2) standard IEEE-754 rules for subnormals, NaNs, and intermediate rounding serve little to no value in terms of accuracy but contribute significantly to area, (3) low-precision MACs require an adaptive loss-scaling step during training to compensate for limited representation range, and (4) fixed-point is more area-effective for accumulation, but the cost of format conversion and downstream logic can swamp the savings. Finally, we note that future work should investigate accumulation structures beyond the MAC level to achieve further gains.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123882557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESSPER: Elastic and Scalable System for High-Performance Reconfigurable Computing with Software-bridged APIs","authors":"K. Sano, Atsushi Koshiba, Takaaki Miyajima, Tomohiro Ueno","doi":"10.1109/ICFPT56656.2022.9974312","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974312","url":null,"abstract":"Many-core CPUs and GPUs, present mainstream architectures for HPC, are facing difficulty in maintaining the same performance improvement rate because of the recent slow-down in the semiconductor scaling, the dark silicon problem, and wasteful mechanisms required for accelerating general-purpose computing such as a branch predictor and an out-of-order mechanism. Also, the power efficiency of HPC systems is significantly important to achieve higher performance.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123901530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel CRC On An FPGA At Terabit Speeds","authors":"Q. Shen, Juan Camilo Vega, P. Chow","doi":"10.1109/ICFPT56656.2022.9974233","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974233","url":null,"abstract":"The Cyclic Redundancy Check Algorithm (CRC) is critical for ensuring high data reliability in serial communication such as Ethernet networks, allowing for the detection of corrupted packets with a programmable and arbitrarily small probability of failure. The baseline algorithm, however, is highly serialized due to read after write (RAW) dependencies, preventing efficient parallelization of the algorithm for use in hardware. We built a fully parameterizable open-source IP core that has no such dependencies to produce the equivalent result as the baseline CRC algorithm but in a form that can be fully parallelized, with fully automated pipelining, which works for any CRC polynomial, and with a low-resource end-of-packet alignment. This allows for up to 64-bit CRC to be computed in an FPGA at 4 Tbps.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129680555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Desgin and Implementation of ROS2-based Autonomous Tiny Robot Car with Integration of Multiple ROS2 FPGA Nodes","authors":"Hayato Mori, Hayato Amano, Akinobu Mizutani, Eisuke Okazaki, Yuki Konno, Kohei Sada, Tomohiro Ono, Yuma Yoshimoto, H. Tamukoh, Takeshi Ohkawa, Midori Sugaya","doi":"10.1109/ICFPT56656.2022.9974433","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974433","url":null,"abstract":"This paper introduces an autonomous tiny robot car equipped with a camera-based lane detection function and a traffic signal/obstacle, pedestrian recognition function. Each function is integrated by Robot Operating System 2 (ROS2), a middleware for robot system development. Autonomous driving without the need for a driver requires not only lane-following driving but also traffic signal recognition and obstacle recognition. These functions are implemented on FPGA, and we evaluated them. According to these results, the execution time of traffic signal recognition by FPGA was 1.2 to 3.4 times faster than CPU execution. YOLOv4 is used for obstacle recognition, which improved mAP by 3.79 points compared to YOLO v3-Tiny.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"83 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128924166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SALIENT: Ultra-Fast FPGA-based Short Read Alignment","authors":"Behnam Khaleghi, Tianqi Zhang, C. Martino, George Armstrong, Ameen Akel, Ken Curewitz, Justin Eno, S. Eilert, Rob Knight, Niema Moshiri, Tajana Rosing","doi":"10.1109/ICFPT56656.2022.9974548","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974548","url":null,"abstract":"State-of-the-art high-throughput DNA sequencers output terabytes of short reads that typically need to be aligned to a reference genome in order to perform downstream analyses. Because alignment typically dominates the total run time of bioinformatics pipelines, a number of recent work sought to accelerate it in hardware. However, existing FPGA implemen-tations did not fully optimize the alignment algorithms for the FPGA hardware and mainly focused on a subset of alignment problems, e.g., ungapped alignment with a limited number of mismatches, which hinder their practical utility. In this work, we analyze the existing alignment methods and identify and leverage opportunities for FPGA acceleration. Our alignment framework, SALIENT, first carries out an ultra-fast ungapped alignment, which supports a flexible number of mismatches. Based on the underlying bioinformatics pipeline and the information provided by the ungapped aligner, SALIENT then identifies a fraction of reads that need to go through its gapped aligner, thus improving alignment throughput. We extensively evaluate SALIENT using diverse datasets. Experimental results indicate that SALIENT, running on a single Xilinx Alveo U280 device, delivers an average throughput of 546 million bases/second, outperforming the state- of-the-art minimap2 software by 40x, and Bowtie2 by up to 107 x, with a similar or slightly better (~O.l %-0.5 %) alignment and error (false negative/positive) rate. Compared to the existing ungapped FPGA aligners [1]–[4], SALIENT has 9.4-18x higher throughput/Watt, while compared to the gapped aligners [5], [6], it is 28–35 x better. SALIENT achieves 7.6 x higher throughput than Illumina DRAGEN Bio-IT Platform [7].","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127761493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HPIPE NX: Boosting CNN Inference Acceleration Performance with AI-Optimized FPGAs","authors":"Marius Stan, Mathew Hall, M. Ibrahim, Vaughn Betz","doi":"10.1109/ICFPT56656.2022.9974441","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974441","url":null,"abstract":"With the ever-increasing compute demands of artificial intelligence (AI) workloads, there is extensive interest in leveraging field-programmable gate-arrays (FPGAs) to quickly deploy hardware accelerators for the latest convolutional neural networks (CNNs). Recent FPGA architectures are also evolving to better serve the needs of AI, but accelerators need extensive re-design to leverage these new features. The Stratix 10 NX chip by Intel is a new FPGA that replaces traditional DSP blocks with in-fabric AI tensor blocks that provide 15x more multipliers and up to 143 TOPS of performance, at the cost of lower precision (INT8) and significant restrictions on how many operands can be fed to the multipliers from the programmable routing. In this paper, we explore different CNN accelerator structures to leverage the tensor blocks, considering the various tensor block modes, operand bandwidth restrictions, and on-chip memory restrictions. We incorporate the most performant techniques into HPIPE, a layer-pipelined and sparse-aware CNN accelerator for FPGAs. We enhance HPIPE's software compiler to restructure the CNN computations and on-chip memory layout to take advantage of the additional multipliers offered by the new tensor block architecture, while also avoiding stalls due to data loading restrictions. We achieve cycle-by-cycle speedups in tensor mode of up to $mathbf{8}.mathbf{3}mathbf{x}$ for Mobilenet-v1 versus the original HPIPE design using conventional DSPs. On the FPGA, we achieve a throughput of 28,541 and 29,429 images/s on Mobilenet-v1 and Mobilenet-v2 respectively, outperforming all previous FPGA accelerators by at least 4.0x, including one on an AI-optimized Xilinx chip. We also outperform NVIDIA's V100 GPU, a machine learning targeted GPU on a similar process node with a $mathbf{1}.mathbf{7}mathbf{x}$ larger die size, by up to 17x with a batch size of one and 1.3x with NVIDIA's largest reported batch size of 128.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130612763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}