{"title":"Design Exploration of RISC-V Soft-Cores through Speculative High-Level Synthesis","authors":"Jean-Michel Gorius, Simon Rokicki, Steven Derrien","doi":"10.1109/ICFPT56656.2022.9974478","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974478","url":null,"abstract":"The RISC- V ecosystem is quickly growing and has gained a lot of traction in the FPGA community, as it permits free customization of both ISA and micro- architectural features. However, the design of the cor- responding micro-architecture is costly and error-prone. We address this issue by providing a flow capable of automatically synthesizing pipelined micro-architectures directly from an Instruction Set Simulator in C/C++. Our flow is based on HLS technology and bridges part of the gap between Instruction Set Processor design flows and High- Level Synthesis tools by taking advantage of speculative loop pipelining. Our results show that our flow is general enough to support a variety of ISA and micro-architectural extensions, and is capable of producing circuits that are competitive with manually designed cores.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124475850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-efficient RMT Matching Optimization Based on MBitTree","authors":"Zhongpei Liu, Gaofeng Lv, Jichang Wang, Xiangrui Yang","doi":"10.1109/ICFPT56656.2022.9974307","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974307","url":null,"abstract":"Reconfigurable match tables (RMT) is a pro-grammable pipeline architecture for packet processing. The ar-chitecture searches for action instructions by matching keywords in the packet header vector to modify the packet header. Among them, exact matching uses hash matching, while mask matching is currently more widely implemented using the Ternary Content Addressable Memory (TCAM). TCAM has high classification performance, but its high cost and power consumption make it difficult to scale to large-scale rule sets. MBitTree, a decision tree based on multi-bit cutting implemented on FPGA, is considered to be one of the most scalable packet classification algorithms due to its fast classification speed and low memory footprint. Therefore, MBitTree is applied in the matching action stage of RMT to improve the mask matching and reduce the memory overhead of RMT. According to the characteristics of RMT pipeline, MBitTree is mapped and optimized to improve pipeline efficiency and make full use of hardware resources. In addition, for the first time, we propose to move the key extractor in each stage of RMT to the action engine of the previous stage to save the memory overhead and processing time caused by the key extractor in each stage. We implement a prototype RMT based on MBitTree matching on FPGA, and the implementation results show that our method can achieve a throughput of over 200 Gbps for 10K rule sets and greatly reduce the memory overhead.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121695309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NetPU: Prototyping a Generic Reconfigurable Neural Network Accelerator Architecture","authors":"Yuhao Liu, Shubham Rai, Salim Ullah, Akash Kumar","doi":"10.1109/ICFPT56656.2022.9974206","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974206","url":null,"abstract":"FPGA-based Neural Network (NN) accelerator is a rapidly advancing subject in recent research. Related works can be classified as two hardware architectures: i) Heterogeneous Streaming Dataflow (HSD) architecture and ii) Processing Element Matrix (PEM) architecture. HSD architecture explores the reconfigurability of FPGAs to support the customization and optimization of hardware design to implement a complete network on FPGA for one given trained model. PEM architecture achieves relatively generic support for different network models, essentially implementing the neuron processing modules on the FPGA scheduled by the runtime software environment. In summary, the HSD architecture requires more resources with simplified runtime software control. The PEM architecture consumes fewer resources than the HSD architecture. However, the runtime software environment can be a heavy payload for lightweight systems, such as the low-power microcontroller of IoT or edge devices.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122041440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training","authors":"Mariko Tatsumi, Silviu-Ioan Filip, Caroline White, O. Sentieys, G. Lemieux","doi":"10.1109/ICFPT56656.2022.9974324","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974324","url":null,"abstract":"The most compute-intensive stage of deep neural network (DNN) training is matrix multiplication where the multiply-accumulate (MAC) operator is key. To reduce training costs, we consider using low-precision arithmetic for MAC operations. While low-precision training has been investigated in prior work, the focus has been on reducing the number of bits in weights or activations without compromising accuracy. In contrast, the focus in this paper is on implementation details beyond weight or activation width that affect area and accuracy. In particular, we investigate the impact of fixed- versus floating-point representations, multiplier rounding, and floating-point exceptional value support. Results suggest that (1) low-precision floating-point is more area-effective than fixed-point for multiplication, (2) standard IEEE-754 rules for subnormals, NaNs, and intermediate rounding serve little to no value in terms of accuracy but contribute significantly to area, (3) low-precision MACs require an adaptive loss-scaling step during training to compensate for limited representation range, and (4) fixed-point is more area-effective for accumulation, but the cost of format conversion and downstream logic can swamp the savings. Finally, we note that future work should investigate accumulation structures beyond the MAC level to achieve further gains.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123882557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ESSPER: Elastic and Scalable System for High-Performance Reconfigurable Computing with Software-bridged APIs","authors":"K. Sano, Atsushi Koshiba, Takaaki Miyajima, Tomohiro Ueno","doi":"10.1109/ICFPT56656.2022.9974312","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974312","url":null,"abstract":"Many-core CPUs and GPUs, present mainstream architectures for HPC, are facing difficulty in maintaining the same performance improvement rate because of the recent slow-down in the semiconductor scaling, the dark silicon problem, and wasteful mechanisms required for accelerating general-purpose computing such as a branch predictor and an out-of-order mechanism. Also, the power efficiency of HPC systems is significantly important to achieve higher performance.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123901530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel CRC On An FPGA At Terabit Speeds","authors":"Q. Shen, Juan Camilo Vega, P. Chow","doi":"10.1109/ICFPT56656.2022.9974233","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974233","url":null,"abstract":"The Cyclic Redundancy Check Algorithm (CRC) is critical for ensuring high data reliability in serial communication such as Ethernet networks, allowing for the detection of corrupted packets with a programmable and arbitrarily small probability of failure. The baseline algorithm, however, is highly serialized due to read after write (RAW) dependencies, preventing efficient parallelization of the algorithm for use in hardware. We built a fully parameterizable open-source IP core that has no such dependencies to produce the equivalent result as the baseline CRC algorithm but in a form that can be fully parallelized, with fully automated pipelining, which works for any CRC polynomial, and with a low-resource end-of-packet alignment. This allows for up to 64-bit CRC to be computed in an FPGA at 4 Tbps.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129680555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Desgin and Implementation of ROS2-based Autonomous Tiny Robot Car with Integration of Multiple ROS2 FPGA Nodes","authors":"Hayato Mori, Hayato Amano, Akinobu Mizutani, Eisuke Okazaki, Yuki Konno, Kohei Sada, Tomohiro Ono, Yuma Yoshimoto, H. Tamukoh, Takeshi Ohkawa, Midori Sugaya","doi":"10.1109/ICFPT56656.2022.9974433","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974433","url":null,"abstract":"This paper introduces an autonomous tiny robot car equipped with a camera-based lane detection function and a traffic signal/obstacle, pedestrian recognition function. Each function is integrated by Robot Operating System 2 (ROS2), a middleware for robot system development. Autonomous driving without the need for a driver requires not only lane-following driving but also traffic signal recognition and obstacle recognition. These functions are implemented on FPGA, and we evaluated them. According to these results, the execution time of traffic signal recognition by FPGA was 1.2 to 3.4 times faster than CPU execution. YOLOv4 is used for obstacle recognition, which improved mAP by 3.79 points compared to YOLO v3-Tiny.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"83 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128924166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SALIENT: Ultra-Fast FPGA-based Short Read Alignment","authors":"Behnam Khaleghi, Tianqi Zhang, C. Martino, George Armstrong, Ameen Akel, Ken Curewitz, Justin Eno, S. Eilert, Rob Knight, Niema Moshiri, Tajana Rosing","doi":"10.1109/ICFPT56656.2022.9974548","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974548","url":null,"abstract":"State-of-the-art high-throughput DNA sequencers output terabytes of short reads that typically need to be aligned to a reference genome in order to perform downstream analyses. Because alignment typically dominates the total run time of bioinformatics pipelines, a number of recent work sought to accelerate it in hardware. However, existing FPGA implemen-tations did not fully optimize the alignment algorithms for the FPGA hardware and mainly focused on a subset of alignment problems, e.g., ungapped alignment with a limited number of mismatches, which hinder their practical utility. In this work, we analyze the existing alignment methods and identify and leverage opportunities for FPGA acceleration. Our alignment framework, SALIENT, first carries out an ultra-fast ungapped alignment, which supports a flexible number of mismatches. Based on the underlying bioinformatics pipeline and the information provided by the ungapped aligner, SALIENT then identifies a fraction of reads that need to go through its gapped aligner, thus improving alignment throughput. We extensively evaluate SALIENT using diverse datasets. Experimental results indicate that SALIENT, running on a single Xilinx Alveo U280 device, delivers an average throughput of 546 million bases/second, outperforming the state- of-the-art minimap2 software by 40x, and Bowtie2 by up to 107 x, with a similar or slightly better (~O.l %-0.5 %) alignment and error (false negative/positive) rate. Compared to the existing ungapped FPGA aligners [1]–[4], SALIENT has 9.4-18x higher throughput/Watt, while compared to the gapped aligners [5], [6], it is 28–35 x better. SALIENT achieves 7.6 x higher throughput than Illumina DRAGEN Bio-IT Platform [7].","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127761493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HPIPE NX: Boosting CNN Inference Acceleration Performance with AI-Optimized FPGAs","authors":"Marius Stan, Mathew Hall, M. Ibrahim, Vaughn Betz","doi":"10.1109/ICFPT56656.2022.9974441","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974441","url":null,"abstract":"With the ever-increasing compute demands of artificial intelligence (AI) workloads, there is extensive interest in leveraging field-programmable gate-arrays (FPGAs) to quickly deploy hardware accelerators for the latest convolutional neural networks (CNNs). Recent FPGA architectures are also evolving to better serve the needs of AI, but accelerators need extensive re-design to leverage these new features. The Stratix 10 NX chip by Intel is a new FPGA that replaces traditional DSP blocks with in-fabric AI tensor blocks that provide 15x more multipliers and up to 143 TOPS of performance, at the cost of lower precision (INT8) and significant restrictions on how many operands can be fed to the multipliers from the programmable routing. In this paper, we explore different CNN accelerator structures to leverage the tensor blocks, considering the various tensor block modes, operand bandwidth restrictions, and on-chip memory restrictions. We incorporate the most performant techniques into HPIPE, a layer-pipelined and sparse-aware CNN accelerator for FPGAs. We enhance HPIPE's software compiler to restructure the CNN computations and on-chip memory layout to take advantage of the additional multipliers offered by the new tensor block architecture, while also avoiding stalls due to data loading restrictions. We achieve cycle-by-cycle speedups in tensor mode of up to $mathbf{8}.mathbf{3}mathbf{x}$ for Mobilenet-v1 versus the original HPIPE design using conventional DSPs. On the FPGA, we achieve a throughput of 28,541 and 29,429 images/s on Mobilenet-v1 and Mobilenet-v2 respectively, outperforming all previous FPGA accelerators by at least 4.0x, including one on an AI-optimized Xilinx chip. We also outperform NVIDIA's V100 GPU, a machine learning targeted GPU on a similar process node with a $mathbf{1}.mathbf{7}mathbf{x}$ larger die size, by up to 17x with a batch size of one and 1.3x with NVIDIA's largest reported batch size of 128.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130612763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}