{"title":"F3: An FPGA-Based Transformer Fine-Tuning Accelerator With Flexible Floating Point Format","authors":"Zerong He;Xi Jin;Zhongguang Xu","doi":"10.1109/JETCAS.2025.3555970","DOIUrl":null,"url":null,"abstract":"Transformers have demonstrated remarkable success across various deep learning tasks. However, their inference and fine-tuning require substantial computation and memory resources, posing challenges for existing hardware platforms, particularly resource-constrained edge devices. To address these limitations, we propose F<sup>3</sup>, an FPGA-based accelerator for transformer fine-tuning. To reduce computation and memory overhead, this paper proposes a flexible floating point (FFP) format which consumes fewer resources than traditional floating-point formats of the same bitwidth. We adapt low-rank adaptation to FFP format and propose a fine-tuning strategy named LR-FFP which reduces the number of trainable parameters without compromising fine-tuning accuracy. At the hardware level, we design specialized processing elements (PEs) for the FFP format. The PE maximizes the utilization of DSP resources, enabling a single DSP to perform two multiply-accumulate operations per cycle. The PEs are organized into a systolic array (SA) to efficiently handle general matrix multiplication during fine-tuning. Through theoretical analysis and experimental evaluation, we determine the optimal dataflow and SA parameters to balance performance and resource consumption. We implement the architecture on the Xilinx VCU128 FPGA platform and F<sup>3</sup> achieves a performance of 8.2 TFlops at 250 MHz. Compared with CPU and GPU implementations, F<sup>3</sup> achieves speedups of <inline-formula> <tex-math>$15.22 \\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$3.44 \\times $ </tex-math></inline-formula>, respectively, and energy efficiency improvements of <inline-formula> <tex-math>$70.52 \\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$9.44 \\times $ </tex-math></inline-formula>.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"258-271"},"PeriodicalIF":3.8000,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10945317/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
Transformers have demonstrated remarkable success across various deep learning tasks. However, their inference and fine-tuning require substantial computation and memory resources, posing challenges for existing hardware platforms, particularly resource-constrained edge devices. To address these limitations, we propose F³, an FPGA-based accelerator for transformer fine-tuning. To reduce computation and memory overhead, we propose a flexible floating point (FFP) format that consumes fewer resources than traditional floating-point formats of the same bitwidth. We adapt low-rank adaptation to the FFP format and propose a fine-tuning strategy named LR-FFP, which reduces the number of trainable parameters without compromising fine-tuning accuracy. At the hardware level, we design specialized processing elements (PEs) for the FFP format. Each PE maximizes the utilization of DSP resources, enabling a single DSP to perform two multiply-accumulate operations per cycle. The PEs are organized into a systolic array (SA) to efficiently handle the general matrix multiplications that arise during fine-tuning. Through theoretical analysis and experimental evaluation, we determine the optimal dataflow and SA parameters to balance performance and resource consumption. We implement the architecture on the Xilinx VCU128 FPGA platform, where F³ achieves 8.2 TFLOPS at 250 MHz. Compared with CPU and GPU implementations, F³ achieves speedups of 15.22× and 3.44×, respectively, and energy efficiency improvements of 70.52× and 9.44×.
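For readers unfamiliar with the low-rank adaptation that LR-FFP builds on, the sketch below shows the standard LoRA forward pass. The abstract does not detail how the adapters are quantized into the FFP format, so the names, the scaling convention, and the sizes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Standard LoRA forward pass: y = x @ W + (alpha / r) * x @ A @ B.

    W (d_in x d_out) stays frozen; only the low-rank factors
    A (d_in x r) and B (r x d_out) are trained, shrinking the
    trainable parameter count from d_in*d_out to r*(d_in + d_out).
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

# Hypothetical sizes: a 768x768 layer with rank-8 adapters trains
# ~12k parameters instead of ~590k.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))
W = rng.standard_normal((768, 768))
A = rng.standard_normal((768, 8)) * 0.01
B = np.zeros((8, 768))  # common init: B = 0, so the adapter starts as a no-op
y = lora_forward(x, W, A, B)
```

The claim that a single DSP performs two multiply-accumulate operations per cycle typically rests on operand packing inside the DSP's wide multiplier. The unsigned toy sketch below demonstrates the arithmetic identity behind such packing; the paper's actual FFP PE design is not described in the abstract, so this shows only the generic idea.

```python
def packed_dual_multiply(a, b, c, shift=18):
    """Compute a*c and b*c with ONE wide multiply:
    ((a << shift) | b) * c == (a*c << shift) + b*c,
    so both products are recoverable as long as b*c < 2**shift.
    (Unsigned operands only; real DSP packing must also handle signs
    and cross-term correction, which this toy version omits.)
    """
    packed = (a << shift) | b
    product = packed * c
    lo = product & ((1 << shift) - 1)  # extracts b*c
    hi = product >> shift              # extracts a*c
    return hi, lo

assert packed_dual_multiply(100, 57, 33) == (100 * 33, 57 * 33)
```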
Journal Introduction:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits special issues, with particular emphasis on emerging areas, on topics covering the entire scope of the IEEE Circuits and Systems (CAS) Society: the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.