BiTA: Bi-directional tuning for lossless acceleration in large language models

IF 7.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems with Applications Pub Date : 2025-04-02 DOI:10.1016/j.eswa.2025.127305

Feng Lin, Hanling Yi, Yifan Yang, Hongbin Li, Xiaotian Yu, Guangming Lu, Rong Xiao

Volume 279, Article 127305. Full text: https://www.sciencedirect.com/science/article/pii/S0957417425009273

Citations: 0

Abstract

Large language models (LLMs) typically employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. An effective strategy to mitigate this inefficiency is speculative decoding, which reduces the number of model inference calls, thereby lowering memory bandwidth requirements. In this paper, we propose BiTA (Bi-directional Tuning for lossless Acceleration), an innovative speculative decoding method that expedites LLMs through streamlined semi-autoregressive generation and draft verification. BiTA enhances LLMs with a parameter-efficient design called bi-directional tuning, enabling semi-autoregressive generation, while leveraging an efficient tree-based decoding mechanism to perform draft candidate generation and verification in parallel, ensuring that the outputs of accelerated LLMs remain identical to those of their original autoregressive counterparts. As a lightweight plug-in module, BiTA seamlessly boosts the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying BiTA, LLaMA-2-70B-Chat achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments confirm that BiTA surpasses state-of-the-art speculative decoding methods. The code is available at https://github.com/linfeng93/BiTA.
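To make the "lossless" claim concrete, the following is a minimal toy sketch of greedy draft verification, the general idea behind speculative decoding. It is not BiTA's implementation: `target_next` and `noisy_draft` are hypothetical stand-ins for a real LLM and a real drafter, and the sketch uses a single linear draft rather than the tree-based candidates the abstract mentions. Each verification step is counted as one model call, on the assumption that a real model would score all draft positions in one parallel forward pass.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # hypothetical toy vocabulary size


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for an LLM's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def noisy_draft(prefix: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: proposes k tokens, deliberately wrong at some positions."""
    out: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % VOCAB  # inject a wrong guess
        out.append(tok)
        ctx.append(tok)
    return out


def autoregressive(prompt: List[Token], n: int) -> Tuple[List[Token], int]:
    """Baseline: one model call per generated token."""
    seq, calls = list(prompt), 0
    while len(seq) - len(prompt) < n:
        seq.append(target_next(seq))
        calls += 1
    return seq[len(prompt):], calls


def speculative(prompt: List[Token], n: int, k: int = 4) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the longest prefix the target model agrees with."""
    seq, calls = list(prompt), 0
    while len(seq) - len(prompt) < n:
        draft = noisy_draft(seq, k)
        calls += 1  # assumed: one parallel forward pass verifies the whole draft
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: use the model's own token
                break
            accepted.append(tok)
        else:
            # Whole draft accepted: the same pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], calls


if __name__ == "__main__":
    ar_out, ar_calls = autoregressive([1, 2], 12)
    sp_out, sp_calls = speculative([1, 2], 12)
    print("identical output:", ar_out == sp_out)
    print("model calls:", ar_calls, "vs", sp_calls)
```

Because every accepted token equals what the target model would have produced for the same prefix, the output matches plain greedy decoding token for token; only the number of verification passes changes. The real speedup of any such method depends on how often drafts are accepted and on the cost of producing them.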


Source journal

Expert Systems with Applications (Engineering and Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 10.60%

Number of publications: 2045

Review time: 8.7 months

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.