Feng Lin, Hanling Yi, Yifan Yang, Hongbin Li, Xiaotian Yu, Guangming Lu, Rong Xiao
Title: BiTA: Bi-directional tuning for lossless acceleration in large language models
DOI: 10.1016/j.eswa.2025.127305
Journal: Expert Systems with Applications, Vol. 279, Article 127305 (Q1, Impact Factor 7.5)
Published: 2025-04-02
Code: https://github.com/linfeng93/BiTA
URL: https://www.sciencedirect.com/science/article/pii/S0957417425009273
Citations: 0
Abstract
Large language models (LLMs) typically employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. An effective strategy to mitigate this inefficiency is speculative decoding, which reduces the number of model inference calls, thereby lowering memory bandwidth requirements. In this paper, we propose BiTA (Bi-directional Tuning for lossless Acceleration), an innovative speculative decoding method that expedites LLMs through streamlined semi-autoregressive generation and draft verification. BiTA enhances LLMs with a parameter-efficient design called bi-directional tuning, enabling semi-autoregressive generation, while leveraging an efficient tree-based decoding mechanism to perform draft candidate generation and verification in parallel, ensuring that the outputs of accelerated LLMs remain identical to those of their original autoregressive counterparts. As a lightweight plug-in module, BiTA seamlessly boosts the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying BiTA, LLaMA-2-70B-Chat achieves a 2.7× speedup on the MT-Bench benchmark. Extensive experiments confirm that BiTA surpasses state-of-the-art speculative decoding methods. The code is available at https://github.com/linfeng93/BiTA.
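The "lossless" guarantee in the abstract rests on the standard speculative-decoding verify step: drafted tokens are only kept while they match what the base model would have produced greedily, so the final sequence is bit-identical to plain autoregressive decoding. The sketch below illustrates that acceptance rule with a toy deterministic model; `verify_draft` and `toy_model` are illustrative names, not the paper's API, and a real implementation would score all draft positions in one parallel forward pass (BiTA additionally verifies a tree of candidates) rather than looping.

```python
def verify_draft(base_next_token, prefix, draft):
    """One draft-verify step of greedy speculative decoding.

    base_next_token(seq) -> the base model's greedy next token for seq.
    Returns the tokens actually emitted: the longest matching draft
    prefix, plus one token from the base model (either the correction
    at the first mismatch, or a bonus token if every draft token matched).
    """
    accepted = []
    seq = list(prefix)
    for tok in draft:
        expected = base_next_token(seq)
        if tok == expected:
            # Draft token agrees with the base model: keep it for free.
            accepted.append(tok)
            seq.append(tok)
        else:
            # First mismatch: discard the rest of the draft and emit
            # the base model's own token instead.
            accepted.append(expected)
            break
    else:
        # All draft tokens matched; the verification pass also yields
        # one extra token at the position after the draft.
        accepted.append(base_next_token(seq))
    return accepted

# Toy deterministic "model": the next token is the previous token + 1.
toy_model = lambda seq: seq[-1] + 1

print(verify_draft(toy_model, [0], [1, 2, 3]))  # -> [1, 2, 3, 4]
print(verify_draft(toy_model, [0], [1, 9, 3]))  # -> [1, 2]
```

Either way, every emitted token is one the base model itself endorses, which is why the accelerated output is identical to the autoregressive one; the speedup comes from verifying several positions per model call instead of one.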
About the journal
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.