Optimal RoPE extension via Bayesian Optimization for training-free length generalization

Impact Factor: 14.8

AI Open. Pub Date: 2025-01-01. DOI: 10.1016/j.aiopen.2025.01.002

Xinrong Zhang, Shengding Hu, Weilin Zhao, Huadong Wang, Xu Han, Chaoqun He, Guoyang Zeng, Zhiyuan Liu, Maosong Sun

AI Open, Volume 6, Pages 1-11. Journal Article. https://www.sciencedirect.com/science/article/pii/S2666651025000026

Citations: 0

Abstract

Transformers are designed to process input of variable length without resource constraints. However, their performance significantly deteriorates when the input surpasses a threshold slightly larger than the pre-training context window. This limitation on the effective context window confines the application of Transformer-based large language models (LLMs) that have been the subject of great anticipation. Consequently, the generalization of pre-trained LLMs to handle varying input lengths becomes a pivotal and formidable challenge. Previous research has endeavored to address this challenge by modifying the Rotary Position Embedding (RoPE), the primary factor responsible for disparities in handling different input lengths. These efforts have provided valuable insights, while they often lack a deep understanding of the root causes of performance degradation and rely heavily on manual parameter tuning. In response to these issues, we conduct a comprehensive analysis and identify two primary causes behind the performance drop: global distribution mismatch and local resolution degradation. In light of these challenges, we introduce an Optimal RoPE (ORoPE) extension using Bayesian Optimization (BO), which alleviates the need for additional model training. Our experiments demonstrate the efficacy of our approach, outperforming baselines by up to 21.9%, 32.1%, and 41.2% at evaluation lengths of 8K, 16K, and 32K, respectively. We will release all code and data when this paper is published.
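
To make "modifying the Rotary Position Embedding" concrete, the following is a minimal Python sketch of how RoPE rotation angles depend on position and how a per-frequency scale factor changes them. It is not the paper's ORoPE method: the abstract does not say which parameters Bayesian Optimization tunes, so the `scales` vector here is only an assumed stand-in for "some tunable RoPE parameters". The function names `rope_frequencies` and `rope_angles` are hypothetical, and the base of 10000 is the commonly used RoPE default rather than a value taken from this paper.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies: theta_i = base^(-2i/d), for i in [0, d/2)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rope_angles(position, head_dim, scales=None, base=10000.0):
    """Rotation angle for each frequency at one token position.

    scales[i] > 1 slows down frequency i (hypothetical tunable parameters;
    the abstract does not specify what the paper actually optimizes).
    """
    freqs = rope_frequencies(head_dim, base)
    if scales is None:
        scales = [1.0] * len(freqs)
    if len(scales) != len(freqs):
        raise ValueError("need one scale per frequency")
    return [position * f / s for f, s in zip(freqs, scales)]


if __name__ == "__main__":
    head_dim = 8
    # Unscaled angles at a position inside an assumed 4096-token window.
    inside = rope_angles(4095, head_dim)
    # Angles at a position beyond that window, with every frequency slowed 2x.
    outside_scaled = rope_angles(8190, head_dim, scales=[2.0] * (head_dim // 2))
    for a, b in zip(inside, outside_scaled):
        print(f"{a:.6f}  {b:.6f}")
```

With a uniform scale of 2, position 8190 produces the same angles as unscaled position 4095, so angles at longer inputs stay inside the range seen during pre-training, at the cost of halving the angular distance between neighbouring positions. A search procedure could choose non-uniform `scales` instead of a hand-picked constant; how the paper does this is not described in the abstract.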


Source journal

AI Open

CiteScore

45.00

Self-citation rate

0.00%

Publication count