DAoG: decayed adaptation over gradients for parameter-free step size control

IF 13.9 · CAS Zone 2, Computer Science · JCR Q1, Computer Science, Artificial Intelligence
Yifan Zhang, Di Zhao, Hongyi Li, Chengwei Pan
{"title":"DAoG:无参数步长控制的梯度衰减自适应","authors":"Yifan Zhang,&nbsp;Di Zhao,&nbsp;Hongyi Li,&nbsp;Chengwei Pan","doi":"10.1007/s10462-025-11362-z","DOIUrl":null,"url":null,"abstract":"<div><p>As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":"58 11","pages":""},"PeriodicalIF":13.9000,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-025-11362-z.pdf","citationCount":"0","resultStr":"{\"title\":\"DAoG: decayed adaptation over gradients for parameter-free step size control\",\"authors\":\"Yifan Zhang,&nbsp;Di Zhao,&nbsp;Hongyi Li,&nbsp;Chengwei Pan\",\"doi\":\"10.1007/s10462-025-11362-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. 
Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.</p></div>\",\"PeriodicalId\":8449,\"journal\":{\"name\":\"Artificial Intelligence Review\",\"volume\":\"58 11\",\"pages\":\"\"},\"PeriodicalIF\":13.9000,\"publicationDate\":\"2025-08-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1007/s10462-025-11362-z.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Artificial Intelligence Review\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10462-025-11362-z\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-025-11362-z","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.
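To make the step size rule referred to above more concrete, the following minimal NumPy sketch implements the DoG ("Distance over Gradients") update that DAoG builds on, extended with a hypothetical multiplicative decay factor to illustrate the general idea of decayed adaptation. The function name dog_with_decay, the decay_rate schedule, and the r_eps default are illustrative assumptions; the actual DAoG update rule and its parameter compression strategy are specified in the paper itself, not reproduced here.

# Sketch of a DoG-style parameter-free step size with a hypothetical decay
# factor; setting decay_rate=0.0 recovers plain DoG.
import numpy as np

def dog_with_decay(grad_fn, x0, steps=500, r_eps=1e-6, decay_rate=0.0):
    """Run a DoG-style update on a deterministic objective.

    grad_fn    : function returning the gradient at a point
    x0         : initial iterate (NumPy array)
    r_eps      : small initial "distance" so the first step is nonzero
    decay_rate : hypothetical decay strength (illustration only)
    """
    x = x0.astype(float).copy()
    max_dist = r_eps            # \bar{r}_t = max distance from x0 seen so far
    grad_sq_sum = 0.0           # running sum of squared gradient norms
    for t in range(steps):
        g = grad_fn(x)
        grad_sq_sum += float(np.dot(g, g))
        max_dist = max(max_dist, float(np.linalg.norm(x - x0)))
        # DoG step size: distance over (root of) accumulated gradients.
        eta = max_dist / np.sqrt(grad_sq_sum + 1e-12)
        # Hypothetical decay: shrink the step size as training progresses,
        # loosely mimicking the "decayed adaptation" idea described above.
        eta *= 1.0 / (1.0 + decay_rate * t)
        x = x - eta * g
    return x

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x_final = dog_with_decay(lambda x: x, x0=np.array([5.0, -3.0]), decay_rate=0.01)
print(x_final)  # should end up much closer to the origin than x0

Comparing runs with decay_rate=0.0 (plain DoG) against a small positive decay_rate on the same toy objective gives a rough feel for how a decaying step size trades early progress for late-stage stability, which is the trade-off the abstract attributes to DAoG's decay mechanism.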

Source journal
Artificial Intelligence Review (Engineering & Technology / Computer Science: Artificial Intelligence)
CiteScore: 22.00
Self-citation rate: 3.30%
Articles per year: 194
Review time: 5.3 months
Journal description: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.