DAoG: decayed adaptation over gradients for parameter-free step size control

IF 13.9 · CAS Zone 2, Computer Science · JCR Q1, Computer Science, Artificial Intelligence
Yifan Zhang, Di Zhao, Hongyi Li, Chengwei Pan
{"title":"DAoG:无参数步长控制的梯度衰减自适应","authors":"Yifan Zhang,&nbsp;Di Zhao,&nbsp;Hongyi Li,&nbsp;Chengwei Pan","doi":"10.1007/s10462-025-11362-z","DOIUrl":null,"url":null,"abstract":"<div><p>As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":"58 11","pages":""},"PeriodicalIF":13.9000,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-025-11362-z.pdf","citationCount":"0","resultStr":"{\"title\":\"DAoG: decayed adaptation over gradients for parameter-free step size control\",\"authors\":\"Yifan Zhang,&nbsp;Di Zhao,&nbsp;Hongyi Li,&nbsp;Chengwei Pan\",\"doi\":\"10.1007/s10462-025-11362-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. 
Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.</p></div>\",\"PeriodicalId\":8449,\"journal\":{\"name\":\"Artificial Intelligence Review\",\"volume\":\"58 11\",\"pages\":\"\"},\"PeriodicalIF\":13.9000,\"publicationDate\":\"2025-08-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1007/s10462-025-11362-z.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Artificial Intelligence Review\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10462-025-11362-z\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-025-11362-z","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.
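To make the step size rule referred to above more concrete, the following minimal NumPy sketch implements the DoG ("Distance over Gradients") update that DAoG builds on, extended with a hypothetical multiplicative decay factor to illustrate the general idea of decayed adaptation. The function name dog_with_decay, the decay_rate schedule, and the r_eps default are illustrative assumptions; the actual DAoG update rule and its parameter compression strategy are specified in the paper itself, not reproduced here.

# Sketch of a DoG-style parameter-free step size with a hypothetical decay
# factor; setting decay_rate=0.0 recovers plain DoG.
import numpy as np

def dog_with_decay(grad_fn, x0, steps=500, r_eps=1e-6, decay_rate=0.0):
    """Run a DoG-style update on a deterministic objective.

    grad_fn    : function returning the gradient at a point
    x0         : initial iterate (NumPy array)
    r_eps      : small initial "distance" so the first step is nonzero
    decay_rate : hypothetical decay strength (illustration only)
    """
    x = x0.astype(float).copy()
    max_dist = r_eps            # \bar{r}_t = max distance from x0 seen so far
    grad_sq_sum = 0.0           # running sum of squared gradient norms
    for t in range(steps):
        g = grad_fn(x)
        grad_sq_sum += float(np.dot(g, g))
        max_dist = max(max_dist, float(np.linalg.norm(x - x0)))
        # DoG step size: distance over (root of) accumulated gradients.
        eta = max_dist / np.sqrt(grad_sq_sum + 1e-12)
        # Hypothetical decay: shrink the step size as training progresses,
        # loosely mimicking the "decayed adaptation" idea described above.
        eta *= 1.0 / (1.0 + decay_rate * t)
        x = x - eta * g
    return x

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x_final = dog_with_decay(lambda x: x, x0=np.array([5.0, -3.0]), decay_rate=0.01)
print(x_final)  # should end up much closer to the origin than x0

Comparing runs with decay_rate=0.0 (plain DoG) against a small positive decay_rate on the same toy objective gives a rough feel for how a decaying step size trades early progress for late-stage stability, which is the trade-off the abstract attributes to DAoG's decay mechanism.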

Source journal
Artificial Intelligence Review (Engineering & Technology / Computer Science: Artificial Intelligence)
CiteScore: 22.00
Self-citation rate: 3.30%
Articles per year: 194
Review time: 5.3 months
Journal description: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.