{"title":"DAoG:无参数步长控制的梯度衰减自适应","authors":"Yifan Zhang, Di Zhao, Hongyi Li, Chengwei Pan","doi":"10.1007/s10462-025-11362-z","DOIUrl":null,"url":null,"abstract":"<div><p>As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.</p></div>","PeriodicalId":8449,"journal":{"name":"Artificial Intelligence Review","volume":"58 11","pages":""},"PeriodicalIF":13.9000,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10462-025-11362-z.pdf","citationCount":"0","resultStr":"{\"title\":\"DAoG: decayed adaptation over gradients for parameter-free step size control\",\"authors\":\"Yifan Zhang, Di Zhao, Hongyi Li, Chengwei Pan\",\"doi\":\"10.1007/s10462-025-11362-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. 
Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.</p></div>\",\"PeriodicalId\":8449,\"journal\":{\"name\":\"Artificial Intelligence Review\",\"volume\":\"58 11\",\"pages\":\"\"},\"PeriodicalIF\":13.9000,\"publicationDate\":\"2025-08-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1007/s10462-025-11362-z.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Artificial Intelligence Review\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10462-025-11362-z\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Artificial Intelligence Review","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10462-025-11362-z","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
DAoG: decayed adaptation over gradients for parameter-free step size control
As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.
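To make the step-size rules discussed above concrete, the sketch below implements the published DoG-style parameter-free step size (the maximum distance from the initial point divided by the square root of the accumulated squared gradient norms) for a single tensor, and multiplies it by a geometric decay factor to illustrate the kind of step size decay the abstract describes. The class name DecayedDoGSketch, the decay rate gamma, and the initial-distance constant r_eps are illustrative assumptions; this is not the authors' DAoG implementation, and the paper's parameter compression strategy is not reproduced here.

```python
# Minimal sketch: DoG-style parameter-free step size with an assumed decay factor.
# NOT the authors' DAoG algorithm; the decay schedule (gamma**t) is illustrative only.
import torch


def dog_style_step_size(max_dist_from_x0, grad_sq_sum, eps=1e-8):
    """DoG step size: max ||x_t - x_0|| divided by sqrt of the sum of ||g_i||^2."""
    return max_dist_from_x0 / (grad_sq_sum + eps) ** 0.5


class DecayedDoGSketch:
    """Toy single-tensor optimizer combining the DoG quantities with a decay factor."""

    def __init__(self, param, r_eps=1e-6, gamma=0.999):
        self.param = param                # torch.nn.Parameter to be updated
        self.x0 = param.detach().clone()  # initial point x_0
        self.r_max = r_eps                # running max of ||x_t - x_0|| (r_eps avoids a zero first step)
        self.grad_sq_sum = 0.0            # running sum of squared gradient norms
        self.gamma = gamma                # assumed geometric decay rate (illustrative)
        self.t = 0

    @torch.no_grad()
    def step(self):
        g = self.param.grad
        self.grad_sq_sum += float(g.pow(2).sum())
        self.r_max = max(self.r_max, float((self.param - self.x0).norm()))
        eta = dog_style_step_size(self.r_max, self.grad_sq_sum)
        eta *= self.gamma ** self.t       # decay applied on top of the parameter-free step size
        self.param -= eta * g
        self.t += 1
```

Dropping the gamma factor recovers plain DoG; DoWG differs by weighting the accumulated squared gradient norms with the squared distance estimate. The decayed variant above is only meant to convey why a decaying step size can temper DoG's conservatism late in training, as the abstract argues.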
Journal introduction:
Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.