High-Dimensional Hyperparameter Optimization via Adjoint Differentiation
Hongkun Dou; Hongjue Li; Jinyang Du; Leyuan Fang; Qing Gao; Yue Deng; Wen Yao
IEEE Transactions on Artificial Intelligence, vol. 6, no. 8, pp. 2148-2162, published 2025-02-11. DOI: 10.1109/TAI.2025.3540799. https://ieeexplore.ieee.org/document/10880096/
As an emerging machine learning task, high-dimensional hyperparameter optimization (HO) aims to enhance traditional deep learning models by simultaneously optimizing the neural network's weights and hyperparameters in a joint bilevel configuration. However, such nested objectives make it nontrivial to obtain the gradient of the validation risk with respect to the hyperparameters (the hypergradient). To tackle this challenge, we revisit the bilevel objective from the novel perspective of continuous dynamics and then solve the whole HO problem with adjoint state theory. The proposed HO framework, termed Adjoint Diff, scales naturally to very deep neural networks with high-dimensional hyperparameters because it requires only a constant memory cost during training. Adjoint Diff is in fact a general framework: several existing gradient-based HO algorithms can be interpreted within it through simple algebra. In addition, we offer the Adjoint Diff+ framework, which incorporates the prevalent momentum learning concept into the basic Adjoint Diff for enhanced convergence. Experimental results show that our Adjoint Diff frameworks outperform several state-of-the-art approaches on three high-dimensional HO instances: designing a loss function for imbalanced data, selecting samples from noisy labels, and learning auxiliary tasks for fine-grained classification.
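To make the adjoint-state idea and the constant-memory claim concrete, below is a minimal sketch of adjoint-based hypergradient computation on a toy bilevel problem. This is not the authors' Adjoint Diff implementation: the quadratic training loss, the gradient-flow dynamics, the Euler integration, and all identifiers (f, forward, adjoint_hypergradient, w_star, lam) are illustrative assumptions used only to show the mechanism.

```python
# Minimal sketch (assumptions noted above, not the paper's code) of an adjoint-state
# hypergradient. Inner "training" is modelled as continuous gradient flow on
# L_train(w, lam) = 0.5 * ||w - lam||^2; the outer objective is
# L_val(w(T)) = 0.5 * ||w(T) - w_star||^2. The backward pass carries only the
# current state, adjoint, and gradient accumulator, so memory is constant in the
# number of forward steps.
import numpy as np

w_star = np.array([1.0, -2.0])      # hypothetical validation target
w0 = np.zeros(2)                    # initial weights
T, n_steps = 2.0, 2000
dt = T / n_steps

def f(w, lam):
    """Gradient-flow dynamics dw/dt = -dL_train/dw = -(w - lam)."""
    return -(w - lam)

def forward(lam):
    """Euler integration of the inner dynamics; keeps only the final state."""
    w = w0.copy()
    for _ in range(n_steps):
        w = w + dt * f(w, lam)
    return w

def adjoint_hypergradient(lam):
    """Adjoint ODE integrated backward in time:
       da/dt = -a * df/dw,   dL_val/dlam = integral_0^T a * df/dlam dt."""
    wT = forward(lam)
    a = wT - w_star                 # a(T) = dL_val/dw at the final state
    w = wT.copy()
    grad = np.zeros_like(lam)
    for _ in range(n_steps):
        # For this toy dynamics df/dw = -I and df/dlam = I are constant;
        # w is reconstructed backward anyway to illustrate how a general
        # (state-dependent) Jacobian would be evaluated without storing the
        # forward trajectory.
        grad += dt * a              # accumulate a^T * df/dlam
        a = a - dt * a              # step da/dt = -a * df/dw = a backward
        w = w - dt * f(w, lam)      # reconstruct the state backward
    return grad

def finite_difference(lam, eps=1e-5):
    """Central-difference check of the hypergradient."""
    g = np.zeros_like(lam)
    for i in range(lam.size):
        e = np.zeros_like(lam); e[i] = eps
        lp = 0.5 * np.sum((forward(lam + e) - w_star) ** 2)
        lm = 0.5 * np.sum((forward(lam - e) - w_star) ** 2)
        g[i] = (lp - lm) / (2 * eps)
    return g

lam = np.array([0.5, 0.5])
print("adjoint hypergradient :", adjoint_hypergradient(lam))
print("finite-difference check:", finite_difference(lam))
```

In this sketch the outer update would simply be lam -= lr * adjoint_hypergradient(lam). Following the abstract, a momentum-style variant in the spirit of Adjoint Diff+ would augment the inner dynamics with a velocity state and derive the corresponding adjoint for the augmented system; that extension is not shown here.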