Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank

Impact factor 2.6 | Zone 2 (Mathematics) | JCR Q1: Mathematics, Applied
Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, Holger Rauhut
Applied and Computational Harmonic Analysis, volume 68, Article 101595. Published 2023-09-06.
DOI: 10.1016/j.acha.2023.101595
https://www.sciencedirect.com/science/article/pii/S1063520323000829

Abstract

In deep learning, it is common to use more network parameters than training points. In such an overparameterized scenario, there are usually multiple networks that achieve zero training error, so the training algorithm induces an implicit bias on the computed solution. In practice, (stochastic) gradient descent tends to prefer solutions that generalize well, which provides a possible explanation for the success of deep learning. In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem. Although we are not in an overparameterized scenario, our analysis nevertheless provides insights into the phenomenon of implicit bias. In fact, we derive a rigorous analysis of the dynamics of vanilla gradient descent and characterize the dynamical convergence of the spectrum. We are able to accurately locate time intervals in which the effective rank of the iterates is close to the effective rank of a low-rank projection of the ground-truth matrix. In practice, those intervals can be used as criteria for early stopping if a certain regularity is desired. We also provide empirical evidence for implicit bias in more general scenarios, such as matrix sensing and random initialization. This suggests that deep learning prefers trajectories whose complexity (measured in terms of effective rank) is monotonically increasing, which we believe is a fundamental concept for the theoretical understanding of deep learning.
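
The setting described in the abstract can be illustrated with a small numerical sketch. Everything specific below is an assumption made for illustration and is not taken from the abstract: the loss 0.5 * ||W_N ... W_1 - target||_F^2, the identical scaled-identity initialization `alpha * I`, the diagonal target, the step size, and the definition of effective rank as the ratio of nuclear norm to spectral norm. The function names are hypothetical.

```python
import numpy as np

def effective_rank(W):
    # Assumed definition: nuclear norm divided by spectral norm.
    s = np.linalg.svd(W, compute_uv=False)
    return s.sum() / s.max()

def product(factors, n):
    # Returns factors[-1] @ ... @ factors[0]; the identity for an empty list.
    P = np.eye(n)
    for W in factors:
        P = W @ P
    return P

def deep_factorization_gd(target, depth=3, alpha=0.1, step=0.05, iters=2000):
    # Vanilla gradient descent on 0.5 * ||W_N ... W_1 - target||_F^2.
    n = target.shape[0]
    factors = [alpha * np.eye(n) for _ in range(depth)]  # assumed initialization
    ranks = []
    for _ in range(iters):
        residual = product(factors, n) - target
        grads = []
        for j in range(depth):
            left = product(factors[j + 1:], n)   # factors applied after W_j
            right = product(factors[:j], n)      # factors applied before W_j
            grads.append(left.T @ residual @ right.T)
        factors = [W - step * G for W, G in zip(factors, grads)]
        ranks.append(effective_rank(product(factors, n)))
    return product(factors, n), ranks

# Toy ground truth with well-separated eigenvalues (illustrative choice).
target = np.diag([3.0, 1.0, 0.3])
W, ranks = deep_factorization_gd(target)
print("effective rank: first", ranks[0], "min", min(ranks), "last", ranks[-1])
```

In this toy run the recorded effective rank starts near n because the scaled identity has a flat spectrum, drops close to 1 once the largest eigenvalue has been fitted, and then rises towards the value of the target. Stopping while the curve is flat near 1 would return an approximately rank-one iterate, which is the kind of early-stopping criterion the abstract refers to.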

Source journal
Applied and Computational Harmonic Analysis
Category: Physics - Physics, Mathematical
CiteScore: 5.40
Self-citation rate: 4.00%
Articles published: 67
Review time: 22.9 weeks
Journal description: Applied and Computational Harmonic Analysis (ACHA) is an interdisciplinary journal that publishes high-quality papers in all areas of mathematical sciences related to the applied and computational aspects of harmonic analysis, with special emphasis on innovative theoretical development, methods, and algorithms for information processing, manipulation, understanding, and so forth. The objectives of the journal are to chronicle the important publications in the rapidly growing field of data representation and analysis, to stimulate research in relevant interdisciplinary areas, and to provide a common link among mathematical, physical, and life scientists, as well as engineers.