{"title":"Geometric Dynamics of Signal Propagation Predict Trainability of Transformers","authors":"Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, Surya Ganguli","doi":"arxiv-2403.02579","DOIUrl":null,"url":null,"abstract":"We investigate forward signal propagation and gradient back propagation in\ndeep, randomly initialized transformers, yielding simple necessary and\nsufficient conditions on initialization hyperparameters that ensure\ntrainability of deep transformers. Our approach treats the evolution of the\nrepresentations of $n$ tokens as they propagate through the transformer layers\nin terms of a discrete time dynamical system of $n$ interacting particles. We\nderive simple update equations for the evolving geometry of this particle\nsystem, starting from a permutation symmetric simplex. Our update equations\nshow that without MLP layers, this system will collapse to a line, consistent\nwith prior work on rank collapse in transformers. However, unlike prior work,\nour evolution equations can quantitatively track particle geometry in the\nadditional presence of nonlinear MLP layers, and it reveals an order-chaos\nphase transition as a function of initialization hyperparameters, like the\nstrength of attentional and MLP residual connections and weight variances. In\nthe ordered phase the particles are attractive and collapse to a line, while in\nthe chaotic phase the particles are repulsive and converge to a regular\n$n$-simplex. We analytically derive two Lyapunov exponents: an angle exponent\nthat governs departures from the edge of chaos in this particle system, and a\ngradient exponent that governs the rate of exponential growth or decay of\nbackpropagated gradients. We show through experiments that, remarkably, the\nfinal test loss at the end of training is well predicted just by these two\nexponents at the beginning of training, and that the simultaneous vanishing of\nthese two exponents yields a simple necessary and sufficient condition to\nachieve minimal test loss.","PeriodicalId":501066,"journal":{"name":"arXiv - PHYS - Disordered Systems and Neural Networks","volume":"16 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Disordered Systems and Neural Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2403.02579","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
We investigate forward signal propagation and gradient backpropagation in
deep, randomly initialized transformers, yielding simple necessary and
sufficient conditions on initialization hyperparameters that ensure
trainability of deep transformers. Our approach treats the evolution of the
representations of $n$ tokens as they propagate through the transformer layers
in terms of a discrete-time dynamical system of $n$ interacting particles. We
derive simple update equations for the evolving geometry of this particle
system, starting from a permutation-symmetric simplex. Our update equations
show that without MLP layers, this system will collapse to a line, consistent
with prior work on rank collapse in transformers. However, unlike prior work,
our evolution equations can quantitatively track particle geometry in the
additional presence of nonlinear MLP layers, and they reveal an order-chaos
phase transition as a function of initialization hyperparameters, such as the
strength of attentional and MLP residual connections and weight variances. In
the ordered phase the particles are attractive and collapse to a line, while in
the chaotic phase the particles are repulsive and converge to a regular
$n$-simplex. We analytically derive two Lyapunov exponents: an angle exponent
that governs departures from the edge of chaos in this particle system, and a
gradient exponent that governs the rate of exponential growth or decay of
backpropagated gradients. We show through experiments that, remarkably, the
final test loss at the end of training is well predicted just by these two
exponents at the beginning of training, and that the simultaneous vanishing of
these two exponents yields a simple necessary and sufficient condition to
achieve minimal test loss.
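
As a rough illustration of the particle picture described above, the sketch below propagates $n$ random token vectors through randomly initialized residual attention and MLP blocks and reports their mean pairwise cosine similarity at the final layer. The block structure, hyperparameter names (alpha_attn, alpha_mlp, sigma_mlp), tanh nonlinearity, and per-token renormalization are illustrative assumptions rather than the paper's exact parameterization; the sketch only shows the qualitative competition between attentional attraction (collapse toward a line) and MLP-driven repulsion, not the paper's analytical update equations or Lyapunov exponents.

```python
# Toy sketch (assumptions noted above): n token "particles" propagated through
# random residual attention + MLP blocks, tracking their angular geometry.
import numpy as np


def mean_pairwise_cosine(n=8, d=256, depth=50,
                         alpha_attn=1.0, alpha_mlp=0.0, sigma_mlp=3.0, seed=0):
    rng = np.random.default_rng(seed)
    # i.i.d. Gaussian tokens: at large d all pairwise angles are nearly equal,
    # i.e. an approximately permutation-symmetric starting configuration.
    X = rng.normal(size=(n, d))
    for _ in range(depth):
        # Residual single-head softmax self-attention with random weights.
        Wq, Wk, Wv = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
                      for _ in range(3)]
        scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
        A = np.exp(scores - scores.max(axis=1, keepdims=True))
        A /= A.sum(axis=1, keepdims=True)
        X = X + alpha_attn * (A @ X @ Wv)
        # Residual tanh MLP applied tokenwise; sigma_mlp sets its weight scale.
        W1 = rng.normal(scale=sigma_mlp / np.sqrt(d), size=(d, d))
        W2 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
        X = X + alpha_mlp * (np.tanh(X @ W1) @ W2)
        # Rescale each token to norm sqrt(d): only the angular geometry evolves.
        X = X * np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
    C = (X @ X.T) / d  # pairwise cosine similarities at the final layer
    return C[~np.eye(n, dtype=bool)].mean()


if __name__ == "__main__":
    # Attention-dominated: tokens attract and the mean pairwise cosine drifts
    # toward 1 (collapse toward a line, as in rank collapse).
    print("attention only:", mean_pairwise_cosine(alpha_attn=1.0, alpha_mlp=0.0))
    # Strong random MLP branch: it decorrelates tokens and counteracts the
    # collapse, so the mean pairwise cosine stays far below 1.
    print("strong MLP:    ", mean_pairwise_cosine(alpha_attn=0.1, alpha_mlp=1.0))
```

Increasing alpha_mlp or sigma_mlp relative to alpha_attn moves this toy model away from collapse, loosely mirroring the ordered-to-chaotic transition in initialization hyperparameters that the abstract describes.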