{"title":"自注意神经网络的动态平均场理论","authors":"Ángel Poc-López, Miguel Aguilera","doi":"arxiv-2406.07247","DOIUrl":null,"url":null,"abstract":"Transformer-based models have demonstrated exceptional performance across\ndiverse domains, becoming the state-of-the-art solution for addressing\nsequential machine learning problems. Even though we have a general\nunderstanding of the fundamental components in the transformer architecture,\nlittle is known about how they operate or what are their expected dynamics.\nRecently, there has been an increasing interest in exploring the relationship\nbetween attention mechanisms and Hopfield networks, promising to shed light on\nthe statistical physics of transformer networks. However, to date, the\ndynamical regimes of transformer-like models have not been studied in depth. In\nthis paper, we address this gap by using methods for the study of asymmetric\nHopfield networks in nonequilibrium regimes --namely path integral methods over\ngenerating functionals, yielding dynamics governed by concurrent mean-field\nvariables. Assuming 1-bit tokens and weights, we derive analytical\napproximations for the behavior of large self-attention neural networks coupled\nto a softmax output, which become exact in the large limit size. Our findings\nreveal nontrivial dynamical phenomena, including nonequilibrium phase\ntransitions associated with chaotic bifurcations, even for very simple\nconfigurations with a few encoded features and a very short context window.\nFinally, we discuss the potential of our analytic approach to improve our\nunderstanding of the inner workings of transformer models, potentially reducing\ncomputational training costs and enhancing model interpretability.","PeriodicalId":501066,"journal":{"name":"arXiv - PHYS - Disordered Systems and Neural Networks","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dynamical Mean-Field Theory of Self-Attention Neural Networks\",\"authors\":\"Ángel Poc-López, Miguel Aguilera\",\"doi\":\"arxiv-2406.07247\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer-based models have demonstrated exceptional performance across\\ndiverse domains, becoming the state-of-the-art solution for addressing\\nsequential machine learning problems. Even though we have a general\\nunderstanding of the fundamental components in the transformer architecture,\\nlittle is known about how they operate or what are their expected dynamics.\\nRecently, there has been an increasing interest in exploring the relationship\\nbetween attention mechanisms and Hopfield networks, promising to shed light on\\nthe statistical physics of transformer networks. However, to date, the\\ndynamical regimes of transformer-like models have not been studied in depth. In\\nthis paper, we address this gap by using methods for the study of asymmetric\\nHopfield networks in nonequilibrium regimes --namely path integral methods over\\ngenerating functionals, yielding dynamics governed by concurrent mean-field\\nvariables. Assuming 1-bit tokens and weights, we derive analytical\\napproximations for the behavior of large self-attention neural networks coupled\\nto a softmax output, which become exact in the large limit size. 
Our findings\\nreveal nontrivial dynamical phenomena, including nonequilibrium phase\\ntransitions associated with chaotic bifurcations, even for very simple\\nconfigurations with a few encoded features and a very short context window.\\nFinally, we discuss the potential of our analytic approach to improve our\\nunderstanding of the inner workings of transformer models, potentially reducing\\ncomputational training costs and enhancing model interpretability.\",\"PeriodicalId\":501066,\"journal\":{\"name\":\"arXiv - PHYS - Disordered Systems and Neural Networks\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - PHYS - Disordered Systems and Neural Networks\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2406.07247\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Disordered Systems and Neural Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.07247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Dynamical Mean-Field Theory of Self-Attention Neural Networks
Transformer-based models have demonstrated exceptional performance across
diverse domains, becoming the state-of-the-art solution for addressing
sequential machine learning problems. Even though we have a general
understanding of the fundamental components in the transformer architecture,
little is known about how they operate or what their expected dynamics are.
Recently, there has been an increasing interest in exploring the relationship
between attention mechanisms and Hopfield networks, promising to shed light on
the statistical physics of transformer networks. However, to date, the
dynamical regimes of transformer-like models have not been studied in depth. In
this paper, we address this gap by using methods for the study of asymmetric
Hopfield networks in nonequilibrium regimes, namely path integral methods over
generating functionals, yielding dynamics governed by concurrent mean-field
variables. Assuming 1-bit tokens and weights, we derive analytical
approximations for the behavior of large self-attention neural networks coupled
to a softmax output, which become exact in the large-size limit. Our findings
reveal nontrivial dynamical phenomena, including nonequilibrium phase
transitions associated with chaotic bifurcations, even for very simple
configurations with a few encoded features and a very short context window.
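
To make the setting concrete, here is a minimal toy simulation of a self-attention map with 1-bit (+/-1) tokens and weights and a softmax over a short context window, in the spirit of the model described above. The weight matrix W, the gain beta, the sign binarization, and the overlap diagnostic are illustrative assumptions rather than the paper's exact construction; the point is that even such a small loop can settle to a fixed point, lock into a cycle, or wander irregularly.

import numpy as np

rng = np.random.default_rng(0)

N = 512      # token dimension (1-bit entries)
T = 4        # context window length
beta = 2.0   # softmax gain (inverse temperature)

# Hypothetical 1-bit weight matrix standing in for the encoded features.
W = rng.choice([-1.0, 1.0], size=(N, N))

def step(context):
    # context: (T, N) array of the last T tokens with +/-1 entries.
    query = context[-1]                          # latest token acts as the query
    scores = context @ (W @ query) / np.sqrt(N)  # attention logits over the window
    attn = np.exp(beta * scores)
    attn /= attn.sum()                           # softmax over the T context slots
    field = attn @ context                       # attention-weighted token average
    return np.sign(field + 1e-12)                # binarize back to +/-1 tokens

context = rng.choice([-1.0, 1.0], size=(T, N))
overlaps = []
for t in range(200):
    token = step(context)
    overlaps.append(token @ context[-1] / N)     # overlap with the previous token
    context = np.vstack([context[1:], token])

print(overlaps[-10:])  # constant: fixed point; repeating: cycle; irregular: possibly chaotic

Tracking a scalar overlap rather than the full N-dimensional state mirrors the mean-field idea that, in a large network, a few order parameters summarize the dynamics.
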
Finally, we discuss how our analytic approach could improve our understanding of
the inner workings of transformer models, potentially reducing computational
training costs and enhancing model interpretability.
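
As a rough illustration of the kind of low-dimensional mean-field description invoked above, the sketch below iterates a hypothetical order-parameter map (softmax attention over feature overlaps followed by a tanh magnetization update, loosely modeled on mean-field Hopfield dynamics) and scans the gain beta. The map itself is an assumption made for illustration, not the equations derived in the paper; the scan shows the kind of diagnostic one would run on the derived dynamics to locate fixed points, cycles, and chaotic regimes.

import numpy as np

def mf_step(m, beta):
    # m: overlaps of the network state with P encoded features.
    a = np.exp(beta * m)
    a /= a.sum()                   # softmax attention across the P features
    return np.tanh(beta * a * m)   # tanh magnetization law, Hopfield-style

for beta in np.linspace(0.5, 4.0, 8):
    m = np.array([0.3, -0.1, 0.05])    # P = 3 features, arbitrary initial overlaps
    for _ in range(500):               # discard the transient
        m = mf_step(m, beta)
    tail = set()
    for _ in range(50):                # record the attractor
        m = mf_step(m, beta)
        tail.add(tuple(np.round(m, 4)))
    print(f"beta={beta:.2f}  distinct states in tail: {len(tail)}")

One tail state indicates a fixed point, a small count a periodic orbit, and a large count irregular (possibly chaotic) motion; applied to the actual mean-field equations, such a scan would trace out the nonequilibrium phase diagram.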