{"title":"Beyond Markov: Transformers, memory, and attention.","authors":"Thomas Parr, Giovanni Pezzulo, Karl Friston","doi":"10.1080/17588928.2025.2484485","DOIUrl":null,"url":null,"abstract":"<p><p>This paper asks what predictive processing models of brain function can learn from the success of transformer architectures. We suggest that the reason transformer architectures have been successful is that they implicitly commit to a non-Markovian generative model - in which we need memory to contextualize our current observations and make predictions about the future. Interestingly, both the notions of working memory in cognitive science and transformer architectures rely heavily upon the concept of attention. We will argue that the move beyond Markov is crucial in the construction of generative models capable of dealing with much of the sequential data - and certainly language - that our brains contend with. We characterize two broad approaches to this problem - deep temporal hierarchies and autoregressive models - with transformers being an example of the latter. Our key conclusions are that transformers benefit heavily from their use of embedding spaces that place strong metric priors on an implicit latent variable and utilize this metric to direct a form of attention that highlights the most relevant, and not only the most recent, previous elements in a sequence to help predict the next.</p>","PeriodicalId":10413,"journal":{"name":"Cognitive Neuroscience","volume":" ","pages":"1-19"},"PeriodicalIF":2.0000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cognitive Neuroscience","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/17588928.2025.2484485","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"NEUROSCIENCES","Score":null,"Total":0}
Citations: 0
Abstract
This paper asks what predictive processing models of brain function can learn from the success of transformer architectures. We suggest that the reason transformer architectures have been successful is that they implicitly commit to a non-Markovian generative model - in which we need memory to contextualize our current observations and make predictions about the future. Interestingly, both the notions of working memory in cognitive science and transformer architectures rely heavily upon the concept of attention. We will argue that the move beyond Markov is crucial in the construction of generative models capable of dealing with much of the sequential data - and certainly language - that our brains contend with. We characterize two broad approaches to this problem - deep temporal hierarchies and autoregressive models - with transformers being an example of the latter. Our key conclusions are that transformers benefit heavily from their use of embedding spaces that place strong metric priors on an implicit latent variable and utilize this metric to direct a form of attention that highlights the most relevant, and not only the most recent, previous elements in a sequence to help predict the next.
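To make the final point concrete, the sketch below shows single-head scaled dot-product attention in Python/NumPy, assuming the standard softmax(QKᵀ/√d)V formulation. It illustrates how a metric on the embedding space (here, dot-product similarity) weights every previous element in the sequence by relevance rather than recency; the omission of learned query/key/value projections and of causal masking is a simplification for illustration, not a detail taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Weight every past element by its relevance (dot-product similarity
    in the embedding space), not by how recently it occurred."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)            # metric-based relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the whole context
    return weights @ values, weights

# Toy context: five token embeddings in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
context = rng.normal(size=(5, 4))   # all previous elements, not just the last one
query = context[-1:]                # the prediction of the next element is conditioned...
output, weights = scaled_dot_product_attention(query, context, context)
print(weights)                      # ...on every prior element, weighted by similarity
```

In contrast, a first-order Markov model would condition the prediction only on `context[-1]`; the attention weights make the dependence on the full history explicit, which is the sense in which the architecture is non-Markovian.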
Journal Description
Cognitive Neuroscience publishes high-quality discussion papers and empirical papers on any topic in the field of cognitive neuroscience, including perception, attention, memory, language, action, social cognition, and executive function. The journal covers findings based on a variety of techniques such as fMRI, ERPs, MEG, TMS, and focal lesion studies. Contributions that employ or discuss multiple techniques to shed light on the spatiotemporal brain mechanisms underlying a cognitive process are encouraged.