Asymptotic theory of in-context learning by linear attention.

IF 9.1 | CAS Tier 1, Multidisciplinary | JCR Q1, MULTIDISCIPLINARY SCIENCES
Yue M Lu, Mary Letey, Jacob A Zavatone-Veth, Anindita Maiti, Cengiz Pehlevan
{"title":"线性注意在情境学习中的渐近理论。","authors":"Yue M Lu, Mary Letey, Jacob A Zavatone-Veth, Anindita Maiti, Cengiz Pehlevan","doi":"10.1073/pnas.2502599122","DOIUrl":null,"url":null,"abstract":"<p><p>Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: in the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine ICL and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.</p>","PeriodicalId":20548,"journal":{"name":"Proceedings of the National Academy of Sciences of the United States of America","volume":"122 28","pages":"e2502599122"},"PeriodicalIF":9.1000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12280938/pdf/","citationCount":"0","resultStr":"{\"title\":\"Asymptotic theory of in-context learning by linear attention.\",\"authors\":\"Yue M Lu, Mary Letey, Jacob A Zavatone-Veth, Anindita Maiti, Cengiz Pehlevan\",\"doi\":\"10.1073/pnas.2502599122\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: in the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine ICL and generalization beyond the scope of pretrained tasks. 
These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.</p>\",\"PeriodicalId\":20548,\"journal\":{\"name\":\"Proceedings of the National Academy of Sciences of the United States of America\",\"volume\":\"122 28\",\"pages\":\"e2502599122\"},\"PeriodicalIF\":9.1000,\"publicationDate\":\"2025-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12280938/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the National Academy of Sciences of the United States of America\",\"FirstCategoryId\":\"103\",\"ListUrlMain\":\"https://doi.org/10.1073/pnas.2502599122\",\"RegionNum\":1,\"RegionCategory\":\"综合性期刊\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/7/9 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the National Academy of Sciences of the United States of America","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1073/pnas.2502599122","RegionNum":1,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/9 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0

Abstract


Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: in the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine ICL and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
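
The abstract describes linear attention solving in-context linear regression: each context consists of input-label pairs generated by a hidden task vector, and the model must predict the label of a query point. The snippet below is a minimal sketch of that setup, not the paper's code: it builds one task and forms a linear-attention-style prediction in which the key, query, and value products are merged into a single matrix `Gamma`. The dimensions, noise level, and the identity placeholder for `Gamma` are illustrative assumptions; in the paper the attention parameters are learned during pretraining over many tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 64, 128          # token dimension and context length (illustrative sizes)
sigma = 0.1             # label noise level (assumed)

# One in-context linear-regression task: a hidden weight vector w,
# n context pairs (x_i, y_i), and a held-out query x_q.
w = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w + sigma * rng.normal(size=n)
x_q = rng.normal(size=d)

# Linear-attention-style readout: with the key/query/value products merged
# into one matrix Gamma, the prediction is a bilinear form in the query and
# the empirical context correlation (1/n) * sum_i y_i x_i.
Gamma = np.eye(d)       # untrained placeholder; learned during pretraining in the paper
y_hat = x_q @ Gamma @ (X.T @ y) / n

print("prediction:", y_hat, " target:", x_q @ w)
```

Even with the untrained `Gamma = I`, the prediction approaches the target as the context grows, since the empirical correlation `(X.T @ y) / n` concentrates around the task vector `w`; the paper's analysis concerns how pretraining shapes this readout in the proportional scaling regime described above.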

Source journal
CiteScore: 19.00
Self-citation rate: 0.90%
Articles published: 3575
Review time: 2.5 months
About the journal: The Proceedings of the National Academy of Sciences (PNAS), a peer-reviewed journal of the National Academy of Sciences (NAS), serves as an authoritative source for high-impact, original research across the biological, physical, and social sciences. With a global scope, the journal welcomes submissions from researchers worldwide, making it an inclusive platform for advancing scientific knowledge.