{"title":"并行性权衡:对数精度变压器的局限性","authors":"William Cooper Merrill, Ashish Sabharwal","doi":"10.1162/tacl_a_00562","DOIUrl":null,"url":null,"abstract":"Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if L≠P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.","PeriodicalId":33559,"journal":{"name":"Transactions of the Association for Computational Linguistics","volume":"11 1","pages":"531-545"},"PeriodicalIF":4.2000,"publicationDate":"2022-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"The Parallelism Tradeoff: Limitations of Log-Precision Transformers\",\"authors\":\"William Cooper Merrill, Ashish Sabharwal\",\"doi\":\"10.1162/tacl_a_00562\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if L≠P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. 
Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.\",\"PeriodicalId\":33559,\"journal\":{\"name\":\"Transactions of the Association for Computational Linguistics\",\"volume\":\"11 1\",\"pages\":\"531-545\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2022-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Transactions of the Association for Computational Linguistics\",\"FirstCategoryId\":\"98\",\"ListUrlMain\":\"https://doi.org/10.1162/tacl_a_00562\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transactions of the Association for Computational Linguistics","FirstCategoryId":"98","ListUrlMain":"https://doi.org/10.1162/tacl_a_00562","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract:
Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if L≠P (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
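Read as a chain of complexity-class containments, the abstract's argument can be sketched as follows (a minimal sketch in LaTeX notation; the precise uniformity conditions and theorem statement are assumptions here, not quoted from the paper):

% Sketch of the claimed containment: log-precision transformers are simulable by
% constant-depth, logspace-uniform threshold circuits (TC^0), which lie inside L.
\[
  \text{log-precision transformers} \;\subseteq\; \text{logspace-uniform } \mathsf{TC}^0 \;\subseteq\; \mathsf{L},
\]
% Consequence: if any P-complete problem were decidable by such a transformer,
% it would lie in L, forcing L = P.
\[
  \text{so if } \mathsf{L} \neq \mathsf{P}, \text{ no log-precision transformer decides a } \mathsf{P}\text{-complete problem.}
\]

The two witness problems named in the abstract (solving linear equalities and membership in an arbitrary context-free grammar with empty productions) are standard P-complete problems, which is why an L ≠ P assumption rules them out for this model class.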
Journal Introduction:
Transactions of the Association for Computational Linguistics is the companion journal to the highly regarded quarterly Computational Linguistics. This open-access journal publishes articles in all areas of natural language processing and is an important resource for academic and industry computational linguists, natural language processing experts, artificial intelligence and machine learning researchers, cognitive scientists, speech specialists, linguists, and philosophers. The journal disseminates work of vital relevance to these professionals on an annual basis.