Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
Ethan Gotlieb Wilcox, Michael Y. Hu, Aaron Mueller, Alex Warstadt, Leshem Choshen, Chengxu Zhuang, Adina Williams, Ryan Cotterell, Tal Linzen
Journal of Memory and Language, Volume 144, Article 104650 (published 2025-05-23). DOI: 10.1016/j.jml.2025.104650
Citations: 0
Abstract
When trained to place high probability on a training corpus, neural network language models can learn a surprising amount about language. Recent work has demonstrated that large performance improvements can arise from simply increasing, i.e., scaling, the size of the corpora they are trained on and the number of parameters in those models. Accordingly, many contemporary systems are trained on trillions of words. While largely beneficial to performance on language applications, scaling has several downsides for both computational psycholinguistics and natural language processing research. We discuss the scientific challenges presented by the scaling paradigm, as well as the benefits that would result from language models that can learn from human-scale data. In the second half of this paper, we report on findings from a recent effort to bring about human-scale language model pretraining: the first iteration of the BabyLM Challenge, a shared task organized by the authors that invited participants to train a language model on 100 million words or less. The challenge produced several concrete best practices for practitioners interested in small-scale language modeling. For cognitive scientists, the challenge demonstrated that robust linguistic generalizations can be learned by models trained on a human-scale dataset, though this is not yet achieved through cognitively plausible mechanisms. Furthermore, it established a population of “BabyLMs” that are all effective at data-efficient language learning. Studying such models can help us identify hypotheses for the computational mechanisms that underlie human language acquisition.
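For readers unfamiliar with the training objective the abstract refers to ("place high probability on a training corpus"), the sketch below illustrates it with a toy bigram model rather than a neural network; the corpus, function names, and numbers are invented for illustration and are not from the paper or the BabyLM Challenge.

```python
# Minimal sketch (not the authors' code): a language model assigns a probability
# to each next word given its context; training aims to make the whole training
# corpus as probable as possible, i.e., to minimize its negative log-likelihood.
import math
from collections import Counter

# Toy "training corpus" of 12 word tokens (illustrative only).
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Estimate P(next word | previous word) from bigram and unigram counts.
bigrams = Counter(zip(corpus, corpus[1:]))
prev_counts = Counter(corpus[:-1])

def prob(prev, nxt):
    return bigrams[(prev, nxt)] / prev_counts[prev]

# Corpus negative log-likelihood: sum of -log P(w_t | w_{t-1}) over the corpus.
# A neural LM is trained by driving this quantity down via gradient descent;
# here we simply evaluate it for the count-based model.
nll = -sum(math.log(prob(p, n)) for p, n in zip(corpus, corpus[1:]))
perplexity = math.exp(nll / (len(corpus) - 1))
print(f"corpus NLL: {nll:.2f}, perplexity: {perplexity:.2f}")
```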
About the journal:
Articles in the Journal of Memory and Language contribute to the formulation of scientific issues and theories in the areas of memory, language comprehension and production, and cognitive processes. Special emphasis is given to research articles that provide new theoretical insights based on a carefully laid empirical foundation. The journal generally favors articles that provide multiple experiments. In addition, significant theoretical papers without new experimental findings may be published.
The Journal of Memory and Language is a valuable tool for cognitive scientists, including psychologists, linguists, and others interested in memory and learning, language, reading, and speech.
Research Areas include:
• Topics that illuminate aspects of memory or language processing
• Linguistics
• Neuropsychology