Authors: Sam Boeve, Louisa Bogaerts
Journal: Behavior Research Methods, 57(9), 266 (Q1, Psychology, Experimental; IF 3.9)
DOI: 10.3758/s13428-025-02774-4
Published: 2025-08-18 (Journal Article)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12361287/pdf/
A systematic evaluation of Dutch large language models' surprisal estimates in sentence, paragraph and book reading.
Studies using computational estimates of word predictability from neural language models have garnered strong evidence in favour of surprisal theory. Upon encountering a word, readers experience a processing difficulty that is a linear function of that word's surprisal. Evidence for this effect has been established either in English or across languages using multilingual models to estimate surprisal. At the same time, many language-specific models of unknown psychometric quality are made openly available. Here, we provide a systematic evaluation of the surprisal estimates of a collection of large language models specifically designed for Dutch, examining how well they account for reading times in corpora of sentence, paragraph and book reading. We compare their performance to multilingual models and an N-gram model. While the models' predictive power for reading times varied considerably across corpora, GPT-2-based models demonstrated superior overall performance. We show that Dutch large language models exhibit the same inverse scaling trend observed for English, with the surprisal estimates of smaller models showing a better fit to reading times than those of the largest models. We also replicate the linear effect of surprisal on reading times for Dutch. Both effects, however, depended on the corpus used for evaluation. Overall, these results offer a psychometric leaderboard of Dutch large language models and challenge the notion of a one-size-fits-all language model for psycholinguistic research. The surprisal estimates derived from all neural language models across the three corpora, along with the code to extract the surprisal, are made publicly available (https://osf.io/wr4qf/).
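To make the central quantity concrete: a word's surprisal is its negative log probability given the preceding context, s(w_t) = -log P(w_t | context). The sketch below is not the authors' code (which is available at the OSF link above); it is a minimal illustration using an add-one-smoothed bigram model, the simplest instance of the N-gram baseline family the paper compares against. The toy Dutch corpus and function names are invented for illustration.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and bigrams for a simple add-one-smoothed bigram model."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    return unigrams, bigrams, vocab_size

def surprisal(prev, word, unigrams, bigrams, vocab_size):
    """Surprisal in bits: -log2 P(word | prev), with Laplace (add-one) smoothing."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

# Toy Dutch corpus (hypothetical example data).
tokens = "de kat zit op de mat de kat slaapt op de mat".split()
uni, bi, v = train_bigram(tokens)

# A frequently seen continuation yields lower surprisal than an unseen one.
print(surprisal("de", "kat", uni, bi, v))   # attested bigram: low surprisal
print(surprisal("de", "zit", uni, bi, v))   # unattested bigram: higher surprisal
```

Under surprisal theory, these per-word values would then enter a regression as a linear predictor of reading times; the paper's contribution is evaluating how well such estimates, from N-gram up to Dutch GPT-2-style models, fit reading-time corpora.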
Journal description:
Behavior Research Methods publishes articles concerned with the methods, techniques, and instrumentation of research in experimental psychology. The journal focuses particularly on the use of computer technology in psychological research. An annual special issue is devoted to this field.