Large language models predict cognition and education close to or better than genomics or expert assessment.
Author: Tobias Wolfram
Journal: Communications Psychology, 3(1), 95
Publication type: Journal Article
Published: 2025-07-03
DOI: https://doi.org/10.1038/s44271-025-00274-x
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12229686/pdf/
Citations: 0
Abstract
Previous research using standard social survey data has emphasized a relative lack of power when predicting educational and psychological outcomes. Leveraging a unique longitudinal dataset, we explore predictability of educational attainment, cognitive abilities, and non-cognitive traits. Integrating various measures of computational linguistics and large language model-based embeddings within a SuperLearner framework trained on short aspirational essays written at age 11, we accurately predict cognition and non-cognitive traits at the same and later age to a similar degree as teacher assessments, and better than genomic data. The same is true for predicting final educational attainment. Combining text, genetic markers, and teacher assessments into an ensemble model, we can predict cognitive ability at close to test-retest reliability of gold-standard tests ($R^2_{\mathrm{Holdout}} = 0.7$) and explain 38% of individual differences in attainment. A sociological model comparable to the baseline of the Fragile Family Challenge replicates the FFC's findings regarding the level of predictability achievable with such data. These findings show that recent advances in large language models and machine learning equip behavioural scientists with tools for prediction of psycho-social features.
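For readers unfamiliar with the holdout metric quoted in the abstract, the following is a minimal sketch of how an out-of-sample R² is conventionally computed. It assumes the standard definition (one minus the ratio of residual to total sum of squares on held-out data); the abstract does not state the exact formula used. The function name `holdout_r2` and all numbers are hypothetical illustrations, not data from the paper.

```python
from statistics import fmean


def holdout_r2(y_true, y_pred):
    """Conventional out-of-sample R^2: 1 - SS_res / SS_tot on held-out data.

    Assumption: this is the textbook definition, not necessarily the exact
    formula used in the paper.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_y = fmean(y_true)
    # Squared error of the model's predictions on the held-out cases.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Squared error of always predicting the held-out mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("held-out outcomes have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out scores and predictions (made-up numbers).
observed = [90.0, 100.0, 110.0, 120.0, 80.0]
predicted = [95.0, 98.0, 108.0, 112.0, 87.0]
print(round(holdout_r2(observed, predicted), 3))
```

Under this definition, a value of 0.7 means the model's squared prediction error on unseen cases is 30% of the error made by always predicting the mean, and "explaining 38% of individual differences" corresponds to a value of 0.38.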