Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests

Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, Pierre Baldi
{"title":"Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests","authors":"Sherif Abdelkarim ,&nbsp;David Lu ,&nbsp;Dora-Luz Flores ,&nbsp;Susanne Jaeggi ,&nbsp;Pierre Baldi","doi":"10.1016/j.chbah.2025.100170","DOIUrl":null,"url":null,"abstract":"<div><div>Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal vs numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ <span><math><mo>≈</mo></math></span> 125 vs visual-IQ <span><math><mo>≈</mo></math></span> 103), and persistent failure on abstract arithmetic (<span><math><mo>≤</mo></math></span> 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks, (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities.</div></div>","PeriodicalId":100324,"journal":{"name":"Computers in Human Behavior: Artificial Humans","volume":"5 ","pages":"Article 100170"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers in Human Behavior: Artificial Humans","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949882125000544","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal vs numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities.
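
The “multi-agent reflection” setup described in the abstract – one actor model answers while other models critique, after which the answer is revised – can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation; the `query_model` helper, the model roles, the prompts, and the single revision round are hypothetical placeholders.

```python
def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for an API call to the named LLM (hypothetical)."""
    raise NotImplementedError

def solve_with_reflection(question: str, actor: str, critics: list[str]) -> str:
    # 1. The actor model proposes an initial answer.
    answer = query_model(actor, f"Solve this IQ-test item:\n{question}")

    # 2. Each critic model reviews the proposed answer.
    critiques = [
        query_model(
            critic,
            f"Question:\n{question}\n\nProposed answer:\n{answer}\n\n"
            "Point out any errors in the reasoning or the final answer.",
        )
        for critic in critics
    ]

    # 3. The actor revises its answer in light of the critiques.
    feedback = "\n\n".join(critiques)
    revised = query_model(
        actor,
        f"Question:\n{question}\n\nYour previous answer:\n{answer}\n\n"
        f"Critiques from other models:\n{feedback}\n\n"
        "Give a revised final answer.",
    )
    return revised
```

Pairing actors and critics of different sizes in a loop like this is what allows the study to ask how actor and critic capacity each contribute to the (modest) gains from reflection.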
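The IQ figures quoted above (e.g., text-IQ ≈ 125) presuppose a mapping from raw test scores onto the human-referenced IQ scale, which is conventionally normed to a mean of 100 and a standard deviation of 15. A minimal sketch of such a deviation-IQ conversion is shown below, assuming human norm statistics (`human_mean`, `human_sd`) are available for the test in question; the paper's exact norming procedure is not specified here, so treat this as an illustrative assumption.

```python
def raw_to_iq(raw_score: float, human_mean: float, human_sd: float) -> float:
    """Map a raw test score onto the deviation-IQ scale (mean 100, SD 15).

    `human_mean` and `human_sd` are the mean and standard deviation of raw
    scores in a human reference sample for the same test (assumed available).
    """
    z = (raw_score - human_mean) / human_sd   # standardize against human norms
    return 100.0 + 15.0 * z                   # rescale to the IQ convention

# Example: a raw score one standard deviation above the human mean
# corresponds to an IQ of 115.
print(raw_to_iq(raw_score=32.0, human_mean=26.0, human_sd=6.0))  # -> 115.0
```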