Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests

Sherif Abdelkarim, David Lu, Dora-Luz Flores, Susanne Jaeggi, Pierre Baldi
{"title":"Evaluating the Intelligence of large language models: A comparative study using verbal and visual IQ tests","authors":"Sherif Abdelkarim ,&nbsp;David Lu ,&nbsp;Dora-Luz Flores ,&nbsp;Susanne Jaeggi ,&nbsp;Pierre Baldi","doi":"10.1016/j.chbah.2025.100170","DOIUrl":null,"url":null,"abstract":"<div><div>Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal vs numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ <span><math><mo>≈</mo></math></span> 125 vs visual-IQ <span><math><mo>≈</mo></math></span> 103), and persistent failure on abstract arithmetic (<span><math><mo>≤</mo></math></span> 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks, (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities.</div></div>","PeriodicalId":100324,"journal":{"name":"Computers in Human Behavior: Artificial Humans","volume":"5 ","pages":"Article 100170"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers in Human Behavior: Artificial Humans","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949882125000544","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large language models (LLMs) excel on many specialized benchmarks, yet their general-reasoning ability remains opaque. We therefore test 18 models – including GPT-4, Claude 3 and Gemini Pro – on a 14-section IQ suite spanning verbal, numerical and visual puzzles and add a “multi-agent reflection” variant in which one model answers while others critique and revise. Results replicate known patterns: a strong bias towards verbal vs numerical reasoning (GPT-4: 79% vs 53% accuracy), a pronounced modality gap (text-IQ ≈ 125 vs visual-IQ ≈ 103), and persistent failure on abstract arithmetic (≤ 20% on missing-number tasks). Scaling lifts mean IQ from 89 (tiny models) to 131 (large models), but gains are non-uniform, and reflection yields only modest extra points for frontier systems. Our contributions include: (1) proposing an evaluation framework for LLM “intelligence” using both verbal and visual IQ tasks; (2) analyzing how multi-agent setups with varying actor and critic sizes affect problem-solving performance; (3) analyzing how model size and multi-modality affect performance across diverse reasoning tasks; and (4) highlighting the value of IQ tests as a standardized, human-referenced benchmark that enables longitudinal comparison of LLMs’ cognitive abilities relative to human norms. We further discuss the limitations of IQ tests as an AI benchmark and outline directions for more comprehensive evaluation of LLM reasoning capabilities.
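
The “multi-agent reflection” setup described in the abstract – one actor model answers while other models critique, after which the answer is revised – can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation; the `query_model` helper, the model roles, the prompts, and the single revision round are hypothetical placeholders.

```python
def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for an API call to the named LLM (hypothetical)."""
    raise NotImplementedError

def solve_with_reflection(question: str, actor: str, critics: list[str]) -> str:
    # 1. The actor model proposes an initial answer.
    answer = query_model(actor, f"Solve this IQ-test item:\n{question}")

    # 2. Each critic model reviews the proposed answer.
    critiques = [
        query_model(
            critic,
            f"Question:\n{question}\n\nProposed answer:\n{answer}\n\n"
            "Point out any errors in the reasoning or the final answer.",
        )
        for critic in critics
    ]

    # 3. The actor revises its answer in light of the critiques.
    feedback = "\n\n".join(critiques)
    revised = query_model(
        actor,
        f"Question:\n{question}\n\nYour previous answer:\n{answer}\n\n"
        f"Critiques from other models:\n{feedback}\n\n"
        "Give a revised final answer.",
    )
    return revised
```

Pairing actors and critics of different sizes in a loop like this is what allows the study to ask how actor and critic capacity each contribute to the (modest) gains from reflection.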
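The IQ figures quoted above (e.g., text-IQ ≈ 125) presuppose a mapping from raw test scores onto the human-referenced IQ scale, which is conventionally normed to a mean of 100 and a standard deviation of 15. A minimal sketch of such a deviation-IQ conversion is shown below, assuming human norm statistics (`human_mean`, `human_sd`) are available for the test in question; the paper's exact norming procedure is not specified here, so treat this as an illustrative assumption.

```python
def raw_to_iq(raw_score: float, human_mean: float, human_sd: float) -> float:
    """Map a raw test score onto the deviation-IQ scale (mean 100, SD 15).

    `human_mean` and `human_sd` are the mean and standard deviation of raw
    scores in a human reference sample for the same test (assumed available).
    """
    z = (raw_score - human_mean) / human_sd   # standardize against human norms
    return 100.0 + 15.0 * z                   # rescale to the IQ convention

# Example: a raw score one standard deviation above the human mean
# corresponds to an IQ of 115.
print(raw_to_iq(raw_score=32.0, human_mean=26.0, human_sd=6.0))  # -> 115.0
```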