Large language models are proficient in solving and creating emotional intelligence tests.

Katja Schlegel, Nils R. Sommer, Marcello Mortillaro
{"title":"Large language models are proficient in solving and creating emotional intelligence tests.","authors":"Katja Schlegel, Nils R Sommer, Marcello Mortillaro","doi":"10.1038/s44271-025-00258-x","DOIUrl":null,"url":null,"abstract":"<p><p>Large Language Models (LLMs) demonstrate expertise across diverse domains, yet their capacity for emotional intelligence remains uncertain. This research examined whether LLMs can solve and generate performance-based emotional intelligence tests. Results showed that ChatGPT-4, ChatGPT-o1, Gemini 1.5 flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3 outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 81%, compared to the 56% human average reported in the original validation studies. In a second step, ChatGPT-4 generated new test items for each emotional intelligence test. These new versions and the original tests were administered to human participants across five studies (total N = 467). Overall, original and ChatGPT-generated tests demonstrated statistically equivalent test difficulty. Perceived item clarity and realism, item content diversity, internal consistency, correlations with a vocabulary test, and correlations with an external ability emotional intelligence test were not statistically equivalent between original and ChatGPT-generated tests. However, all differences were smaller than Cohen's d ± 0.25, and none of the 95% confidence interval boundaries exceeded a medium effect size (d ± 0.50). Additionally, original and ChatGPT-generated tests were strongly correlated (r = 0.46). These findings suggest that LLMs can generate responses that are consistent with accurate knowledge about human emotions and their regulation.</p>","PeriodicalId":501698,"journal":{"name":"Communications Psychology","volume":"3 1","pages":"80"},"PeriodicalIF":0.0000,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12095572/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communications Psychology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1038/s44271-025-00258-x","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large Language Models (LLMs) demonstrate expertise across diverse domains, yet their capacity for emotional intelligence remains uncertain. This research examined whether LLMs can solve and generate performance-based emotional intelligence tests. Results showed that ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3 outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 81%, compared to the 56% human average reported in the original validation studies. In a second step, ChatGPT-4 generated new test items for each emotional intelligence test. These new versions and the original tests were administered to human participants across five studies (total N = 467). Overall, original and ChatGPT-generated tests demonstrated statistically equivalent test difficulty. Perceived item clarity and realism, item content diversity, internal consistency, correlations with a vocabulary test, and correlations with an external ability emotional intelligence test were not statistically equivalent between original and ChatGPT-generated tests. However, all differences were smaller than Cohen's d ± 0.25, and none of the 95% confidence interval boundaries exceeded a medium effect size (d ± 0.50). Additionally, original and ChatGPT-generated tests were strongly correlated (r = 0.46). These findings suggest that LLMs can generate responses that are consistent with accurate knowledge about human emotions and their regulation.
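The abstract reports that test difficulty was compared using equivalence bounds of Cohen's d ± 0.25. As a rough illustration of that kind of analysis (not the authors' code), the sketch below runs a two one-sided tests (TOST) equivalence check on two hypothetical score samples; the function name, the simulated data, and the conversion of the d bound into raw-score units are assumptions made for illustration only.

```python
# Minimal sketch of a TOST equivalence test with bounds given in Cohen's d units.
# Illustrative only; data and function name are hypothetical, not from the paper.
import numpy as np
from scipy import stats

def tost_equivalence(x, y, d_bound=0.25, alpha=0.05):
    """Independent-samples TOST with equivalence bounds expressed as Cohen's d."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    mean_diff = x.mean() - y.mean()
    # Pooled standard deviation (assumes roughly equal variances).
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    se = sp * np.sqrt(1 / nx + 1 / ny)
    df = nx + ny - 2
    delta = d_bound * sp              # convert the d bound into raw-score units
    # One-sided tests against the lower (-delta) and upper (+delta) bounds.
    t_lower = (mean_diff + delta) / se
    t_upper = (mean_diff - delta) / se
    p_lower = stats.t.sf(t_lower, df)   # H1: difference > -delta
    p_upper = stats.t.cdf(t_upper, df)  # H1: difference < +delta
    p_tost = max(p_lower, p_upper)
    return {"d_observed": mean_diff / sp, "p_tost": p_tost,
            "equivalent": p_tost < alpha}

# Hypothetical example: proportion-correct scores on an original vs. a generated test form.
rng = np.random.default_rng(0)
original = rng.normal(0.56, 0.12, 100)
generated = rng.normal(0.57, 0.12, 100)
print(tost_equivalence(original, generated, d_bound=0.25))
```

Both one-sided p-values must fall below alpha before the two test forms would be declared equivalent within the stated bound; a nonsignificant standard t-test alone would not justify that conclusion.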
