Large language models are proficient in solving and creating emotional intelligence tests.

Katja Schlegel, Nils R. Sommer, Marcello Mortillaro
{"title":"Large language models are proficient in solving and creating emotional intelligence tests.","authors":"Katja Schlegel, Nils R Sommer, Marcello Mortillaro","doi":"10.1038/s44271-025-00258-x","DOIUrl":null,"url":null,"abstract":"<p><p>Large Language Models (LLMs) demonstrate expertise across diverse domains, yet their capacity for emotional intelligence remains uncertain. This research examined whether LLMs can solve and generate performance-based emotional intelligence tests. Results showed that ChatGPT-4, ChatGPT-o1, Gemini 1.5 flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3 outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 81%, compared to the 56% human average reported in the original validation studies. In a second step, ChatGPT-4 generated new test items for each emotional intelligence test. These new versions and the original tests were administered to human participants across five studies (total N = 467). Overall, original and ChatGPT-generated tests demonstrated statistically equivalent test difficulty. Perceived item clarity and realism, item content diversity, internal consistency, correlations with a vocabulary test, and correlations with an external ability emotional intelligence test were not statistically equivalent between original and ChatGPT-generated tests. However, all differences were smaller than Cohen's d ± 0.25, and none of the 95% confidence interval boundaries exceeded a medium effect size (d ± 0.50). Additionally, original and ChatGPT-generated tests were strongly correlated (r = 0.46). These findings suggest that LLMs can generate responses that are consistent with accurate knowledge about human emotions and their regulation.</p>","PeriodicalId":501698,"journal":{"name":"Communications Psychology","volume":"3 1","pages":"80"},"PeriodicalIF":0.0000,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12095572/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Communications Psychology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1038/s44271-025-00258-x","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large Language Models (LLMs) demonstrate expertise across diverse domains, yet their capacity for emotional intelligence remains uncertain. This research examined whether LLMs can solve and generate performance-based emotional intelligence tests. Results showed that ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3 outperformed humans on five standard emotional intelligence tests, achieving an average accuracy of 81%, compared to the 56% human average reported in the original validation studies. In a second step, ChatGPT-4 generated new test items for each emotional intelligence test. These new versions and the original tests were administered to human participants across five studies (total N = 467). Overall, original and ChatGPT-generated tests demonstrated statistically equivalent test difficulty. Perceived item clarity and realism, item content diversity, internal consistency, correlations with a vocabulary test, and correlations with an external ability emotional intelligence test were not statistically equivalent between original and ChatGPT-generated tests. However, all differences were smaller than Cohen's d ± 0.25, and none of the 95% confidence interval boundaries exceeded a medium effect size (d ± 0.50). Additionally, original and ChatGPT-generated tests were strongly correlated (r = 0.46). These findings suggest that LLMs can generate responses that are consistent with accurate knowledge about human emotions and their regulation.
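The abstract reports that test difficulty was compared using equivalence bounds of Cohen's d ± 0.25. As a rough illustration of that kind of analysis (not the authors' code), the sketch below runs a two one-sided tests (TOST) equivalence check on two hypothetical score samples; the function name, the simulated data, and the conversion of the d bound into raw-score units are assumptions made for illustration only.

```python
# Minimal sketch of a TOST equivalence test with bounds given in Cohen's d units.
# Illustrative only; data and function name are hypothetical, not from the paper.
import numpy as np
from scipy import stats

def tost_equivalence(x, y, d_bound=0.25, alpha=0.05):
    """Independent-samples TOST with equivalence bounds expressed as Cohen's d."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    mean_diff = x.mean() - y.mean()
    # Pooled standard deviation (assumes roughly equal variances).
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    se = sp * np.sqrt(1 / nx + 1 / ny)
    df = nx + ny - 2
    delta = d_bound * sp              # convert the d bound into raw-score units
    # One-sided tests against the lower (-delta) and upper (+delta) bounds.
    t_lower = (mean_diff + delta) / se
    t_upper = (mean_diff - delta) / se
    p_lower = stats.t.sf(t_lower, df)   # H1: difference > -delta
    p_upper = stats.t.cdf(t_upper, df)  # H1: difference < +delta
    p_tost = max(p_lower, p_upper)
    return {"d_observed": mean_diff / sp, "p_tost": p_tost,
            "equivalent": p_tost < alpha}

# Hypothetical example: proportion-correct scores on an original vs. a generated test form.
rng = np.random.default_rng(0)
original = rng.normal(0.56, 0.12, 100)
generated = rng.normal(0.57, 0.12, 100)
print(tost_equivalence(original, generated, d_bound=0.25))
```

Both one-sided p-values must fall below alpha before the two test forms would be declared equivalent within the stated bound; a nonsignificant standard t-test alone would not justify that conclusion.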
