High variability in LLMs’ analogical reasoning

IF 21.4 · CAS Zone 1 (Psychology) · JCR Q1 (Multidisciplinary Sciences)
Andrea Gregor de Varda, Chiara Saponaro, Marco Marelli
DOI: 10.1038/s41562-025-02224-3 · Nature Human Behaviour · Published 4 June 2025 · Journal Article

Abstract

arising from T. Webb et al. Nature Human Behaviour https://doi.org/10.1038/s41562-023-01659-w (2023)

In a recent study, Webb, Holyoak and Lu1 (henceforth WHL) demonstrated that a large language model (GPT-3, text-davinci-003) could match or even exceed human performance across several analogical reasoning tasks. This result led to the compelling conclusion that LLMs such as GPT-3 possess an emergent ability to reason by analogy. However, the findings were based on a single, proprietary model for which the releasing company provided limited public details and progressively restricted access to the internal probability distributions. Furthermore, text-davinci-003 was deprecated on 4 January 2024, and is no longer available through the OpenAI API. This poses a challenge to replicability in two ways. First, the lack of open access to the model and its recent deprecation make it difficult—if not impossible—for other researchers to verify or build upon the findings. Second, relying on a single model leaves open the question of whether the results can be extended to LLMs as a broader class of objects of scientific investigation. Without testing a diverse range of models, it is unclear whether the observed behaviours are specific to GPT-3 or represent a general property of comparable contemporary LLMs. Replicating experimental results based on proprietary models with public alternatives is thus crucial to ensure that the findings can be reproduced in the future2, generalized to new model instances, and, more generally, to adhere to transparency principles that are of paramount importance in scientific research3.

