High variability in LLMs’ analogical reasoning

IF 21.4 · CAS Zone 1 (Psychology) · JCR Q1 (Multidisciplinary Sciences)
Andrea Gregor de Varda, Chiara Saponaro, Marco Marelli
DOI: 10.1038/s41562-025-02224-3 · Nature Human Behaviour · Published 4 June 2025 · Journal Article

Abstract

arising from T. Webb et al. Nature Human Behaviour https://doi.org/10.1038/s41562-023-01659-w (2023)

In a recent study, Webb, Holyoak and Lu1 (henceforth WHL) demonstrated that a large language model (GPT-3, text-davinci-003) could match or even exceed human performance across several analogical reasoning tasks. This result led to the compelling conclusion that LLMs such as GPT-3 possess an emergent ability to reason by analogy. However, the findings were based on a single, proprietary model for which the releasing company provided limited public details and progressively restricted access to the internal probability distributions. Furthermore, text-davinci-003 was deprecated on 4 January 2024, and is no longer available through the OpenAI API. This poses a challenge to replicability in two ways. First, the lack of open access to the model and its recent deprecation make it difficult—if not impossible—for other researchers to verify or build upon the findings. Second, relying on a single model leaves open the question of whether the results can be extended to LLMs as a broader class of objects of scientific investigation. Without testing a diverse range of models, it is unclear whether the observed behaviours are specific to GPT-3 or represent a general property of comparable contemporary LLMs. Replicating experimental results based on proprietary models with public alternatives is thus crucial to ensure that the findings can be reproduced in the future2, generalized to new model instances, and, more generally, to adhere to transparency principles that are of paramount importance in scientific research3.

