Evaluating the o1 reasoning large language model for cognitive bias: a vignette study

IF 9.3, Zone 1 (Medicine), Q1 (Critical Care Medicine)

Critical Care Pub Date : 2025-08-21 DOI:10.1186/s13054-025-05591-5

Or Degany, Sahar Laros, Daphna Idan, Sharon Einav


Citations: 0

Abstract


Cognitive biases, systematic deviations from logical judgment, are well documented in clinical decision-making, particularly in settings characterized by high decision load, limited time, and diagnostic uncertainty, such as critical care. Prior work demonstrated that large language models, particularly GPT-4, reproduce many of these biases, sometimes to a greater extent than human clinicians. We tested whether the o1 model (o1-2024-12-17), a newly released AI system with enhanced reasoning capabilities, is susceptible to cognitive biases that commonly affect medical decision-making. Following the methodology established by Wang and Redelmeier [15], we used ten pairs of clinical scenarios (vignettes), each designed to test a specific cognitive bias known to influence clinicians. Each scenario had two versions that differed by subtle modifications designed to trigger the bias (such as presenting mortality rates versus survival rates). The o1 model generated 90 independent clinical recommendations for each scenario version, totalling 1,800 responses. We measured cognitive bias as a systematic difference in recommendation rates between the paired versions, which should not occur with unbiased reasoning. The o1 model's performance was compared against previously published results from both the GPT-4 model and historical human clinician studies. The o1 model showed no measurable cognitive bias in seven of the ten vignettes. In two vignettes, the o1 model showed significant bias, but its absolute magnitude was lower than values previously reported for GPT-4 and human clinicians. In a single vignette, Occam's razor, the o1 model exhibited consistent bias. Therefore, although bias appears less frequent overall with the reasoning model than with GPT-4, it was worse in one vignette. The model was more prone to bias in vignettes that included a gap-closing cue, seemingly resolving the clinical uncertainty. Across eight vignette versions, intra-scenario agreement exceeded 94%, indicating lower decision variability than previously described with GPT-4 and human clinicians. Reasoning models may reduce cognitive bias and random variation in judgment (i.e., "noise"). However, our findings caution that reasoning models are still not entirely immune to cognitive bias. These findings suggest that reasoning models may impart some benefits as decision-support tools in medicine, but they also imply a need to explore further the circumstances in which these tools may fail.
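The abstract defines bias as a difference in recommendation rates between the two versions of a vignette, and reports intra-scenario agreement as a measure of noise. A minimal sketch of both quantities follows. The abstract does not say which statistical test the authors used, so the pooled two-proportion z-test here is an assumption chosen for illustration, and the counts in the usage lines are hypothetical, not study data.

```python
import math

N_PER_VERSION = 90  # responses per scenario version, as stated in the abstract


def rate_difference(k_a: int, k_b: int, n: int = N_PER_VERSION) -> float:
    """Difference in recommendation rates between version A and version B."""
    return k_a / n - k_b / n


def two_proportion_p_value(k_a: int, k_b: int, n: int = N_PER_VERSION) -> float:
    """Two-sided p-value from a pooled two-proportion z-test.

    Assumption: this test is an illustrative choice, not the study's method.
    """
    pooled = (k_a + k_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        return 1.0  # both versions unanimous and identical: no difference
    z = (k_a / n - k_b / n) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def agreement(k: int, n: int = N_PER_VERSION) -> float:
    """Share of responses matching the majority answer within one version."""
    return max(k, n - k) / n


# Hypothetical counts of "recommend the intervention" out of 90 responses each.
diff = rate_difference(60, 30)
p = two_proportion_p_value(60, 30)
print(f"rate difference = {diff:.3f}, p = {p:.2g}")
print(f"agreement with 85 of 90 concordant = {agreement(85):.3f}")
```

Under unbiased reasoning the two versions should yield similar rates, so the difference should sit near zero. An agreement value close to 1 for a version means the model rarely changed its answer across repeated runs.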


Source journal

Critical Care (Medicine: Critical Care Medicine)

CiteScore: 20.60

Self-citation rate: 3.30%

Articles published: 348

Review time: 1.5 months

About the journal: Critical Care is an international, peer-reviewed medical journal. Its primary objective is to improve the care of critically ill patients. To that end, the journal gathers, exchanges, disseminates, and endorses evidence-based information that is highly relevant to intensivists, aiming to provide a thorough and inclusive examination of the intensive care field.