Human–AI collectives most accurately diagnose clinical vignettes

IF 9.1 1区综合性期刊 Q1 MULTIDISCIPLINARY SCIENCES

Proceedings of the National Academy of Sciences of the United States of America Pub Date : 2025-06-13 DOI:10.1073/pnas.2426153122

Nikolas Zöller, Julian Berger, Irving Lin, Nathan Fu, Jayanth Komarneni, Gioele Barabucci, Kyle Laskowski, Victor Shia, Benjamin Harack, Eugene A. Chu, Vito Trianni, Ralf H. J. M. Kurvers, Stefan M. Herzog

{"title":"Human–AI collectives most accurately diagnose clinical vignettes","authors":"Nikolas Zöller, Julian Berger, Irving Lin, Nathan Fu, Jayanth Komarneni, Gioele Barabucci, Kyle Laskowski, Victor Shia, Benjamin Harack, Eugene A. Chu, Vito Trianni, Ralf H. J. M. Kurvers, Stefan M. Herzog","doi":"10.1073/pnas.2426153122","DOIUrl":null,"url":null,"abstract":"AI systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased—shortcomings that may reflect LLMs’ inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here, we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the art LLMs across 2,133 text-based medical case vignettes. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience and can be attributed to humans’ and LLMs’ complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.","PeriodicalId":20548,"journal":{"name":"Proceedings of the National Academy of Sciences of the United States of America","volume":"591 1","pages":""},"PeriodicalIF":9.1000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the National Academy of Sciences of the United States of America","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1073/pnas.2426153122","RegionNum":1,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}

引用次数: 0

Abstract

AI systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased—shortcomings that may reflect LLMs’ inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here, we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the art LLMs across 2,133 text-based medical case vignettes. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience and can be attributed to humans’ and LLMs’ complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.

查看原文本刊更多论文

人类-人工智能集体最准确地诊断临床小插曲

人工智能系统，特别是大型语言模型（llm），越来越多地被用于影响个人和整个社会的高风险决策，通常没有足够的保障措施来确保安全、质量和公平。然而，法学硕士会产生幻觉，缺乏常识，并且有偏见——这些缺点可能反映了法学硕士固有的局限性，因此可能无法通过更复杂的架构、更多的数据或更多的人类反馈来弥补。因此，仅仅依靠法学硕士来做复杂、高风险的决策是有问题的。在这里，我们提出了一个混合集体智慧系统，通过利用人类经验的互补优势和法学硕士处理的大量信息来减轻这些风险。我们将我们的方法应用于开放式医学诊断，将医生做出的40,762例鉴别诊断与五个最先进的法学硕士在2,133个基于文本的医学病例小片段中的诊断相结合。我们表明，医生和法学硕士的混合集体优于单一医生和医生集体，以及单一法学硕士和法学硕士组合。这一结果适用于一系列医学专业和专业经验，可以归因于人类和法学硕士的互补贡献，导致不同类型的错误。我们的方法强调了人类和机器集体智能在提高医疗诊断等复杂开放式领域准确性方面的潜力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the National Academy of Sciences of the United States of America 综合性期刊-综合性期刊

CiteScore

19.00

自引率

0.90%

发文量

3575

审稿时长

2.5 months

期刊介绍： The Proceedings of the National Academy of Sciences (PNAS), a peer-reviewed journal of the National Academy of Sciences (NAS), serves as an authoritative source for high-impact, original research across the biological, physical, and social sciences. With a global scope, the journal welcomes submissions from researchers worldwide, making it an inclusive platform for advancing scientific knowledge.