Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE.

IF: 7.7
PLOS Digital Health · Pub Date: 2025-10-09 · eCollection Date: 2025-10-01 · DOI: 10.1371/journal.pdig.0000787
Yahya Shaikh, Zainab Asiyah Jeelani-Shaikh, Muzamillah Mushtaq Jeelani, Aamir Javaid, Tauhid Mahmud, Shiv Gaglani, Michael Christopher Gibbons, Minahil Cheema, Amanda Cross, Denisa Livingston, Morgan Cheatham, Elahe Nezami, Ronald Dixon, Ashwini Niranjan-Azadi, Saad Zafar, Zishan Siddiqui

Abstract

The stochastic nature of next-token generation, and the resulting response variability in Large Language Model (LLM) outputs, poses challenges for ensuring consistency and accuracy on knowledge assessments. This study introduces a novel multi-agent framework, referred to as a "Council of AIs", to enhance LLM performance through collaborative decision-making. The Council consists of multiple GPT-4 instances that iteratively discuss and reach consensus on answers, facilitated by a designated "Facilitator AI." This methodology was applied to 325 United States Medical Licensing Exam (USMLE) questions across all three exam stages: Step 1, focusing on biomedical sciences; Step 2, evaluating clinical knowledge (CK); and Step 3, evaluating readiness for independent medical practice. The Council reached consensus answers that were correct 97%, 93%, and 94% of the time for Step 1, Step 2 CK, and Step 3, respectively, outperforming single-instance GPT-4 models. When the initial responses were not unanimous, Council deliberation reached the correct answer 83% of the time, and the Council corrected over half (53%) of the responses that a majority vote had gotten wrong. The odds of a majority-vote response changing from incorrect to correct after discussion were 5 (95% CI: 1.1, 22.8) times the odds of changing from correct to incorrect. This study provides the first evidence that the semantic entropy of the response space can consistently be reduced to zero, demonstrated here through Council deliberation; other mechanisms may achieve the same outcome. The study revealed that in a Council model, response variability, often considered a limitation, can be transformed into a strength that supports adaptive reasoning and collaborative refinement of answers.
These findings suggest new paradigms for AI implementation and reveal the heightened strength that emerges when AIs collaborate as a collective rather than operate alone.
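The deliberation protocol the abstract describes — poll every instance, stop on unanimity, otherwise have a facilitator feed the disagreement back for another round, and fall back to majority vote if unanimity is never reached — can be sketched as below. This is a minimal illustration, not the paper's implementation: `council_deliberate`, the agent call signature, and the round cap are all assumptions, and each agent is a plain callable standing in for a GPT-4 instance.

```python
from collections import Counter

def council_deliberate(question, agents, max_rounds=3):
    """Run a Council-style consensus loop over multiple model instances.

    `agents` are callables (question, transcript) -> answer choice; in the
    study each would wrap a GPT-4 instance, mocked here for illustration.
    Returns the consensus (or fallback majority) answer and the number of
    rounds used.
    """
    transcript = []  # what the Facilitator would feed back each round
    tally = Counter()
    for round_num in range(1, max_rounds + 1):
        answers = [agent(question, transcript) for agent in agents]
        tally = Counter(answers)
        if len(tally) == 1:
            # Unanimity: consensus reached, deliberation stops.
            return answers[0], round_num
        # Facilitator role: share the vote split and prompt reconsideration.
        transcript.append(dict(tally))
    # No unanimity within max_rounds: fall back to a plain majority vote.
    return tally.most_common(1)[0][0], max_rounds
```

With stub agents, an initially split council that converges after seeing the vote split returns a consensus on the second round, while a unanimous council stops after one.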
