Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE.

Impact Factor (IF): 7.7

PLOS digital health. Pub Date: 2025-10-09. eCollection Date: 2025-10-01. DOI: 10.1371/journal.pdig.0000787

Yahya Shaikh, Zainab Asiyah Jeelani-Shaikh, Muzamillah Mushtaq Jeelani, Aamir Javaid, Tauhid Mahmud, Shiv Gaglani, Michael Christopher Gibbons, Minahil Cheema, Amanda Cross, Denisa Livingston, Morgan Cheatham, Elahe Nezami, Ronald Dixon, Ashwini Niranjan-Azadi, Saad Zafar, Zishan Siddiqui

PLOS digital health, volume 4, issue 10, e0000787. Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12510544/pdf/

Citation count: 0

Abstract

The stochastic nature of next-token generation and the resulting response variability in Large Language Model (LLM) outputs pose challenges for ensuring consistency and accuracy on knowledge assessments. This study introduces a novel multi-agent framework, referred to as a "Council of AIs", to enhance LLM performance through collaborative decision-making. The Council consists of multiple GPT-4 instances that iteratively discuss answers and reach consensus, facilitated by a designated "Facilitator AI." This methodology was applied to 325 United States Medical Licensing Exam (USMLE) questions across all three exam stages: Step 1, focusing on biomedical sciences; Step 2, evaluating clinical knowledge (CK); and Step 3, evaluating readiness for independent medical practice. The Council reached consensus answers that were correct 97%, 93%, and 94% of the time for Step 1, Step 2 CK, and Step 3, respectively, outperforming single-instance GPT-4 models. In cases where the initial response was not unanimous, Council deliberation reached a consensus that was the correct answer 83% of the time, and the Council corrected over half (53%) of the responses that majority vote had gotten incorrect. After discussion, the odds of a majority-vote response changing from incorrect to correct were 5 (95% CI: 1.1, 22.8) times higher than the odds of changing from correct to incorrect. This study provides the first evidence that the semantic entropy of the response space can consistently be reduced to zero, demonstrated here through Council deliberation, which suggests that other mechanisms might achieve the same outcome. The study also shows that in a Council model, response variability, often considered a limitation, can be transformed into a strength that supports adaptive reasoning and collaborative refinement of answers. These findings suggest new paradigms for AI implementation and reveal the heightened strength that emerges when AIs begin to collaborate as a collective rather than operate alone.
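The deliberation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the prompts, the number of instances, or the Facilitator logic, so the agents here are hypothetical stand-in functions rather than GPT-4 calls, and the loop in `deliberate` only loosely plays the Facilitator's role by relaying the previous round's answers. It also assumes that for multiple-choice questions each answer letter is its own meaning class, so the semantic entropy reduces to the Shannon entropy of the answer distribution, which is zero exactly when the council is unanimous.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical agent interface: (question, peers' previous answers) -> answer letter.
Agent = Callable[[str, List[str]], str]


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of answers."""
    counts = Counter(answers)
    n = len(answers)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def majority_vote(answers: Sequence[str]) -> str:
    """Most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def deliberate(question: str, agents: Sequence[Agent],
               max_rounds: int = 5) -> Tuple[List[str], int]:
    """Re-poll all agents, showing them the last round, until unanimous."""
    answers = [agent(question, []) for agent in agents]
    rounds = 0
    while len(set(answers)) > 1 and rounds < max_rounds:
        answers = [agent(question, answers) for agent in agents]
        rounds += 1
    return answers, rounds


def make_fixed(choice: str) -> Agent:
    """Stand-in agent that never changes its answer."""
    def agent(question: str, peer_answers: List[str]) -> str:
        return choice
    return agent


def make_follower(initial: str) -> Agent:
    """Stand-in agent that adopts the majority view once it sees peers' answers."""
    def agent(question: str, peer_answers: List[str]) -> str:
        return majority_vote(peer_answers) if peer_answers else initial
    return agent


if __name__ == "__main__":
    council = [make_fixed("B"), make_fixed("B"), make_follower("A")]
    first_round = [agent("example question", []) for agent in council]
    final, rounds = deliberate("example question", council)
    print("initial:", first_round, "entropy:", answer_entropy(first_round))
    print("final:", final, "entropy:", answer_entropy(final), "rounds:", rounds)
```

In this toy setup the entropy is positive while the agents disagree and falls to zero once they agree. Whether real LLM instances converge, and on the correct answer, is an empirical question that the study addresses and this sketch does not.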



