Yahya Shaikh, Zainab Asiyah Jeelani-Shaikh, Muzamillah Mushtaq Jeelani, Aamir Javaid, Tauhid Mahmud, Shiv Gaglani, Michael Christopher Gibbons, Minahil Cheema, Amanda Cross, Denisa Livingston, Morgan Cheatham, Elahe Nezami, Ronald Dixon, Ashwini Niranjan-Azadi, Saad Zafar, Zishan Siddiqui
Title: Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE.
Journal: PLOS digital health, 4(10), e0000787. Published 2025-10-09.
DOI: https://doi.org/10.1371/journal.pdig.0000787
Abstract
The stochastic nature of next-token generation, and the resulting variability in Large Language Model (LLM) outputs, poses challenges for ensuring consistency and accuracy on knowledge assessments. This study introduces a novel multi-agent framework, referred to as a "Council of AIs", to enhance LLM performance through collaborative decision-making. The Council consists of multiple GPT-4 instances that iteratively discuss answers and reach consensus, facilitated by a designated "Facilitator AI." This methodology was applied to 325 United States Medical Licensing Exam (USMLE) questions across all three exam stages: Step 1, focusing on biomedical sciences; Step 2, evaluating clinical knowledge (CK); and Step 3, evaluating readiness for independent medical practice. The Council reached consensus answers that were correct 97%, 93%, and 94% of the time for Step 1, Step 2 CK, and Step 3, respectively, outperforming single-instance GPT-4 models. In cases where the initial response was not unanimous, Council deliberation reached a consensus that was the correct answer 83% of the time, and the Council corrected over half (53%) of the responses that majority vote had gotten incorrect. After discussion, the odds of a majority-vote response changing from incorrect to correct were 5 times higher (95% CI: 1.1, 22.8) than the odds of changing from correct to incorrect. This study provides the first evidence that the semantic entropy of the response space can consistently be reduced to zero, demonstrated here through Council deliberation, which suggests that other mechanisms might achieve the same outcome. This study revealed that in a Council model, response variability, often considered a limitation, can be transformed into a strength that supports adaptive reasoning and collaborative refinement of answers. 
These findings suggest new paradigms for AI implementation and reveal the heightened strength that emerges when AIs begin to collaborate as a collective rather than operate alone.
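
The deliberation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the number of Council members, the prompts, or the Facilitator's logic, so `facilitate`, `make_stub`, `answer_entropy`, and the `max_rounds` limit are all hypothetical names and choices. The Council members here are deterministic stubs rather than GPT-4 calls, and the entropy is plain Shannon entropy over answer letters, used as a simple proxy for semantic entropy on multiple-choice questions.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical member interface: (question, facilitator summary) -> answer letter.
# In a real system this would wrap an LLM call; here it is a stub.
Member = Callable[[str, str], str]


def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy (bits) of the answer distribution; 0.0 means unanimity."""
    n = len(answers)
    return sum((c / n) * math.log2(n / c) for c in Counter(answers).values())


def facilitate(
    question: str, members: List[Member], max_rounds: int = 5
) -> Tuple[Optional[str], List[float]]:
    """Run discussion rounds until all members agree.

    Returns the consensus answer (or None if max_rounds is exhausted)
    and the entropy of the answers recorded at each round.
    """
    summary = ""  # empty summary marks the first, independent round
    entropies: List[float] = []
    for _ in range(max_rounds):
        answers = [member(question, summary) for member in members]
        entropies.append(answer_entropy(answers))
        if len(set(answers)) == 1:
            return answers[0], entropies
        # Assumption: the facilitator only reports a tally such as "A: 1, B: 2".
        tally = Counter(answers)
        summary = ", ".join(f"{a}: {k}" for a, k in sorted(tally.items()))
    return None, entropies


def make_stub(initial: str) -> Member:
    """Toy member: gives `initial` first, then adopts the most-supported answer."""

    def member(question: str, summary: str) -> str:
        if not summary:
            return initial
        pairs = [item.split(": ") for item in summary.split(", ")]
        # Ties resolve to the alphabetically first answer (tally is sorted).
        return max(pairs, key=lambda p: int(p[1]))[0]

    return member


council = [make_stub("A"), make_stub("B"), make_stub("B")]
answer, entropies = facilitate("toy question", council)
print(answer, entropies)
```

With these stubs the Council disagrees in the first round and is unanimous in the second, so the recorded entropy ends at 0.0. The stubs simply follow the plurality, which is only there to exercise the loop. The study reports that deliberation sometimes overturned an incorrect majority, which this toy rule cannot reproduce.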