{"title":"公开发布的大语言模型在内科委员会式问题上的表现。","authors":"Constantine Tarabanis, Sohail Zahid, Marios Mamalis, Kevin Zhang, Evangelos Kalampokis, Lior Jankelson","doi":"10.1371/journal.pdig.0000604","DOIUrl":null,"url":null,"abstract":"<p><p>Ongoing research attempts to benchmark large language models (LLM) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions. Limited data exists on how knowledge supplied to the models, derived from medical texts improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program released by the American College of Physicians with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Mode inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanation to an IM board-certified physician tasked with selecting the human generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7% outperforming GPT-3.5, human respondents, LaMDA and Llama 2 in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject with its highest and lowest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There is a 3.2-5.3% decrease in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot. There is 4.5-7.5% increase in performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions outperforming human respondents. Augmenting with domain-specific information improved performance rendering Retrieval Augmented Generation a possible technique for improving accuracy in medical examination LLM responses.</p>","PeriodicalId":74465,"journal":{"name":"PLOS digital health","volume":"3 9","pages":"e0000604"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11407633/pdf/","citationCount":"0","resultStr":"{\"title\":\"Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions.\",\"authors\":\"Constantine Tarabanis, Sohail Zahid, Marios Mamalis, Kevin Zhang, Evangelos Kalampokis, Lior Jankelson\",\"doi\":\"10.1371/journal.pdig.0000604\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Ongoing research attempts to benchmark large language models (LLM) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions. Limited data exists on how knowledge supplied to the models, derived from medical texts improves LLM performance. 
The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program released by the American College of Physicians with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Mode inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanation to an IM board-certified physician tasked with selecting the human generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7% outperforming GPT-3.5, human respondents, LaMDA and Llama 2 in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject with its highest and lowest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There is a 3.2-5.3% decrease in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot. There is 4.5-7.5% increase in performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions outperforming human respondents. Augmenting with domain-specific information improved performance rendering Retrieval Augmented Generation a possible technique for improving accuracy in medical examination LLM responses.</p>\",\"PeriodicalId\":74465,\"journal\":{\"name\":\"PLOS digital health\",\"volume\":\"3 9\",\"pages\":\"e0000604\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11407633/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"PLOS digital health\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1371/journal.pdig.0000604\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/9/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLOS digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1371/journal.pdig.0000604","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions.
Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on how knowledge derived from medical texts and supplied to the models improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA, and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion, alongside the MKSAP explanations, to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA, and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There was a 3.2-5.3% decrease in the performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot, and a 4.5-7.5% increase in the performance of both models accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting model inputs with domain-specific information improved performance, rendering Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions.
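
The abstract describes augmenting model inputs with passages from a reference textbook via Retrieval Augmented Generation before querying each LLM. The sketch below is a minimal, illustrative outline of that general workflow, not the authors' code: it uses a naive keyword-overlap retriever over text chunks, and query_llm, harrisons_excerpt.txt, and the sample question are hypothetical placeholders.

# Minimal sketch of a RAG-style prompt-augmentation workflow (illustrative assumptions only).

def split_into_chunks(reference_text: str, chunk_size: int = 500) -> list[str]:
    """Split the reference text into fixed-size word chunks."""
    words = reference_text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by keyword overlap with the question (a naive stand-in for a real retriever)."""
    q_terms = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: len(q_terms & set(c.lower().split())), reverse=True)
    return scored[:top_k]

def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Prepend the retrieved context to the board-style question."""
    context = "\n\n".join(retrieve(question, chunks))
    return (
        "Use the following reference material to answer the board-style question.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question:\n{question}\n"
        "Answer with the single best option and a brief explanation."
    )

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for an API call to GPT-3.5, GPT-4.0, or a similar model."""
    raise NotImplementedError("Replace with a real LLM API call.")

if __name__ == "__main__":
    # Assumed local excerpt of the reference text; not distributed with this sketch.
    reference = open("harrisons_excerpt.txt").read()
    question = "A 64-year-old man presents with fever and a new murmur. Which of the following is the best next step?"
    prompt = build_augmented_prompt(question, split_into_chunks(reference))
    # answer = query_llm(prompt)

In this setup, the only difference between the "augmented" and "unaugmented" conditions is whether the retrieved context is included in the prompt, which mirrors the with/without input-augmentation comparison reported in the abstract.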