{"title":"公开发布的大语言模型在内科委员会式问题上的表现。","authors":"Constantine Tarabanis, Sohail Zahid, Marios Mamalis, Kevin Zhang, Evangelos Kalampokis, Lior Jankelson","doi":"10.1371/journal.pdig.0000604","DOIUrl":null,"url":null,"abstract":"<p><p>Ongoing research attempts to benchmark large language models (LLM) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions. Limited data exists on how knowledge supplied to the models, derived from medical texts improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program released by the American College of Physicians with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Mode inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanation to an IM board-certified physician tasked with selecting the human generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7% outperforming GPT-3.5, human respondents, LaMDA and Llama 2 in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject with its highest and lowest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There is a 3.2-5.3% decrease in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot. There is 4.5-7.5% increase in performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions outperforming human respondents. Augmenting with domain-specific information improved performance rendering Retrieval Augmented Generation a possible technique for improving accuracy in medical examination LLM responses.</p>","PeriodicalId":74465,"journal":{"name":"PLOS digital health","volume":"3 9","pages":"e0000604"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11407633/pdf/","citationCount":"0","resultStr":"{\"title\":\"Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions.\",\"authors\":\"Constantine Tarabanis, Sohail Zahid, Marios Mamalis, Kevin Zhang, Evangelos Kalampokis, Lior Jankelson\",\"doi\":\"10.1371/journal.pdig.0000604\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Ongoing research attempts to benchmark large language models (LLM) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions. Limited data exists on how knowledge supplied to the models, derived from medical texts improves LLM performance. 
The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program released by the American College of Physicians with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Mode inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanation to an IM board-certified physician tasked with selecting the human generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7% outperforming GPT-3.5, human respondents, LaMDA and Llama 2 in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject with its highest and lowest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There is a 3.2-5.3% decrease in performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot. There is 4.5-7.5% increase in performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions outperforming human respondents. Augmenting with domain-specific information improved performance rendering Retrieval Augmented Generation a possible technique for improving accuracy in medical examination LLM responses.</p>\",\"PeriodicalId\":74465,\"journal\":{\"name\":\"PLOS digital health\",\"volume\":\"3 9\",\"pages\":\"e0000604\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11407633/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"PLOS digital health\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1371/journal.pdig.0000604\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/9/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLOS digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1371/journal.pdig.0000604","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions.
Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on how knowledge derived from medical texts and supplied to the models improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA, and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion, alongside the MKSAP explanations, to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA, and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There was a 3.2-5.3% decrease in the performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot, and a 4.5-7.5% increase in the performance of both models accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting model inputs with domain-specific information improved performance, rendering Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions.
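
The abstract describes augmenting model inputs with passages from a reference textbook via Retrieval Augmented Generation before querying each LLM. The sketch below is a minimal, illustrative outline of that general workflow, not the authors' code: it uses a naive keyword-overlap retriever over text chunks, and query_llm, harrisons_excerpt.txt, and the sample question are hypothetical placeholders.

# Minimal sketch of a RAG-style prompt-augmentation workflow (illustrative assumptions only).

def split_into_chunks(reference_text: str, chunk_size: int = 500) -> list[str]:
    """Split the reference text into fixed-size word chunks."""
    words = reference_text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rank chunks by keyword overlap with the question (a naive stand-in for a real retriever)."""
    q_terms = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: len(q_terms & set(c.lower().split())), reverse=True)
    return scored[:top_k]

def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Prepend the retrieved context to the board-style question."""
    context = "\n\n".join(retrieve(question, chunks))
    return (
        "Use the following reference material to answer the board-style question.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question:\n{question}\n"
        "Answer with the single best option and a brief explanation."
    )

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for an API call to GPT-3.5, GPT-4.0, or a similar model."""
    raise NotImplementedError("Replace with a real LLM API call.")

if __name__ == "__main__":
    # Assumed local excerpt of the reference text; not distributed with this sketch.
    reference = open("harrisons_excerpt.txt").read()
    question = "A 64-year-old man presents with fever and a new murmur. Which of the following is the best next step?"
    prompt = build_augmented_prompt(question, split_into_chunks(reference))
    # answer = query_llm(prompt)

In this setup, the only difference between the "augmented" and "unaugmented" conditions is whether the retrieved context is included in the prompt, which mirrors the with/without input-augmentation comparison reported in the abstract.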