Tessa Danehy, Jessica Hecht, Sabrina Kentis, Clyde B Schechter, Sunit P Jariwala
{"title":"信息学教育特刊:与医学知识问题相比,ChatGPT 在 USMLE 形式的伦理问题上表现更差。","authors":"Tessa Danehy, Jessica Hecht, Sabrina Kentis, Clyde B Schechter, Sunit P Jariwala","doi":"10.1055/a-2405-0138","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong> The main objective of this study is to evaluate the ability of the Large Language Model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer the United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. This study has the additional objectives of comparing the overall accuracy of GPT-3.5 to GPT-4 and assessing the variability of responses given by each version.</p><p><strong>Methods: </strong> Using AMBOSS, a third-party USMLE Step Exam test prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials asking these questions on GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy and a Shannon entropy calculation evaluated response variation.</p><p><strong>Results: </strong> Both versions of ChatGPT demonstrated worse performance on medical ethics questions compared to medical knowledge questions. GPT-4 performed 18% points (<i>p</i> < 0.05) worse on medical ethics questions compared to medical knowledge questions and GPT-3.5 performed 7% points (<i>p</i> = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22% points (<i>p</i> < 0.001) on medical ethics and 33% points (<i>p</i> < 0.001) on medical knowledge. GPT-4 also exhibited an overall lower Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively) which indicates lower variability in response.</p><p><strong>Conclusion: </strong> Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.</p>","PeriodicalId":48956,"journal":{"name":"Applied Clinical Informatics","volume":" ","pages":"1049-1055"},"PeriodicalIF":2.1000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11617073/pdf/","citationCount":"0","resultStr":"{\"title\":\"ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.\",\"authors\":\"Tessa Danehy, Jessica Hecht, Sabrina Kentis, Clyde B Schechter, Sunit P Jariwala\",\"doi\":\"10.1055/a-2405-0138\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objectives: </strong> The main objective of this study is to evaluate the ability of the Large Language Model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer the United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. 
This study has the additional objectives of comparing the overall accuracy of GPT-3.5 to GPT-4 and assessing the variability of responses given by each version.</p><p><strong>Methods: </strong> Using AMBOSS, a third-party USMLE Step Exam test prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials asking these questions on GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy and a Shannon entropy calculation evaluated response variation.</p><p><strong>Results: </strong> Both versions of ChatGPT demonstrated worse performance on medical ethics questions compared to medical knowledge questions. GPT-4 performed 18% points (<i>p</i> < 0.05) worse on medical ethics questions compared to medical knowledge questions and GPT-3.5 performed 7% points (<i>p</i> = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22% points (<i>p</i> < 0.001) on medical ethics and 33% points (<i>p</i> < 0.001) on medical knowledge. GPT-4 also exhibited an overall lower Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively) which indicates lower variability in response.</p><p><strong>Conclusion: </strong> Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.</p>\",\"PeriodicalId\":48956,\"journal\":{\"name\":\"Applied Clinical Informatics\",\"volume\":\" \",\"pages\":\"1049-1055\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11617073/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Clinical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1055/a-2405-0138\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/8/29 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q4\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Clinical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2405-0138","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/29 0:00:00","PubModel":"Epub","JCR":"Q4","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.
Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared with medical knowledge questions. Additional objectives are to compare the overall accuracy of GPT-3.5 with that of GPT-4 and to assess the variability of the responses given by each version.
Methods: Using AMBOSS, a third-party USMLE Step exam test-preparation service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions, matched on their difficulty for medical students. We posed each question to GPT-3.5 and GPT-4 in 30 independent trials and recorded the output. Accuracy was evaluated with a random-effects linear probability regression model, and response variation was evaluated with a Shannon entropy calculation.
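The abstract does not include analysis code; the following minimal Python sketch shows one way a random-effects linear probability model of this kind might be specified with statsmodels, fit here on synthetic data. All column names, probabilities, and the data-generating step are illustrative assumptions, not the study's data or code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic long-format data (illustrative only): 27 ethics + 27 knowledge
# questions, 30 trials each, answered by two model versions.
rows = []
for model_name, base in [("gpt35", 0.5), ("gpt4", 0.8)]:
    for qtype, shift in [("ethics", -0.1), ("knowledge", 0.0)]:
        for qid in range(27):
            p = float(np.clip(base + shift + rng.normal(0, 0.08), 0, 1))
            for _ in range(30):
                rows.append({
                    "correct": int(rng.random() < p),       # 1 = answered correctly
                    "question_type": qtype,
                    "model": model_name,
                    "question_id": f"{qtype}_{qid}",        # same item seen by both models
                })
df = pd.DataFrame(rows)

# Linear probability model (binary outcome, identity link) with a random
# intercept per question to account for repeated trials of the same item.
fit = smf.mixedlm("correct ~ question_type * model", df,
                  groups=df["question_id"]).fit()
print(fit.summary())
```

Under this specification, the coefficients on question_type, model, and their interaction correspond to percentage-point differences in accuracy, which is how the results below are reported.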
Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower (p < 0.05) on medical ethics questions than on medical knowledge questions, and GPT-3.5 scored 7 percentage points lower (p = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and by 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy than GPT-3.5 for medical ethics and medical knowledge questions (0.21 and 0.11 vs. 0.59 and 0.55, respectively), indicating lower variability in its responses.
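To interpret the entropy values above: a low Shannon entropy means the model chose nearly the same answer across the 30 trials of a question, while a higher value means its answer choices were spread out. A minimal Python sketch of the per-question calculation follows; the function name and the example response counts are hypothetical, not taken from the study.

```python
from collections import Counter
from math import log2

def shannon_entropy(responses):
    """Shannon entropy (in bits) of the answer-choice distribution across
    repeated trials of one question; 0 means every trial gave the same answer."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Hypothetical example: 30 trials of one question where the model chose
# answer "C" 28 times and "B" twice (illustrative values, not study data).
print(round(shannon_entropy(["C"] * 28 + ["B"] * 2), 2))  # ~0.35 bits
```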
Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in its answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.
Journal Introduction:
ACI is the third Schattauer journal dealing with biomedical and health informatics. It complements our other journals, Methods of Information in Medicine and the Yearbook of Medical Informatics. With the Yearbook of Medical Informatics serving as the "milestone" or state-of-the-art journal and Methods of Information in Medicine serving as the "science and research" journal of IMIA, ACI intends to be the "practical" journal of IMIA.