Large Language Models in Pathology: A Comparative Study on Multiple Choice Question Performance with Pathology Trainees

Wei Du, Jaryse Harris, Alessandro Brunetti, Olivia Leung, Xingchen Li, Selemon Walle, Qing Yu, Xiao Zhou, Fang Bian, Kajanna Mckenzie, Xueting Jin, Manita Kanathanavanich, Farah El-Sharkawy, Shunsuke Koga
Journal: medRxiv - Pathology
Publication date: 2024-07-11
DOI: 10.1101/2024.07.10.24310093
Citations: 0

Abstract

Aims: Large language models (LLMs), such as ChatGPT and Bard, have shown potential in various medical applications. This study aims to evaluate the performance of LLMs, specifically ChatGPT and Bard, in pathology by comparing their performance with that of pathology residents and fellows, and to assess the consistency of their responses.

Methods: We selected 150 multiple-choice questions covering 15 subspecialties, excluding those with images. Both ChatGPT and Bard were tested on these questions three times, and their responses were compared with those of 14 pathology trainees from two hospitals. Questions were categorized as easy, intermediate, or difficult based on trainee performance. Consistency and variability in LLM responses were analyzed across the three evaluation sessions.

Results: ChatGPT significantly outperformed Bard and the trainees, achieving an average total score of 82.2% compared with Bard's 49.5% and the trainees' 50.7%. ChatGPT's performance was notably stronger on difficult questions (61.8%-70.6%) than that of Bard (29.4%-32.4%) and the trainees (5.9%-44.1%). On easy questions, ChatGPT (88.9%-94.4%) and the trainees (75.0%-100.0%) showed similarly high scores. Consistency analysis revealed that ChatGPT maintained a high consistency rate of 80%-85% across the three tests, whereas Bard exhibited greater variability, with consistency rates of 54%-61%.

Conclusion: ChatGPT consistently outperformed Bard and the trainees, especially on difficult questions. While LLMs show significant potential in pathology education and practice, ongoing development and human oversight are essential for reliable clinical application.
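The analysis described in the Methods can be illustrated with a short sketch: scoring a session against an answer key, measuring response consistency across three sessions, and bucketing questions by trainee accuracy. This is not the authors' code; the difficulty thresholds and all data below are hypothetical, since the paper does not state its exact cutoffs here.

```python
def score(responses, answer_key):
    """Fraction of questions answered correctly in one session."""
    correct = sum(r == a for r, a in zip(responses, answer_key))
    return correct / len(answer_key)

def consistency_rate(sessions):
    """Fraction of questions answered identically across all sessions."""
    same = sum(len(set(answers)) == 1 for answers in zip(*sessions))
    return same / len(sessions[0])

def difficulty(trainee_accuracy, easy=0.8, hard=0.4):
    """Bucket a question by trainee accuracy (hypothetical cutoffs)."""
    if trainee_accuracy >= easy:
        return "easy"
    if trainee_accuracy <= hard:
        return "difficult"
    return "intermediate"

# Example: one model, three sessions over five questions.
key = ["A", "C", "B", "D", "A"]
sessions = [
    ["A", "C", "B", "D", "B"],
    ["A", "C", "B", "A", "B"],
    ["A", "C", "B", "D", "B"],
]
print(score(sessions[0], key))     # 0.8 (4 of 5 correct)
print(consistency_rate(sessions))  # 0.8 (4 of 5 answered identically)
```

A question answered identically in all three sessions counts toward the consistency rate regardless of whether the repeated answer is correct, which matches the paper's framing of consistency as response stability rather than accuracy.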