Authors: Sujeeth Krishna Shanmugam, David J Browning
Journal: Clinical ophthalmology (Auckland, N.Z.)
Published: 2024-11-12 (Journal Article, eCollection)
DOI: https://doi.org/10.2147/OPTH.S488232
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11568767/pdf/
Citations: 0
Comparison of Large Language Models in Diagnosis and Management of Challenging Clinical Cases.
Purpose: Compare large language models (LLMs) in analyzing and responding to a difficult series of ophthalmic cases.
Design: A comparative case series in which LLMs that met inclusion criteria were tested on twenty difficult case studies posed in open-text format.
Methods: Fifteen LLMs accessible to ophthalmologists were tested against twenty case studies published in JAMA Ophthalmology. Each case was presented to each LLM as identical, open-ended text, and open-ended responses were requested regarding differential diagnosis, next diagnostic tests, and recommended treatments. Responses were recorded and assessed for accuracy against the published correct answers. The main outcome was the accuracy of the LLMs against the correct answers. Secondary outcomes were comparative performance on the differential diagnosis, ancillary testing, and treatment subtests, and the readability of responses.
Results: Scores were normally distributed and ranged from 0 to 35 (out of a maximum score of 60), with a mean ± standard deviation of 19 ± 9. Scores for three of the LLMs (ChatGPT 3.5, Claude Pro, and Copilot Pro) were statistically significantly higher than the mean. Two of the high-performing LLMs were paid subscriptions (Claude Pro and Copilot Pro) and one was free (ChatGPT 3.5). While there were no clinical or statistical differences between ChatGPT 3.5 and Claude Pro, Copilot Pro was separated from the other highly ranked LLMs by +5 points, or 0.56 standard deviations. The reading level of responses from all tested programs was above the eighth-grade level that the American Medical Association (AMA) recommends for material aimed at the public.
Conclusion: Subscription LLMs were more prevalent among highly ranked LLMs, suggesting that these perform better as ophthalmic assistants. While readability was poor for the average person, the content was understood by a board-certified ophthalmologist. The accuracy of LLMs is not high enough to recommend their use for patient care in standalone mode, but their use in aiding clinicians with patient care and preventing oversights is promising.
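
Note on the Results arithmetic: the reported separation of +5 points can be restated in standard-deviation units using the reported standard deviation of 9. The minimal Python sketch below only re-derives figures from numbers given in the abstract (mean 19, standard deviation 9, maximum score 60); the variable and function names are illustrative assumptions, and no per-model scores are reconstructed because the abstract does not report them.

```python
# Figures taken from the Results paragraph of the abstract.
MEAN_SCORE = 19.0         # mean total score across the fifteen LLMs
SD_SCORE = 9.0            # standard deviation of total scores
MAX_SCORE = 60.0          # maximum attainable score
SEPARATION_POINTS = 5.0   # reported gap for Copilot Pro


def to_sd_units(points: float, sd: float) -> float:
    """Express a raw score difference in standard-deviation units."""
    return points / sd


def as_percent_of_max(score: float, max_score: float) -> float:
    """Express a raw score as a percentage of the maximum score."""
    return 100.0 * score / max_score


sd_units = to_sd_units(SEPARATION_POINTS, SD_SCORE)
mean_pct = as_percent_of_max(MEAN_SCORE, MAX_SCORE)

print(f"Separation: {sd_units:.2f} SD")          # 5 / 9, rounds to 0.56
print(f"Mean score: {mean_pct:.1f}% of maximum")  # 19 / 60, about 31.7%
```

The first figure reproduces the 0.56 standard deviations quoted in the abstract; the second is a derived restatement showing that the mean score was roughly a third of the maximum.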