Evaluation and comparison of large language models' responses to questions related optic neuritis.

IF 3.1 · Zone 3 (Medicine) · Q1 MEDICINE, GENERAL & INTERNAL
Frontiers in Medicine Pub Date : 2025-06-25 eCollection Date: 2025-01-01 DOI:10.3389/fmed.2025.1516442
Han-Jie He, Fang-Fang Zhao, Jia-Jian Liang, Yun Wang, Qian-Qian He, Hongjie Lin, Jingyun Cen, Feifei Chen, Tai-Ping Li, Zhanchi Hu, Jian-Feng Yang, Lan Chen, Carol Y Cheung, Yih-Chung Tham, Ling-Ping Cen
Citations: 0

Abstract

Evaluation and comparison of large language models' responses to questions related optic neuritis.

Objectives: Large language models (LLMs) show promise as clinical consultation tools and may assist optic neuritis patients, though research on their performance in this area is limited. Our study aims to assess and compare the performance of four commonly used LLM chatbots (Claude-2, ChatGPT-3.5, ChatGPT-4.0, and Google Bard) in addressing questions related to optic neuritis.

Methods: We curated 24 optic neuritis-related questions and had three ophthalmologists rate the responses on two three-point scales for accuracy and comprehensiveness. We also assessed readability using four scales. The final results showed performance differences among the four LLM-Chatbots.
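
The scoring scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the total accuracy score "out of 9" is the sum of the three raters' three-point scores for a question, and that the reported spread is a sample standard deviation. The abstract does not confirm either detail. The `ratings` values and the helper names `total_scores` and `summarize` are hypothetical, not data or code from the study.

```python
from statistics import mean, stdev

# Hypothetical ratings: one tuple per question, one score (1-3) per rater.
# These numbers are illustrative only and are NOT data from the study.
ratings = [
    (3, 3, 2),
    (2, 3, 3),
    (3, 3, 3),
    (2, 2, 2),
]

def total_scores(per_question):
    """Sum the three raters' scores, giving a total between 3 and 9."""
    return [sum(scores) for scores in per_question]

def summarize(totals):
    """Return (mean, sample standard deviation) of the per-question totals."""
    return mean(totals), stdev(totals)

totals = total_scores(ratings)
avg, sd = summarize(totals)
print(f"{avg:.2f} ± {sd:.2f}")
```

Running the same summary once per chatbot over all 24 questions would give figures in the "mean ± SD" form used in the Results.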

Results: The average total accuracy scores (out of 9) were 7.62 ± 0.86 for ChatGPT-4.0, 7.42 ± 1.20 for Google Bard, 7.21 ± 0.70 for ChatGPT-3.5, and 6.44 ± 1.07 for Claude-2. ChatGPT-4.0 (p = 0.0006) and Google Bard (p = 0.0015) were significantly more accurate than Claude-2. In addition, 62.5% of ChatGPT-4.0's responses were rated "Excellent," followed by 58.3% for Google Bard, both higher than Claude-2's 29.2% (all p ≤ 0.042) and ChatGPT-3.5's 41.7%. Both Claude-2 and Google Bard had 8.3% "Deficient" responses. The comprehensiveness scores were similar among the four LLMs (p = 0.1531). All responses required at least university-level reading proficiency.
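
As a quick consistency check, the reported percentages can be converted back into response counts. The sketch below assumes each percentage is taken over the 24 questions listed in the Methods (one overall rating per question per chatbot), which the abstract implies but does not state outright. The helper name `implied_count` is hypothetical.

```python
N_QUESTIONS = 24  # number of questions stated in the Methods

# Share of responses rated "Excellent", as reported in the Results.
excellent_pct = {
    "ChatGPT-4.0": 62.5,
    "Google Bard": 58.3,
    "ChatGPT-3.5": 41.7,
    "Claude-2": 29.2,
}

def implied_count(pct, n=N_QUESTIONS):
    """Nearest whole number of responses that a percentage of n implies."""
    return round(pct * n / 100)

counts = {name: implied_count(p) for name, p in excellent_pct.items()}
for name, k in counts.items():
    # Recompute the percentage from the count to confirm it rounds back.
    print(f"{name}: {k}/{N_QUESTIONS} = {100 * k / N_QUESTIONS:.1f}%")
```

Under that assumption the figures correspond to 15, 14, 10, and 7 "Excellent" responses out of 24, and the 8.3% "Deficient" rate corresponds to 2 responses out of 24.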

Conclusion: LLM chatbots hold immense potential as clinical consultation tools for optic neuritis, but they require further refinement and proper evaluation strategies before deployment to ensure reliable and accurate performance.

Source journal
Frontiers in Medicine (Medicine: General Medicine)
CiteScore: 5.10
Self-citation rate: 5.10%
Articles published: 3710
Review time: 12 weeks
About the journal: Frontiers in Medicine publishes rigorously peer-reviewed research linking basic research to clinical practice and patient care, as well as translating scientific advances into new therapies and diagnostic tools. Led by an outstanding Editorial Board of international experts, this multidisciplinary open-access journal is at the forefront of disseminating and communicating scientific knowledge and impactful discoveries to researchers, academics, clinicians and the public worldwide. In addition to papers that provide a link between basic research and clinical practice, a particular emphasis is given to studies that are directly relevant to patient care. In this spirit, the journal publishes the latest research results and medical knowledge that facilitate the translation of scientific advances into new therapies or diagnostic tools. The full listing of the Specialty Sections represented by Frontiers in Medicine is as listed below. As well as the established medical disciplines, Frontiers in Medicine is launching new sections that together will facilitate:
- the use of patient-reported outcomes under real world conditions
- the exploitation of big data and the use of novel information and communication tools in the assessment of new medicines
- the scientific bases for guidelines and decisions from regulatory authorities
- access to medicinal products and medical devices worldwide
- addressing the grand health challenges around the world