Evaluation and comparison of large language models' responses to questions related optic neuritis.
Authors: Han-Jie He, Fang-Fang Zhao, Jia-Jian Liang, Yun Wang, Qian-Qian He, Hongjie Lin, Jingyun Cen, Feifei Chen, Tai-Ping Li, Zhanchi Hu, Jian-Feng Yang, Lan Chen, Carol Y Cheung, Yih-Chung Tham, Ling-Ping Cen
Journal: Frontiers in Medicine, volume 12, article 1516442 (JCR Q1, Medicine, General & Internal; impact factor 3.1)
Published: 2025-06-25
DOI: 10.3389/fmed.2025.1516442 (https://doi.org/10.3389/fmed.2025.1516442)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12238082/pdf/
Objectives: Large language models (LLMs) show promise as clinical consultation tools and may assist optic neuritis patients, though research on their performance in this area is limited. Our study aims to assess and compare the performance of four commonly used LLM chatbots (Claude-2, ChatGPT-3.5, ChatGPT-4.0, and Google Bard) in addressing questions related to optic neuritis.
Methods: We curated 24 optic neuritis-related questions and had three ophthalmologists rate the responses on two three-point scales, one for accuracy and one for comprehensiveness. We also assessed readability using four scales. The final results showed performance differences among the four LLM chatbots.
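The scoring scheme described in the Methods can be illustrated with a minimal Python sketch. It assumes that each of the three graders assigns 1, 2, or 3 points per response and that the "total accuracy score (out of 9)" is the sum of the three ratings; the abstract does not spell out either detail. The function name `total_accuracy` and the example ratings are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def total_accuracy(ratings):
    """Sum three graders' ratings (assumed to be 1-3 each) into a 3-9 total."""
    if len(ratings) != 3 or any(r not in (1, 2, 3) for r in ratings):
        raise ValueError("expected three ratings, each 1, 2, or 3")
    return sum(ratings)


# Hypothetical ratings for three questions (NOT data from the study).
example_ratings = [(3, 3, 2), (2, 3, 3), (3, 3, 3)]

# One total per question, then a mean and sample standard deviation,
# which is the "mean ± SD" form used in the Results.
totals = [total_accuracy(r) for r in example_ratings]
summary = (mean(totals), stdev(totals))
print(totals, f"{summary[0]:.2f} ± {summary[1]:.2f}")
```

Under this assumption, a chatbot's reported score would be the mean of such per-question totals over the 24 questions.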
Results: The average total accuracy scores (out of 9) were: ChatGPT-4.0 (7.62 ± 0.86), Google Bard (7.42 ± 1.20), ChatGPT-3.5 (7.21 ± 0.70), and Claude-2 (6.44 ± 1.07). ChatGPT-4.0 (p = 0.0006) and Google Bard (p = 0.0015) were significantly more accurate than Claude-2. In addition, 62.5% of ChatGPT-4.0's responses were rated "Excellent," followed by 58.3% for Google Bard, both higher than Claude-2's 29.2% (all p ≤ 0.042) and ChatGPT-3.5's 41.7%. Both Claude-2 and Google Bard had 8.3% "Deficient" responses. The comprehensiveness scores were similar among the four LLMs (p = 0.1531). Notably, all responses required at least a university-level reading proficiency.
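As a quick arithmetic check on the percentages above, the following sketch converts each reported percentage into the number of responses it implies. It assumes that every percentage is taken over the 24 curated questions per chatbot, which the abstract suggests but does not state outright; the names `reported_percent` and `implied_count` are illustrative only.

```python
N_QUESTIONS = 24  # number of curated questions, from the Methods

# Percentages as reported in the Results.
reported_percent = {
    ("ChatGPT-4.0", "Excellent"): 62.5,
    ("Google Bard", "Excellent"): 58.3,
    ("ChatGPT-3.5", "Excellent"): 41.7,
    ("Claude-2", "Excellent"): 29.2,
    ("Claude-2", "Deficient"): 8.3,
    ("Google Bard", "Deficient"): 8.3,
}


def implied_count(percent, n=N_QUESTIONS):
    """Nearest whole number of responses matching a percentage of n."""
    return round(percent / 100 * n)


counts = {key: implied_count(p) for key, p in reported_percent.items()}

# Print each implied count and the percentage it maps back to.
for (model, grade), count in counts.items():
    back = 100 * count / N_QUESTIONS
    print(f"{model:12s} {grade:10s} {count:2d}/{N_QUESTIONS} = {back:.1f}%")
```

Each reported figure maps back to a whole number of responses out of 24 (for example, 62.5% corresponds to 15 of 24), which is consistent with the assumed denominator.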
Conclusion: LLM chatbots hold immense potential as clinical consultation tools for optic neuritis, but they require further refinement and proper evaluation strategies before deployment to ensure reliable and accurate performance.
About the journal:
Frontiers in Medicine publishes rigorously peer-reviewed research linking basic research to clinical practice and patient care, as well as translating scientific advances into new therapies and diagnostic tools. Led by an outstanding Editorial Board of international experts, this multidisciplinary open-access journal is at the forefront of disseminating and communicating scientific knowledge and impactful discoveries to researchers, academics, clinicians and the public worldwide.
In addition to papers that provide a link between basic research and clinical practice, a particular emphasis is given to studies that are directly relevant to patient care. In this spirit, the journal publishes the latest research results and medical knowledge that facilitate the translation of scientific advances into new therapies or diagnostic tools. The full listing of the Specialty Sections represented by Frontiers in Medicine is listed below. As well as the established medical disciplines, Frontiers in Medicine is launching new sections that together will facilitate:
- the use of patient-reported outcomes under real world conditions
- the exploitation of big data and the use of novel information and communication tools in the assessment of new medicines
- the scientific bases for guidelines and decisions from regulatory authorities
- access to medicinal products and medical devices worldwide
- addressing the grand health challenges around the world