Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude 3
{"title":"Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.","authors":"Fang-Fang Zhao, Han-Jie He, Jia-Jian Liang, Jingyun Cen, Yun Wang, Hongjie Lin, Feifei Chen, Tai-Ping Li, Jian-Feng Yang, Lan Chen, Ling-Ping Cen","doi":"10.1038/s41433-024-03545-9","DOIUrl":null,"url":null,"abstract":"<p><strong>Background/objective: </strong>This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by various Large Language Models (LLMs) (ChatGPT-3.5, Gemini, Claude 3, and GPT-4.0) in the clinical context of uveitis, utilizing a meticulous grading methodology.</p><p><strong>Methods: </strong>Twenty-seven clinical uveitis questions were presented individually to four Large Language Models (LLMs): ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined through a majority consensus approach. Comprehensiveness was evaluated using a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to discern significant differences among LLMs, employing a significance threshold of p < 0.05.</p><p><strong>Results: </strong>Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy compared to Gemini (p < 0.001). Claude 3 also showed the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated as 'Deficient', unlike Gemini (14.8%) (p = 0.014). ChatGPT 4 exhibited greater comprehensiveness compared to Gemini (p = 0.008), and Claude 3 showed higher comprehensiveness compared to Gemini (p = 0.042). Gemini showed significantly better readability compared to ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001). Gemini also had fewer words, letter characters, and sentences compared to ChatGPT 3.5 and Claude 3.</p><p><strong>Conclusions: </strong>Our study highlights the outstanding performance of Claude 3 and ChatGPT 4 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 emerge as pivotal tools in improving patient understanding and involvement in their uveitis healthcare journey.</p>","PeriodicalId":12125,"journal":{"name":"Eye","volume":" ","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eye","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1038/s41433-024-03545-9","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Background/objective: This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by four Large Language Models (LLMs), ChatGPT-3.5, GPT-4.0, Google Gemini, and Claude 3, in the clinical context of uveitis, using a meticulous grading methodology.
Methods: Twenty-seven clinical uveitis questions were presented individually to four Large Language Models (LLMs): ChatGPT (versions GPT-3.5 and GPT-4.0), Google Gemini, and Claude 3. Three experienced uveitis specialists independently assessed the responses for accuracy on a three-point scale across three rounds separated by 48-hour wash-out intervals. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined by majority consensus. Comprehensiveness was evaluated on a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was assessed using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to identify significant differences among the LLMs, with a significance threshold of p < 0.05.
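The abstract does not include the authors' scoring code; the following is a minimal Python sketch of the two quantitative steps described above: the majority-consensus accuracy label across three raters, and the Flesch-Kincaid Grade Level formula. The function names and the naive vowel-group syllable counter are illustrative assumptions, not the study's implementation.

```python
import re
from collections import Counter

def majority_rating(ratings: list[str]) -> str:
    """Final accuracy label from three independent raters
    ('Excellent', 'Marginal', or 'Deficient') via majority consensus."""
    label, count = Counter(ratings).most_common(1)[0]
    # A three-way split (no majority) would need adjudication in practice.
    return label if count >= 2 else "No consensus"

def count_syllables(word: str) -> int:
    """Crude vowel-group syllable estimate; dictionary-based tools are more accurate."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(majority_rating(["Excellent", "Excellent", "Marginal"]))  # -> Excellent
print(round(flesch_kincaid_grade("Uveitis is inflammation of the uvea."), 1))
```

Published readability tools (for example, the textstat Python package) use more careful syllable estimation, so scores from this sketch may differ slightly from those reported in studies like this one.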
Results: Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy than Gemini (p < 0.001). Claude 3 also had the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated 'Deficient', whereas 14.8% of Gemini's responses were (p = 0.014). Both ChatGPT 4 (p = 0.008) and Claude 3 (p = 0.042) showed greater comprehensiveness than Gemini. Gemini showed significantly better readability than ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001), and its responses contained fewer words, letter characters, and sentences than those of ChatGPT 3.5 and Claude 3.
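The abstract reports p-values but does not name the statistical tests used. As an illustration of how such proportion comparisons are commonly made, the sketch below applies Fisher's exact test to hypothetical counts (14.8% of 27 responses corresponds to 4 'Deficient' answers); it is not a reproduction of the study's analysis and will not reproduce its reported p-values.

```python
from scipy.stats import fisher_exact

# Rows: two models; columns: 'Deficient' vs. not 'Deficient' (27 questions each).
# Counts are hypothetical, chosen only to mirror the 4/27 (14.8%) vs. 0/27 pattern.
table = [[4, 23],
         [0, 27]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")  # significant if p < 0.05
```

A comparison across all four models at once would instead use a 4x2 contingency table (for example, a chi-square or Freeman-Halton test), which is one plausible source of the single p = 0.014 reported above.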
Conclusions: Our study highlights the outstanding performance of Claude 3 and ChatGPT 4 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT 4 and Claude 3 emerge as pivotal tools in improving patient understanding and involvement in their uveitis healthcare journey.
Journal introduction
Eye seeks to provide the international practising ophthalmologist with high-quality, academically rigorous articles on the latest global clinical and laboratory-based research. Its core aim is to advance the science and practice of ophthalmology. Whilst principally aimed at the practising clinician, the journal contains material of interest to a wider readership, including optometrists, orthoptists, other healthcare professionals, and research workers in all aspects of visual science worldwide. Eye is the official journal of The Royal College of Ophthalmologists.
Eye encourages the submission of original articles covering all aspects of ophthalmology including: external eye disease; oculo-plastic surgery; orbital and lacrimal disease; ocular surface and corneal disorders; paediatric ophthalmology and strabismus; glaucoma; medical and surgical retina; neuro-ophthalmology; cataract and refractive surgery; ocular oncology; ophthalmic pathology; ophthalmic genetics.