{"title":"Comparison of the Accuracy, Comprehensiveness, and Readability of ChatGPT, Google Gemini, and Microsoft Copilot on Dry Eye Disease.","authors":"Dilan Colak, Burcu Yakut, Abdullah Agin","doi":"10.14744/bej.2025.76743","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>This study compared the performance of ChatGPT, Google Gemini, and Microsoft Copilot in answering 25 questions about dry eye disease and evaluated comprehensiveness, accuracy, and readability metrics.</p><p><strong>Methods: </strong>The artificial intelligence (AI) platforms answered 25 questions derived from the American Academy of Ophthalmology's Eye Health webpage. Three reviewers assigned comprehensiveness (0-5) and accuracy (-2 to 2) scores. Readability metrics included Flesch-Kincaid Grade Level, Flesch Reading Ease Score, sentence/word statistics, and total content measures. Responses were rated by three independent reviewers. Readability metrics were also calculated, and platforms were compared using Kruskal-Wallis and Friedman tests with <i>post hoc</i> analysis. Reviewer consistency was assessed using the intraclass correlation coefficient (ICC).</p><p><strong>Results: </strong>Google Gemini demonstrated the highest comprehensiveness and accuracy scores, significantly outperforming Microsoft Copilot (p<0.001). ChatGPT produced the most sentences and words (p<0.001), while readability metrics showed no significant differences among models (p>0.05). Inter-observer reliability was highest for Google Gemini (ICC=0.701), followed by ChatGPT (ICC=0.578), with Microsoft Copilot showing the lowest agreement (ICC=0.495). These results indicate Google Gemini's superior performance and consistency, whereas Microsoft Copilot had the weakest overall performance.</p><p><strong>Conclusion: </strong>Google Gemini excelled in content volume while maintaining high comprehensiveness and accuracy, outperforming ChatGPT and Microsoft Copilot in content generation. The platforms displayed comparable readability and linguistic complexity. These findings inform AI tool selection in health-related contexts, emphasizing Google Gemini's strengths in detailed responses. Its superior performance suggests potential utility in clinical and patient-facing applications requiring accurate and comprehensive content.</p>","PeriodicalId":8740,"journal":{"name":"Beyoglu Eye Journal","volume":"10 3","pages":"168-174"},"PeriodicalIF":0.0000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12499718/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Beyoglu Eye Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14744/bej.2025.76743","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Objectives: This study compared the performance of ChatGPT, Google Gemini, and Microsoft Copilot in answering 25 questions about dry eye disease and evaluated comprehensiveness, accuracy, and readability metrics.
Methods: The three artificial intelligence (AI) platforms answered 25 questions derived from the American Academy of Ophthalmology's Eye Health webpage. Three independent reviewers assigned comprehensiveness (0-5) and accuracy (-2 to 2) scores to each response. Readability metrics included the Flesch-Kincaid Grade Level, Flesch Reading Ease Score, sentence/word statistics, and total content measures. Platforms were compared using Kruskal-Wallis and Friedman tests with post hoc analysis, and reviewer consistency was assessed using the intraclass correlation coefficient (ICC).
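For illustration, a minimal Python sketch of the readability metrics and group comparisons named above, assuming a crude vowel-group syllable counter and simulated per-question scores; it is not the authors' analysis pipeline, and SciPy's kruskal and friedmanchisquare stand in for the Kruskal-Wallis and Friedman tests.

```python
import re

import numpy as np
from scipy.stats import friedmanchisquare, kruskal


def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic (an assumption, not the authors' counter)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str) -> dict:
    """Flesch Reading Ease and Flesch-Kincaid Grade Level from raw text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)  # mean words per sentence
    spw = syllables / len(words)       # mean syllables per word
    return {
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "sentences": len(sentences),
        "words": len(words),
    }


# Simulated per-question comprehensiveness scores (25 questions x 3 platforms);
# the real inputs would be the reviewers' ratings of each response.
rng = np.random.default_rng(0)
chatgpt = rng.integers(3, 6, size=25)
gemini = rng.integers(3, 6, size=25)
copilot = rng.integers(2, 5, size=25)

# Kruskal-Wallis treats the platforms as independent groups; Friedman treats
# them as repeated measures on the same 25 questions.
print(kruskal(chatgpt, gemini, copilot))
print(friedmanchisquare(chatgpt, gemini, copilot))
print(readability("Dry eye occurs when tears do not lubricate the eye adequately. "
                  "Symptoms include stinging, burning, and blurred vision."))
```

The Friedman test is the natural repeated-measures counterpart here because all three platforms answered the same 25 questions.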
Results: Google Gemini demonstrated the highest comprehensiveness and accuracy scores, significantly outperforming Microsoft Copilot (p<0.001). ChatGPT produced the most sentences and words (p<0.001), while readability metrics showed no significant differences among models (p>0.05). Inter-observer reliability was highest for Google Gemini (ICC=0.701), followed by ChatGPT (ICC=0.578), with Microsoft Copilot showing the lowest agreement (ICC=0.495). These results indicate Google Gemini's superior performance and consistency, whereas Microsoft Copilot had the weakest overall performance.
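To make the agreement figures concrete, a short sketch of a Shrout-Fleiss ICC(2,1) (two-way random effects, absolute agreement, single rater) computed from the ANOVA mean squares; the 25x3 rating matrix is hypothetical, and the abstract does not specify which ICC variant the authors used.

```python
import numpy as np


def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x has shape (n_targets, k_raters), e.g. 25 responses scored by 3 reviewers.
    """
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    # Mean squares from the two-way ANOVA decomposition.
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between targets
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between raters
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical 25x3 matrix of reviewer scores for one platform.
rng = np.random.default_rng(1)
base = rng.integers(2, 6, size=(25, 1))
scores = np.clip(base + rng.integers(-1, 2, size=(25, 3)), 0, 5)
print(round(icc_2_1(scores.astype(float)), 3))
```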
Conclusion: Google Gemini maintained the highest comprehensiveness and accuracy, whereas ChatGPT generated the greatest content volume; the platforms displayed comparable readability and linguistic complexity. These findings can inform AI tool selection in health-related contexts, highlighting Google Gemini's strength in detailed, accurate responses. Its superior performance suggests potential utility in clinical and patient-facing applications requiring accurate and comprehensive content.