Diagnostic accuracy in dry eye: Insights into clinical and artificial intelligence limitations: Limitations of diagnostic accuracy in dry eye.

IF 3.7 | Zone 3 (Medicine) | JCR Q1 (Ophthalmology)
Germán Mejía-Salgado, William Rojas-Carabali, Carlos Cifuentes-González, María Andrea Bernal-Valencia, Paola Saboya-Galindo, Jaime Soto-Ariño, Valentina Dumar-Kerguelen, Guillermo Marroquín-Gómez, Martha Lucía Moreno-Pardo, Juliana Tirado-Ángel, Anat Galor, Alejandra de-la-Torre
Journal: Contact Lens & Anterior Eye, article 102509 (Journal Article)
Published: 2025-09-14
DOI: 10.1016/j.clae.2025.102509 (https://doi.org/10.1016/j.clae.2025.102509)
Citations: 0

Abstract

Purpose: To evaluate the agreement and performance of four large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4.0, Leny-ai, and MediSearch, in diagnosing and classifying dry eye disease (DED), compared with clinician judgment and the Dry Eye Workshop II (DEWS-II) criteria.

Methods: A standardized prompt was developed that incorporated retrospective clinical and symptomatic data from patients with suspected DED referred to a dry eye clinic. LLMs were evaluated for diagnosis (DED vs. no DED) and classification (aqueous-deficient, evaporative, mixed-component). Agreement was assessed using Cohen's kappa (Cκ) and Fleiss' kappa (Fκ). Balanced accuracy, sensitivity, specificity, and F1 score were calculated.
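
The metrics named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names (`binary_metrics`, `cohen_kappa`, `fleiss_kappa`) and the toy labels are hypothetical, and the formulas are the standard textbook definitions, assumed here to match what the study computed.

```python
from collections import Counter


def binary_metrics(reference, predicted):
    """Sensitivity, specificity, balanced accuracy and F1 for 0/1 labels."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1_denominator = 2 * tp + fp + fn
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same subjects."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in labels) / n**2
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Chance-corrected agreement among several raters.

    `ratings` holds one list per subject, each with the same number of labels.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    per_subject_sum = 0.0
    for subject in ratings:
        counts = Counter(subject)
        totals.update(counts)
        agreeing_pairs = sum(c * c for c in counts.values()) - n_raters
        per_subject_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar = per_subject_sum / n_subjects
    p_expected = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    if p_expected == 1.0:
        return 1.0
    return (p_bar - p_expected) / (1 - p_expected)


# Hypothetical toy data: 1 = DED, 0 = no DED (not taken from the study).
reference = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0]
metrics = binary_metrics(reference, predicted)
kappa = cohen_kappa(reference, predicted)

# Hypothetical subtype labels from three raters on two patients.
subtype_ratings = [
    ["evaporative", "evaporative", "mixed"],
    ["evaporative", "mixed", "mixed"],
]
subtype_kappa = fleiss_kappa(subtype_ratings)
```

In the toy subtype data the raters agree less often than chance would predict, so `fleiss_kappa` returns a negative value, the same sign as the Fκ values reported in the Results.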

Results: Among 338 patients (78.6 % female, mean age 53.2 years), clinicians diagnosed DED in 300, and the DEWS-II criteria identified DED in 234. With clinician judgment as the reference, LLMs showed high agreement for DED diagnosis (93 %-99 %, Cκ: 0.81-0.86). Subtype agreement was lower (aqueous-deficient: 0 %-18 %, evaporative: 4 %-80 %, mixed-component: 22 %-92 %; Fκ: -0.20 to -0.10). Diagnostic balanced accuracy was 48 %-56 %, with high sensitivity (93 %-99 %) but low specificity (0 %-16 %). Subtype balanced accuracy and F1 score ranged from 33 %-81 % and 0 %-71 %, respectively. With DEWS-II as the reference, agreement for DED diagnosis remained high (96 %-99 %) but with weaker Cκ (0.52-0.58). Subtype agreement was again low (aqueous-deficient: 0 %-20 %, evaporative: 9 %-68 %, mixed-component: 16 %-75 %; Fκ: -0.09 to -0.02). Diagnostic balanced accuracy was 49 %-56 %, sensitivity 97 %-99 %, and specificity 5 %-16 %. Subtype balanced accuracy ranged from 43 % to 56 %, and F1 score from 0 to 68.
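
A short worked calculation may help show why high raw agreement and sensitivity can coexist with a balanced accuracy near 50 %. It uses the standard definition of balanced accuracy and the clinician-reference figures above. Pairing the range endpoints is an assumption made only for illustration, since the abstract does not give per-model values.

```latex
\[
\text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}
\]
% Endpoint pairing (assumed): lowest and highest reported values
\[
\frac{0.93 + 0.00}{2} = 0.465,
\qquad
\frac{0.99 + 0.16}{2} = 0.575
\]
% Hypothetical classifier that labels every patient as DED,
% scored against the clinician reference (300 of 338 positive)
\[
\text{Agreement} = \frac{300}{338} \approx 0.888,
\quad
\text{Sensitivity} = 1,
\quad
\text{Specificity} = 0,
\quad
\text{Balanced accuracy} = 0.5
\]
```

The reported 48 %-56 % falls within the bounds of the first calculation. The second shows that, in a sample where about 89 % of patients are reference-positive, a classifier that never rules out DED would already reach high raw agreement while sitting at chance-level balanced accuracy.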

Conclusion: LLMs showed strong agreement and high sensitivity for DED diagnosis but limited specificity and poor subtype classification, mirroring clinical challenges and highlighting risks of overdiagnosis.

Source journal: Contact Lens & Anterior Eye
CiteScore: 7.60
Self-citation rate: 18.80%
Articles published: 198
Review time: 55 days
Journal description: Contact Lens & Anterior Eye is a research-based journal covering all aspects of contact lens theory and practice, including original articles on invention and innovations, as well as the regular features: Case Reports; Literary Reviews; Editorials; Instrumentation and Techniques; and Dates of Professional Meetings.