Germán Mejía-Salgado, William Rojas-Carabali, Carlos Cifuentes-González, María Andrea Bernal-Valencia, Paola Saboya-Galindo, Jaime Soto-Ariño, Valentina Dumar-Kerguelen, Guillermo Marroquín-Gómez, Martha Lucía Moreno-Pardo, Juliana Tirado-Ángel, Anat Galor, Alejandra de-la-Torre
Contact Lens & Anterior Eye, article 102509. Published 2025-09-14. DOI: https://doi.org/10.1016/j.clae.2025.102509. Impact factor 3.7 (JCR Q1, Ophthalmology).
Diagnostic accuracy in dry eye: Insights into clinical and artificial intelligence limitations: Limitations of diagnostic accuracy in dry eye.
Purpose: To evaluate the agreement and performance of four large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4.0, Leny-ai, and MediSearch, in diagnosing and classifying Dry Eye Disease (DED), compared to clinician judgment and Dry Eye Workshop-II (DEWS-II) criteria.
Methods: A standardized prompt was developed that incorporated retrospective clinical and symptomatic data from patients with suspected DED referred to a dry eye clinic. LLMs were evaluated for diagnosis (DED vs. no DED) and classification (aqueous-deficient, evaporative, mixed-component). Agreement was assessed using Cohen's kappa (Cκ) and Fleiss' kappa (Fκ). Balanced accuracy, sensitivity, specificity, and F1 score were calculated.
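The metrics named in the Methods can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' analysis code; the function names `binary_metrics` and `cohen_kappa` and the ten toy labels are assumptions made up for this example and are not study data.

```python
from collections import Counter


def binary_metrics(reference, predicted, positive="DED"):
    """Sensitivity, specificity, balanced accuracy, and F1 for one positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    tn = sum(r != positive and p != positive for r, p in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Balanced accuracy is the mean of sensitivity and specificity.
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; this sketch reports 0.0
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical, not from the study): 8 DED and 2 non-DED cases.
reference = ["DED"] * 8 + ["no DED"] * 2
predicted = ["DED"] * 8 + ["DED", "no DED"]

metrics = binary_metrics(reference, predicted)
kappa = cohen_kappa(reference, predicted)
print(metrics)
print(round(kappa, 3))
```

Fleiss' kappa extends the same chance-correction idea to more than two raters and is omitted here to keep the sketch short.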
Results: Among 338 patients (78.6 % female, mean age 53.2 years), clinicians diagnosed DED in 300, and DEWS-II criteria identified 234. LLMs showed high agreement with clinicians for DED diagnosis (93 %-99 %, Cκ: 0.81-0.86). Subtype agreement was lower (aqueous-deficient: 0 %-18 %, evaporative: 4 %-80 %, mixed-component: 22 %-92 %; Fκ: -0.20 to -0.10). Diagnostic balanced accuracy was 48 %-56 %, with high sensitivity (93 %-99 %) but low specificity (0 %-16 %). Subtype balanced accuracy and F1 score ranged from 33 %-81 % and 0 %-71 %, respectively. Compared to DEWS-II, agreement for DED diagnosis remained high (96 %-99 %) but with weaker Cκ (0.52-0.58). Subtype agreement was again low (aqueous-deficient: 0 %-20 %, evaporative: 9 %-68 %, mixed-component: 16 %-75 %; Fκ: -0.09 to -0.02). Diagnostic balanced accuracy was 49 %-56 %, sensitivity 97 %-99 %, and specificity 5 %-16 %. Subtype balanced accuracy ranged from 43 % to 56 %, and F1 score from 0 % to 68 %.
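To see why high sensitivity combined with near-zero specificity gives a balanced accuracy close to 50 %, consider a purely hypothetical rater that labels every patient as DED, scored against the clinician counts reported above (300 DED among 338 patients). The sketch below is illustrative only: the always-positive rater and the function name `summarize` are assumptions, and the rater does not represent any of the four LLMs in the study.

```python
def summarize(reference, predicted, positive="DED"):
    """Raw agreement, sensitivity, specificity, balanced accuracy, Cohen's kappa."""
    n = len(reference)
    tp = fn = fp = tn = 0
    for ref, pred in zip(reference, predicted):
        if ref == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    raw_agreement = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Chance agreement computed from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (raw_agreement - expected) / (1 - expected) if expected < 1 else 0.0
    return {
        "raw_agreement": raw_agreement,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }


# Clinician reference counts taken from the abstract: 300 DED of 338 patients.
n_total, n_ded = 338, 300
reference = ["DED"] * n_ded + ["no DED"] * (n_total - n_ded)

# Hypothetical rater that calls every patient DED (not one of the study's LLMs).
predicted = ["DED"] * n_total

result = summarize(reference, predicted)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because the cohort is dominated by DED cases, raw agreement stays high for this rater even though it never identifies a non-DED patient, which is why balanced accuracy and specificity are the more informative figures in a referral population like this one.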
Conclusion: LLMs showed strong agreement and high sensitivity for DED diagnosis but limited specificity and poor subtype classification, mirroring clinical challenges and highlighting risks of overdiagnosis.
About the journal:
Contact Lens & Anterior Eye is a research-based journal covering all aspects of contact lens theory and practice, including original articles on invention and innovations, as well as the regular features of: Case Reports; Literary Reviews; Editorials; Instrumentation and Techniques and Dates of Professional Meetings.