Ting Fang Tan, Arun J Thirunavukarasu, Chrystie Quek, Daniel S W Ting
{"title":"Evaluation of ophthalmic large language models: quantitative vs. qualitative methods.","authors":"Ting Fang Tan, Arun J Thirunavukarasu, Chrystie Quek, Daniel S W Ting","doi":"10.1097/ICU.0000000000001171","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose of review: </strong>Alongside the development of large language models (LLMs) and generative artificial intelligence (AI) applications across a diverse range of clinical applications in Ophthalmology, this review highlights the importance of evaluation of LLM applications by discussing evaluation metrics commonly adopted.</p><p><strong>Recent findings: </strong>Generative AI applications have demonstrated encouraging performance in clinical applications of Ophthalmology. Beyond accuracy, evaluation in the form of quantitative and qualitative metrics facilitate a more nuanced assessment of LLM output responses. Several challenges limit evaluation including the lack of consensus on standardized benchmarks, and limited availability of robust and curated clinical datasets.</p><p><strong>Summary: </strong>This review outlines the spectrum of quantitative and qualitative evaluation metrics adopted in existing studies, highlights key challenges in LLM evaluation, to catalyze further work towards standardized and domain-specific evaluation. Robust evaluation to effectively validate clinical LLM applications is crucial in closing the gap towards clinical integration.</p>","PeriodicalId":50604,"journal":{"name":"Current Opinion in Ophthalmology","volume":" ","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current Opinion in Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/ICU.0000000000001171","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Abstract
Purpose of review: Alongside the development of large language models (LLMs) and generative artificial intelligence (AI) across a diverse range of clinical applications in Ophthalmology, this review highlights the importance of evaluating LLM applications by discussing commonly adopted evaluation metrics.
Recent findings: Generative AI applications have demonstrated encouraging performance in clinical applications of Ophthalmology. Beyond accuracy, evaluation with quantitative and qualitative metrics facilitates a more nuanced assessment of LLM output responses. Several challenges limit evaluation, including the lack of consensus on standardized benchmarks and the limited availability of robust, curated clinical datasets.
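As an illustration of the quantitative metrics referred to above, the sketch below (not drawn from the review itself; the question format and the ask_llm function are hypothetical placeholders) computes exact-match accuracy on multiple-choice ophthalmology questions and a simple token-overlap F1 score for free-text answers.

```python
# Minimal sketch of quantitative LLM evaluation (illustrative only).
# `questions` and `ask_llm` are hypothetical placeholders, not from the review.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(questions, ask_llm):
    """Return exact-match accuracy (MCQ items) and mean token F1 (free-text items)."""
    correct, f1_scores = 0, []
    for q in questions:
        answer = ask_llm(q["prompt"])
        if q["type"] == "mcq":
            correct += int(answer.strip().upper() == q["answer"].upper())
        else:
            f1_scores.append(token_f1(answer, q["answer"]))
    mcq_total = sum(q["type"] == "mcq" for q in questions)
    return {
        "mcq_accuracy": correct / mcq_total if mcq_total else None,
        "free_text_f1": sum(f1_scores) / len(f1_scores) if f1_scores else None,
    }
```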
Summary: This review outlines the spectrum of quantitative and qualitative evaluation metrics adopted in existing studies and highlights key challenges in LLM evaluation, to catalyze further work towards standardized and domain-specific evaluation. Robust evaluation that effectively validates clinical LLM applications is crucial to closing the gap towards clinical integration.
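Qualitative evaluation of the kind discussed in the review is often operationalized by having clinician graders rate each LLM response against a rubric. A minimal sketch of aggregating such ratings follows; the rubric dimensions and the 1 to 5 Likert scale are assumptions for illustration, not specified by the review.

```python
# Minimal sketch of aggregating clinician Likert ratings (illustrative only).
# The rubric dimensions and 1-5 scale are assumptions, not from the review.
from statistics import mean

RUBRIC = ("accuracy", "comprehensiveness", "safety")


def aggregate_ratings(ratings):
    """ratings: one dict per grader, e.g. {"accuracy": 4, "comprehensiveness": 3, "safety": 5}.
    Returns the mean score per rubric dimension across graders."""
    return {dim: mean(r[dim] for r in ratings) for dim in RUBRIC}


# Example: three graders rating a single LLM response.
graders = [
    {"accuracy": 4, "comprehensiveness": 3, "safety": 5},
    {"accuracy": 5, "comprehensiveness": 4, "safety": 5},
    {"accuracy": 4, "comprehensiveness": 4, "safety": 4},
]
print(aggregate_ratings(graders))
```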
About the journal:
Current Opinion in Ophthalmology is an indispensable resource featuring key up-to-date and important advances in the field from around the world. With renowned guest editors for each section, every bimonthly issue of Current Opinion in Ophthalmology delivers fresh insight into topics such as glaucoma, refractive surgery, and corneal and external disorders. With ten sections in total, the journal provides a convenient and thorough review of the field and will be of interest to researchers, clinicians and other healthcare professionals alike.