Performance of vision language models for optic disc swelling identification on fundus photographs.

Frontiers in Digital Health · Impact Factor 3.2 · Q1, Health Care Sciences & Services
Published: 2025-08-25 (eCollection date: 2025-01-01) · DOI: 10.3389/fdgth.2025.1660887
Kelvin Zhenghao Li, Tuyet Thao Nguyen, Heather E Moss
{"title":"Performance of vision language models for optic disc swelling identification on fundus photographs.","authors":"Kelvin Zhenghao Li, Tuyet Thao Nguyen, Heather E Moss","doi":"10.3389/fdgth.2025.1660887","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing a consideration of clinical context. In this study, we compare the performance of non-specialty-trained VLMs with different prompts in the classification of optic disc swelling on fundus photographs.</p><p><strong>Methods: </strong>A diagnostic test accuracy study was conducted utilizing an open-sourced dataset. Five different prompts (increasing in context) were used with each of five different VLMs (Llama 3.2-vision, LLaVA-Med, LLaVA, GPT-4o, and DeepSeek-4V), resulting in 25 prompt-model pairs. The performance of VLMs in classifying photographs with and without optic disc swelling was measured using Youden's index (YI), F1 score, and accuracy rate.</p><p><strong>Results: </strong>A total of 779 images of normal optic discs and 295 images of swollen discs were obtained from an open-source image database. Among the 25 prompt-model pairs, valid response rates ranged from 7.8% to 100% (median 93.6%). Diagnostic performance ranged from YI: 0.00 to 0.231 (median 0.042), F1 score: 0.00 to 0.716 (median 0.401), and accuracy rate: 27.5 to 70.5% (median 58.8%). The best-performing prompt-model pair was GPT-4o with role-playing with Chain-of-Thought and few-shot prompting. On average, Llama 3.2-vision performed the best (average YI across prompts 0.181). There was no consistent relationship between the amount of information given in the prompt and the model performance.</p><p><strong>Conclusions: </strong>Non-specialty-trained VLMs could classify photographs of swollen and normal optic discs better than chance, with performance varying by model. Increasing prompt complexity did not consistently improve performance. Specialty-specific VLMs may be necessary to improve ophthalmic image analysis performance.</p>","PeriodicalId":73078,"journal":{"name":"Frontiers in digital health","volume":"7 ","pages":"1660887"},"PeriodicalIF":3.2000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12415036/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdgth.2025.1660887","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Introduction: Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because they are multimodal, VLMs offer a clinical advantage over pure image classification models for diagnosing optic disc swelling: they can take clinical context into account. In this study, we compare the performance of non-specialty-trained VLMs, using a range of prompts, in classifying optic disc swelling on fundus photographs.

Methods: A diagnostic test accuracy study was conducted using an open-source dataset. Five prompts with increasing levels of context were used with each of five VLMs (Llama 3.2-vision, LLaVA-Med, LLaVA, GPT-4o, and DeepSeek-4V), yielding 25 prompt-model pairs. Performance in classifying photographs with and without optic disc swelling was measured using Youden's index (YI), F1 score, and accuracy.
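Youden's index (YI = sensitivity + specificity − 1) scores chance-level performance as 0 regardless of class imbalance, which matters here because normal discs outnumber swollen ones. Below is a minimal sketch, not the authors' code, of how the three reported metrics can be computed for one prompt-model pair; the helper `evaluate_pair` and the binary coding (1 = swollen, 0 = normal) are illustrative assumptions.

```python
# Minimal sketch: scoring one prompt-model pair on a binary
# swollen/normal classification task (1 = swollen disc, 0 = normal disc).
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate_pair(y_true, y_pred):
    """Return Youden's index, F1 score, and accuracy for one prompt-model pair."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    youden = sensitivity + specificity - 1  # 0 = chance, 1 = perfect
    return youden, f1_score(y_true, y_pred), accuracy_score(y_true, y_pred)

# Toy example: four swollen and four normal images, imperfect predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
yi, f1, acc = evaluate_pair(y_true, y_pred)
print(f"YI={yi:.3f}  F1={f1:.3f}  accuracy={acc:.3f}")  # 0.500, 0.750, 0.750
```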

Results: A total of 779 images of normal optic discs and 295 images of swollen discs were obtained from an open-source image database. Across the 25 prompt-model pairs, valid response rates ranged from 7.8% to 100% (median 93.6%). Diagnostic performance ranged from 0.00 to 0.231 for YI (median 0.042), from 0.00 to 0.716 for F1 score (median 0.401), and from 27.5% to 70.5% for accuracy (median 58.8%). The best-performing pair was GPT-4o with a role-playing prompt combining Chain-of-Thought and few-shot prompting. Averaged across prompts, Llama 3.2-vision performed best (mean YI 0.181). There was no consistent relationship between the amount of information in the prompt and model performance.
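For context, a role-playing prompt with Chain-of-Thought and few-shot elements might be assembled as sketched below. The abstract does not give the authors' actual wording, so the text, variable names, and structure here are purely illustrative assumptions.

```python
# Hypothetical reconstruction of a role-playing + Chain-of-Thought +
# few-shot prompt; the authors' actual prompts are not given in the abstract.
ROLE = "You are an experienced neuro-ophthalmologist reviewing fundus photographs."

FEW_SHOT = (
    "Two labeled reference photographs are attached: the first shows a "
    "swollen optic disc, the second a normal optic disc."
)

CHAIN_OF_THOUGHT = (
    "For the test photograph, reason step by step about the disc margins, "
    "disc elevation, and obscuration of vessels at the disc before deciding."
)

TASK = "Then answer with exactly one word: 'swollen' or 'normal'."

prompt = "\n".join([ROLE, FEW_SHOT, CHAIN_OF_THOUGHT, TASK])
print(prompt)
# The reference and test images would be attached through each VLM's own
# image-input mechanism (e.g., the messages payload of a chat API).
```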

Conclusions: Non-specialty-trained VLMs could classify photographs of swollen and normal optic discs better than chance, with performance varying by model. Increasing prompt complexity did not consistently improve performance. Specialty-specific VLMs may be needed to improve ophthalmic image analysis.

