{"title":"MedVH: Toward Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context.","authors":"Zishan Gu, Jiayuan Chen, Fenglin Liu, Changchang Yin, Ping Zhang","doi":"10.1002/aisy.202500255","DOIUrl":null,"url":null,"abstract":"<p><p>Large vision language models (LVLMs) have achieved superior performance on natural image and text tasks, inspiring extensive fine-tuning research. However, their robustness against hallucination in clinical contexts remains understudied. We propose the Medical Visual Hallucination Test (MedVH), a novel evaluation framework assessing hallucination tendencies in both medical-specific and general-purpose LVLMs. MedVH encompasses six tasks targeting medical hallucinations, including two traditional tasks and four novel tasks formatted as multi-choice visual question answering and long response generation. Our extensive experiments with six evaluation metrics reveal that medical LVLMs, despite promising performance on standard medical tasks, are particularly susceptible to hallucinations-often more so than general models. This raises significant concerns about domain-specific model reliability. For real-world applications, medical LVLMs must accurately integrate medical knowledge while maintaining robust reasoning to prevent hallucination. We explore mitigation methods without model-specific fine-tuning, including prompt engineering and collaboration between general and domain-specific models. Our work provides a foundation for future evaluation studies. The dataset is available at PhysioNet: https://physionet.org/content/medvh.</p>","PeriodicalId":93858,"journal":{"name":"Advanced intelligent systems (Weinheim an der Bergstrasse, Germany)","volume":" ","pages":""},"PeriodicalIF":6.1000,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12363988/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advanced intelligent systems (Weinheim an der Bergstrasse, Germany)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/aisy.202500255","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Large vision language models (LVLMs) have achieved superior performance on natural image and text tasks, inspiring extensive fine-tuning research. However, their robustness against hallucination in clinical contexts remains understudied. We propose the Medical Visual Hallucination Test (MedVH), a novel evaluation framework assessing hallucination tendencies in both medical-specific and general-purpose LVLMs. MedVH encompasses six tasks targeting medical hallucinations, including two traditional tasks and four novel tasks formatted as multi-choice visual question answering and long response generation. Our extensive experiments with six evaluation metrics reveal that medical LVLMs, despite promising performance on standard medical tasks, are particularly susceptible to hallucinations, often more so than general models. This raises significant concerns about domain-specific model reliability. For real-world applications, medical LVLMs must accurately integrate medical knowledge while maintaining robust reasoning to prevent hallucination. We explore mitigation methods without model-specific fine-tuning, including prompt engineering and collaboration between general and domain-specific models. Our work provides a foundation for future evaluation studies. The dataset is available at PhysioNet: https://physionet.org/content/medvh.
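To make the multi-choice evaluation setting concrete, below is a minimal, illustrative sketch of how hallucination-style errors might be scored in a multiple-choice VQA setup. This is not the MedVH benchmark code: the option labels, the "None of the above" convention, and the scoring rules are assumptions chosen only to illustrate the general idea of separating accuracy from a hallucination rate.

```python
# Illustrative sketch (not the MedVH protocol) of scoring multiple-choice VQA
# answers for hallucination-style errors. All field names and scoring rules
# here are assumptions for demonstration purposes.
from dataclasses import dataclass


@dataclass
class MCQAItem:
    question: str
    options: dict[str, str]          # e.g. {"A": "...", "B": "..."}
    correct: str                     # gold option label, e.g. "C"
    none_option: str | None = None   # label of a "none of the above" choice, if any


def extract_choice(response: str, options: dict[str, str]) -> str | None:
    """Return the first option label (A, B, C, ...) mentioned in the model output."""
    for token in response.replace(".", " ").replace(")", " ").split():
        if token.upper() in options:
            return token.upper()
    return None


def score(items: list[MCQAItem], responses: list[str]) -> dict[str, float]:
    """Accuracy plus a crude 'hallucination' rate: picking a concrete finding
    when the gold answer is the 'none of the above' distractor."""
    correct = hallucinated = 0
    for item, resp in zip(items, responses):
        choice = extract_choice(resp, item.options)
        if choice == item.correct:
            correct += 1
        elif item.none_option and item.correct == item.none_option and choice is not None:
            hallucinated += 1
    n = len(items)
    return {"accuracy": correct / n, "hallucination_rate": hallucinated / n}


if __name__ == "__main__":
    demo = [MCQAItem(
        question="Which abnormality is visible in this chest X-ray?",
        options={"A": "Cardiomegaly", "B": "Pneumothorax", "C": "None of the above"},
        correct="C",
        none_option="C",
    )]
    print(score(demo, ["The image clearly shows A. Cardiomegaly."]))
    # -> {'accuracy': 0.0, 'hallucination_rate': 1.0}
```

The design choice reflected here, i.e. distinguishing plain errors from confident selection of a nonexistent finding, mirrors the kind of hallucination tendency the abstract describes, but the actual MedVH tasks and metrics are defined in the paper and dataset linked above.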