Ahmed T. Elboardy, Ghada Khoriba, Mohammad al-Shatouri, Mohammed Mousa, Essam A. Rashed
Informatics in Medicine Unlocked, Volume 58, Article 101692 (published 2025-01-01). DOI: 10.1016/j.imu.2025.101692. Available at: https://www.sciencedirect.com/science/article/pii/S2352914825000814
Benchmarking vision-language models for brain cancer diagnosis using multisequence MRI
The rapid adoption of Large Language Models (LLMs) across many fields has prompted growing interest in their application within the healthcare ecosystem. In particular, Vision-Language Models (VLMs) offer potential for generating radiology reports. This study performs a comprehensive evaluation of state-of-the-art VLMs of varying sizes and domain specializations to determine their effectiveness in generating radiology reports for MRI scans of brain cancer patients. We conducted a comparative analysis of several open-source VLMs, including Qwen2-VL, Meta-Vision 3.2, PaliGemma 2, DeepSeek-VL2, Nvidia open-source models, and medical-specific VLMs. Each model family was assessed across small, medium, and large variants. A benchmark dataset of multisequence brain MRI scans from cancer patients was curated with expert annotation and guidance from board-certified radiologists. The models were evaluated on their ability to generate complete radiology reports using objective metrics, reasoning-model judges (R1 and o1, which rated the generated reports on completeness, conciseness, and correctness), and human experts. Model performance varied significantly with size and type. Among large models, Meta-Vision 3.2 90B achieved the highest scores, with an o1 score of 70.19% and an R1 score of 68.09%. In the medium category, Meta-Vision 11B and DeepSeek-VL-2-27B outperformed the others, achieving o1 scores of 57.56% and 53.44%, and R1 scores of 51.06% and 52.75%, respectively. Among the smaller models, Qwen2-VL-2B performed best (o1: 23.88%, R1: 23.25%). To our knowledge, this is the first study evaluating VLMs for generating comprehensive radiology reports for brain cancer diagnosis using multisequence MRI. Our findings reveal substantial performance differences based on model size and specialization, offering important guidance for the future development and optimization of medical VLMs to support diagnostic radiology workflows.
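The abstract does not publish the exact formula behind the reported o1/R1 percentages, but the judge-based protocol it describes can be sketched as follows. This is a minimal illustration of LLM-as-judge score aggregation, assuming a hypothetical 0-10 rating scale per criterion; the class and function names are illustrative, not the authors' implementation:

```python
# Hedged sketch of an LLM-as-judge aggregation scheme: each generated report
# receives three ratings (completeness, conciseness, correctness), which are
# averaged and normalised to a percentage, then averaged over all reports.
# The 0-10 scale and the equal weighting are assumptions for illustration.
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgeRating:
    completeness: float  # 0-10, hypothetical scale
    conciseness: float   # 0-10
    correctness: float   # 0-10

    def score(self) -> float:
        """Unweighted mean of the three criteria, expressed as a percentage."""
        return mean([self.completeness, self.conciseness, self.correctness]) * 10.0


def benchmark_score(ratings: list[JudgeRating]) -> float:
    """Mean percentage score over all reports a model generated."""
    return mean(r.score() for r in ratings)


ratings = [JudgeRating(8, 7, 6), JudgeRating(9, 6, 8)]
print(round(benchmark_score(ratings), 2))  # → 73.33
```

In practice the per-criterion ratings would come from prompting a reasoning model (such as o1 or R1) with the MRI-derived report and a rubric; the aggregation above only shows how such ratings might be folded into a single benchmark percentage.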
Journal introduction:
Informatics in Medicine Unlocked (IMU) is an international gold open access journal covering a broad spectrum of topics within medical informatics, including (but not limited to) papers focusing on imaging, pathology, teledermatology, public health, ophthalmology, nursing, and translational medicine informatics. The full papers published in the journal are accessible to all who visit the website.