PathologyVLM: a large vision-language model for pathology image understanding

IF 10.7 · CAS Tier 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Dawei Dai, Yuanhui Zhang, Qianlan Yang, Long Xu, Xiaojing Shen, Shuyin Xia, Guoyin Wang
Journal: Artificial Intelligence Review, vol. 58, issue 6
DOI: 10.1007/s10462-025-11190-1
Publication date: 2025-03-28 (Journal Article)
Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10462-025-11190-1.pdf
Article page: https://link.springer.com/article/10.1007/s10462-025-11190-1
Citations: 0

Abstract

Previous advances in pathology image understanding primarily involved developing models tailored to specific tasks. Recent studies have demonstrated that large vision-language models can enhance the performance of various downstream tasks in medical image understanding. In this study, we developed a domain-specific large vision-language model (PathologyVLM) for pathology image understanding. Specifically, (1) we first construct a human pathology image-text dataset by cleaning public medical image-text data for domain-specific alignment; (2) using the proposed image-text data, we first train a pathology language-image pretraining (PLIP) model as a specialized visual encoder to extract features from pathology images, and then develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PathologyVLM: the first stage for domain alignment, and the second for the end-to-end visual question answering (VQA) task. In experiments, we evaluate PathologyVLM on both supervised and zero-shot VQA datasets; our model achieved the best overall performance among multimodal models of similar scale. Ablation experiments also confirmed the effectiveness of our design. We posit that our PathologyVLM model and the datasets presented in this work can promote research in the field of computational pathology. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/PA-LLaVA
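The abstract does not detail how the scale-invariant connector works, but a common way to avoid rescaling images to one resolution is to pool a variable-sized grid of patch features down to a fixed token count before handing them to the language model. The sketch below illustrates that idea only; the function name `adaptive_pool_1d` and the pooling scheme are assumptions for illustration, not the paper's design.

```python
# Hedged sketch: map a variable-length sequence of patch-feature vectors to a
# fixed number of tokens via adaptive average pooling, so the language model
# always sees the same token budget regardless of the input image's size.

def adaptive_pool_1d(features, out_len):
    """Average n feature vectors (lists of floats) down to out_len vectors."""
    n = len(features)
    dim = len(features[0])
    pooled = []
    for i in range(out_len):
        # Each output slot averages an approximately equal slice of the input.
        start = (i * n) // out_len
        end = max(((i + 1) * n) // out_len, start + 1)
        window = features[start:end]
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled
```

With this kind of connector, a 7-patch image and a 70-patch image both yield the same number of visual tokens, which is one plausible reading of "avoiding the information loss caused by image scaling": the image is never resized, only its features are aggregated.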

Source journal: Artificial Intelligence Review (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 22.00
Self-citation rate: 3.30%
Articles per year: 194
Review time: 5.3 months
Journal description: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.