PathologyVLM: a large vision-language model for pathology image understanding

IF 10.7 · CAS Tier 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Dawei Dai, Yuanhui Zhang, Qianlan Yang, Long Xu, Xiaojing Shen, Shuyin Xia, Guoyin Wang
Journal: Artificial Intelligence Review, vol. 58, issue 6
DOI: 10.1007/s10462-025-11190-1
Publication date: 2025-03-28 (Journal Article)
Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10462-025-11190-1.pdf
Article page: https://link.springer.com/article/10.1007/s10462-025-11190-1
Citations: 0

Abstract

Previous advances in pathology image understanding primarily involved developing models tailored to specific tasks. Recent studies have demonstrated that large vision-language models can enhance the performance of various downstream tasks in medical image understanding. In this study, we developed a domain-specific large vision-language model (PathologyVLM) for pathology image understanding. Specifically, (1) we first construct a human pathology image-text dataset by cleaning public medical image-text data for domain-specific alignment; (2) using the proposed image-text data, we first train a pathology language-image pretraining (PLIP) model as a specialized visual encoder to extract features from pathology images, and then develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PathologyVLM: the first stage for domain alignment, and the second for the end-to-end visual question answering (VQA) task. In experiments, we evaluate PathologyVLM on both supervised and zero-shot VQA datasets; our model achieved the best overall performance among multimodal models of similar scale. Ablation experiments also confirmed the effectiveness of our design. We posit that our PathologyVLM model and the datasets presented in this work can promote research in the field of computational pathology. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/PA-LLaVA
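The abstract does not detail how the scale-invariant connector works, but a common way to avoid rescaling images to one resolution is to pool a variable-sized grid of patch features down to a fixed token count before handing them to the language model. The sketch below illustrates that idea only; the function name `adaptive_pool_1d` and the pooling scheme are assumptions for illustration, not the paper's design.

```python
# Hedged sketch: map a variable-length sequence of patch-feature vectors to a
# fixed number of tokens via adaptive average pooling, so the language model
# always sees the same token budget regardless of the input image's size.

def adaptive_pool_1d(features, out_len):
    """Average n feature vectors (lists of floats) down to out_len vectors."""
    n = len(features)
    dim = len(features[0])
    pooled = []
    for i in range(out_len):
        # Each output slot averages an approximately equal slice of the input.
        start = (i * n) // out_len
        end = max(((i + 1) * n) // out_len, start + 1)
        window = features[start:end]
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled
```

With this kind of connector, a 7-patch image and a 70-patch image both yield the same number of visual tokens, which is one plausible reading of "avoiding the information loss caused by image scaling": the image is never resized, only its features are aggregated.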

Source journal: Artificial Intelligence Review (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 22.00
Self-citation rate: 3.30%
Articles per year: 194
Review time: 5.3 months
Journal description: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.