Che Liu;Sibo Cheng;Miaojing Shi;Anand Shah;Wenjia Bai;Rossella Arcucci
{"title":"IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-Training","authors":"Che Liu;Sibo Cheng;Miaojing Shi;Anand Shah;Wenjia Bai;Rossella Arcucci","doi":"10.1109/TMI.2024.3449690","DOIUrl":null,"url":null,"abstract":"In medical Vision-Language Pre-training (VLP), significant work focuses on extracting text and image features from clinical reports and medical images. Yet, existing methods may overlooked the potential of the natural hierarchical structure in clinical reports, typically divided into ‘findings’ for description and ‘impressions’ for conclusions. Current VLP approaches tend to oversimplify these reports into a single entity or fragmented tokens, ignoring this structured format. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from the chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge in formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Experimental results show benefits of using hierarchical structures in medical reports for VLP. Code: \n<uri>https://github.com/cheliu-computation/IMITATE-TMI2024</uri>\n.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 1","pages":"519-529"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10646593/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In medical Vision-Language Pre-training (VLP), significant work focuses on extracting text and image features from clinical reports and medical images. Yet, existing methods may overlook the potential of the natural hierarchical structure in clinical reports, which are typically divided into ‘findings’ for description and ‘impressions’ for conclusions. Current VLP approaches tend to oversimplify these reports into a single entity or fragmented tokens, ignoring this structured format. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn structural information from medical reports through hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text of the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge when formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets spanning five medical imaging downstream tasks. Experimental results demonstrate the benefits of using the hierarchical structure of medical reports for VLP.

Code: https://github.com/cheliu-computation/IMITATE-TMI2024
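To make the hierarchical alignment idea concrete, the sketch below shows one possible way to align multi-level visual features with ‘findings’ and ‘impressions’ text embeddings using a contrastive objective. This is a minimal illustration based only on the abstract: the module names, projection dimensions, and the optional soft-target weighting (standing in for the clinical-informed contrastive loss) are assumptions, not the authors' released implementation; for that, see the repository linked above.

```python
# Hypothetical sketch of hierarchical vision-language alignment with a contrastive loss.
# All names, dimensions, and the soft-target weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(img_emb, txt_emb, temperature=0.07, soft_targets=None):
    """Symmetric InfoNCE. `soft_targets` (optional, shape [B, B]) encodes prior
    sample correlations instead of a strict one-hot identity; this is an assumed
    stand-in for the paper's clinical-informed contrastive loss."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    if soft_targets is None:
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
    log_p_i2t = F.log_softmax(logits, dim=-1)
    log_p_t2i = F.log_softmax(logits.t(), dim=-1)
    return -((soft_targets * log_p_i2t).sum(-1).mean() + (soft_targets.t() * log_p_t2i).sum(-1).mean()) / 2


class HierarchicalAlignment(nn.Module):
    """Aligns lower-level (multi-scale) visual features with descriptive 'findings'
    text and pooled global visual features with conclusive 'impressions' text."""

    def __init__(self, vis_dim=768, txt_dim=768, proj_dim=256):
        super().__init__()
        self.low_proj = nn.Linear(vis_dim, proj_dim)    # descriptive (findings) branch
        self.high_proj = nn.Linear(vis_dim, proj_dim)   # conclusive (impressions) branch
        self.find_proj = nn.Linear(txt_dim, proj_dim)
        self.imp_proj = nn.Linear(txt_dim, proj_dim)

    def forward(self, vis_low, vis_high, findings_emb, impressions_emb, prior=None):
        # Separate alignment of the two visual levels with the two report sections.
        loss_low = info_nce(self.low_proj(vis_low), self.find_proj(findings_emb), soft_targets=prior)
        loss_high = info_nce(self.high_proj(vis_high), self.imp_proj(impressions_emb), soft_targets=prior)
        return loss_low + loss_high


if __name__ == "__main__":
    # Toy usage with random tensors standing in for image- and text-encoder outputs.
    B = 8
    model = HierarchicalAlignment()
    vis_low, vis_high = torch.randn(B, 768), torch.randn(B, 768)
    findings, impressions = torch.randn(B, 768), torch.randn(B, 768)
    print(model(vis_low, vis_high, findings, impressions).item())
```

In this sketch, passing a `prior` matrix that softens the diagonal targets lets pairs of samples with similar clinical content contribute partial positive signal, which is one plausible reading of "accounting for clinical prior knowledge in formulating sample correlations."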