{"title":"Enhancing vision–language contrastive representation learning using domain knowledge","authors":"Xiaoyang Wei, Camille Kurtz, Florence Cloppet","doi":"10.1016/j.cviu.2025.104403","DOIUrl":null,"url":null,"abstract":"<div><div>Visual representation learning plays a key role in solving medical computer vision tasks. Recent advances in the literature often rely on vision–language models aiming to learn the representation of medical images from the supervision of paired captions in a label-free manner. The training of such models is however very data/time intensive and the alignment strategies involved in the contrastive loss functions may not capture the full richness of information carried by inter-data relationships. We assume here that considering expert knowledge from the medical domain can provide solutions to these problems during model optimization. To this end, we propose a novel knowledge-augmented vision–language contrastive representation learning framework consisting of the following steps: (1) Modeling the hierarchical relationships between various medical concepts using expert knowledge and medical images in a dataset through a knowledge graph, followed by translating each node into a knowledge embedding; And (2) integrating knowledge embeddings into a vision–language contrastive learning framework, either by introducing an additional alignment loss between visual and knowledge embeddings or by relaxing binary constraints of vision–language alignment using knowledge embeddings. Our results demonstrate that the proposed solution achieves competitive performances against state-of-the-art approaches for downstream tasks while requiring significantly less training data. Our code is available at <span><span>https://github.com/Wxy-24/KL-CVR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104403"},"PeriodicalIF":3.5000,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001262","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Visual representation learning plays a key role in solving medical computer vision tasks. Recent advances in the literature often rely on vision–language models that aim to learn representations of medical images from the supervision of paired captions in a label-free manner. Training such models is, however, very data- and time-intensive, and the alignment strategies involved in the contrastive loss functions may not capture the full richness of information carried by inter-data relationships. We assume here that considering expert knowledge from the medical domain can provide solutions to these problems during model optimization. To this end, we propose a novel knowledge-augmented vision–language contrastive representation learning framework consisting of the following steps: (1) modeling the hierarchical relationships between various medical concepts using expert knowledge and medical images in a dataset through a knowledge graph, followed by translating each node into a knowledge embedding; and (2) integrating knowledge embeddings into a vision–language contrastive learning framework, either by introducing an additional alignment loss between visual and knowledge embeddings or by relaxing the binary constraints of vision–language alignment using knowledge embeddings. Our results demonstrate that the proposed solution achieves competitive performance against state-of-the-art approaches on downstream tasks while requiring significantly less training data. Our code is available at https://github.com/Wxy-24/KL-CVR.
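The abstract names two ways of injecting knowledge embeddings into a CLIP-style contrastive objective. The sketch below (plain PyTorch, not taken from the KL-CVR repository) illustrates how such terms could be written under that reading: an extra image–knowledge alignment loss, and a variant that replaces the one-hot alignment targets with soft targets derived from knowledge-embedding similarity. Function names, temperatures, and the soft-target construction are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: knowledge-augmented contrastive losses, assuming per-sample
# image, text, and knowledge embeddings of equal dimension.
import torch
import torch.nn.functional as F


def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric InfoNCE loss with one-hot (binary) alignment targets."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def knowledge_alignment_loss(img_emb, kg_emb, temperature=0.07):
    """Strategy (1): an additional contrastive term aligning each image with the
    knowledge embedding of its associated concept node."""
    img_emb = F.normalize(img_emb, dim=-1)
    kg_emb = F.normalize(kg_emb, dim=-1)
    logits = img_emb @ kg_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)


def soft_target_clip_loss(img_emb, txt_emb, kg_emb, temperature=0.07, kg_temperature=0.1):
    """Strategy (2): relax the binary targets by using similarities between
    knowledge embeddings as soft alignment targets."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    kg_emb = F.normalize(kg_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    # Pairs whose knowledge embeddings are close receive partial positive weight.
    soft_targets = F.softmax(kg_emb @ kg_emb.t() / kg_temperature, dim=-1)
    loss_i2t = torch.sum(-soft_targets * F.log_softmax(logits, dim=-1), dim=-1).mean()
    loss_t2i = torch.sum(-soft_targets * F.log_softmax(logits.t(), dim=-1), dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    B, D = 8, 256  # batch size and embedding dimension (illustrative)
    img, txt, kg = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    total = clip_loss(img, txt) + 0.5 * knowledge_alignment_loss(img, kg)
    print(total.item(), soft_target_clip_loss(img, txt, kg).item())
```

The weighting between the vision–language and vision–knowledge terms, and the temperature used to build soft targets, are hyperparameters left unspecified here; consult the linked repository for the actual formulation.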
Journal description:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems