Xiulong Yi , You Fu , Enxu Bi , Jianguo Liang , Hao Zhang , Jianzhi Yu , Qianqian Li , Rong Hua , Rui Wang
{"title":"Radiology report generation via visual-semantic ambivalence-aware network and focal self-critical sequence training","authors":"Xiulong Yi , You Fu , Enxu Bi , Jianguo Liang , Hao Zhang , Jianzhi Yu , Qianqian Li , Rong Hua , Rui Wang","doi":"10.1016/j.neunet.2025.108102","DOIUrl":null,"url":null,"abstract":"<div><div>Radiology report generation, which aims to provide accurate descriptions of both normal and abnormal regions, has been attracting growing research attention. Recently, despite considerable progress, data-driven deep-learning based models still face challenges in capturing and describing the abnormalities, due to the data bias problem. To address this problem, we propose to generate radiology reports via the Visual-Semantic Ambivalence-Aware Network (VSANet) and the Focal Self-Critical Sequence Training (FSCST). In detail, our VSANet follows the encoder-decoder framework. In the encoder part, we first deploy a multi-grained abnormality extractor and a visual extractor to capture both semantic and visual features from given images, and then introduce a Parameter Shared Dual-way Encoder (PSDwE) to delve into the inter- and intra-relationships among these features. In the decoder part, we propose the Visual-Semantic Ambivalence-Aware (VSA) module to generate the abnormality-aware visual features to mitigate the data bias problem. In implementation, our VSA introduces three sub-modules: Dual-way Attention (DwA), introduced to generate both the word-related visual and semantic features; Dual-way Attention on Attention (DwAoA), designed to mitigate redundant information; Score-based Feature Fusion (SFF), constructed to fuse the visual and semantic features in an ambivalence way. We further introduce the FSCST to enhance the overall performance of our VSANet by allocating more attention toward difficult samples. Experimental results demonstrate that our proposal achieves superior performance on various evaluation metrics. Source code have released at <span><span>https://github.com/SKD-HPC/VSANet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"194 ","pages":"Article 108102"},"PeriodicalIF":6.3000,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025009827","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Radiology report generation, which aims to provide accurate descriptions of both normal and abnormal regions, has been attracting growing research attention. Recently, despite considerable progress, data-driven deep-learning based models still face challenges in capturing and describing the abnormalities, due to the data bias problem. To address this problem, we propose to generate radiology reports via the Visual-Semantic Ambivalence-Aware Network (VSANet) and the Focal Self-Critical Sequence Training (FSCST). In detail, our VSANet follows the encoder-decoder framework. In the encoder part, we first deploy a multi-grained abnormality extractor and a visual extractor to capture both semantic and visual features from given images, and then introduce a Parameter Shared Dual-way Encoder (PSDwE) to delve into the inter- and intra-relationships among these features. In the decoder part, we propose the Visual-Semantic Ambivalence-Aware (VSA) module to generate the abnormality-aware visual features to mitigate the data bias problem. In implementation, our VSA introduces three sub-modules: Dual-way Attention (DwA), introduced to generate both the word-related visual and semantic features; Dual-way Attention on Attention (DwAoA), designed to mitigate redundant information; Score-based Feature Fusion (SFF), constructed to fuse the visual and semantic features in an ambivalence way. We further introduce the FSCST to enhance the overall performance of our VSANet by allocating more attention toward difficult samples. Experimental results demonstrate that our proposal achieves superior performance on various evaluation metrics. Source code have released at https://github.com/SKD-HPC/VSANet.
期刊介绍:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.