Title: Transformer-driven feature fusion network and visual feature coding for multi-label image classification
Authors: Pingzhu Liu, Wenbin Qian, Jintao Huang, Yanqiang Tu, Yiu-Ming Cheung
Journal: Pattern Recognition, volume 164, Article 111584 (Journal Article)
DOI: 10.1016/j.patcog.2025.111584
Published: 2025-03-17
URL: https://www.sciencedirect.com/science/article/pii/S0031320325002444
Impact factor: 7.5 (JCR Q1, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Multi-label image classification (MLIC) has attracted extensive research attention in recent years. Nevertheless, most existing methods struggle to fuse multi-scale features effectively and to focus on critical visual information, which makes it difficult to recognize objects in images. In addition, recent studies have used graph convolutional networks and attention mechanisms to model label dependencies and thereby improve model performance. However, these methods often rely on manually predefined label structures, which limits their flexibility and generality, and they fail to capture intrinsic object correlations within images and spatial contexts. To address these challenges, we propose a novel Feature Fusion network combined with Transformer (FFTran) that fuses different visual features. First, to address the difficulty current methods have in recognizing small objects, we propose a Multi-level Scale Information Integration Mechanism (MSIIM) that fuses different feature maps from the backbone network. Second, we develop an Intra-Image Spatial-Channel Semantic Mining (ISCM) module for learning important spatial and channel information. Third, we design a Visual Feature Coding based on Transformer (VFCT) module that enhances contextual information by pooling different visual features. Compared with the baseline model, FFTran improves mean Average Precision (mAP) by 2.9% on VOC2007 and 5.1% on COCO2014, highlighting its superior performance in multi-label image classification tasks.
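For readers unfamiliar with the mAP metric cited in the abstract, the following is a minimal sketch of how it is commonly computed for multi-label classification: average precision (AP) is computed per class from the ranked confidence scores, then averaged over classes. The abstract does not state the exact evaluation protocol (for example, whether an interpolated VOC-style AP is used), so the non-interpolated AP here is an assumption, and the function names and toy data are illustrative only.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: per-image confidence for this class.
    labels: per-image ground truth (1 = class present, 0 = absent).
    Returns None when the class has no positive image (AP undefined).
    """
    # Rank images by descending confidence for this class.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Precision measured at the rank of each positive image.
            precision_sum += hits / rank
    return precision_sum / hits if hits else None


def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes; rows are images, columns are classes."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision(
            [row[c] for row in score_matrix],
            [row[c] for row in label_matrix],
        )
        if ap is not None:  # skip classes with no positives
            aps.append(ap)
    if not aps:
        raise ValueError("no class has a positive example")
    return sum(aps) / len(aps)


# Toy data (hypothetical): 4 images, 2 classes.
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_map = mean_average_precision(toy_scores, toy_labels)
```

On the toy data, class 0 has AP (1/1 + 2/3) / 2 = 5/6 and class 1 has AP 1, so the mAP is 11/12. A reported gain such as the 2.9% on VOC2007 is a difference in this class-averaged quantity between two models.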
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.