Transformer-driven feature fusion network and visual feature coding for multi-label image classification

IF 7.6 · Zone 1, Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition · Pub Date: 2025-03-17 · DOI: 10.1016/j.patcog.2025.111584

Pingzhu Liu, Wenbin Qian, Jintao Huang, Yanqiang Tu, Yiu-Ming Cheung

Pattern Recognition, Volume 164, Article 111584. Journal article. https://www.sciencedirect.com/science/article/pii/S0031320325002444

Citations: 0

Abstract

Multi-label image classification (MLIC) has attracted extensive research attention in recent years. Nevertheless, most existing methods struggle to fuse multi-scale features effectively and to focus on critical visual information, which makes it difficult to recognize objects in images. Recent studies have also used graph convolutional networks and attention mechanisms to model label dependencies and thereby improve model performance. However, these methods often rely on manually predefined label structures, which limits their flexibility and generality, and they fail to capture intrinsic object correlations and spatial context within images. To address these challenges, we propose a novel Feature Fusion network combined with Transformer (FFTran) to fuse different visual features. First, to address the difficulty current methods have in recognizing small objects, we propose a Multi-level Scale Information Integration Mechanism (MSIIM) that fuses different feature maps from the backbone network. Second, we develop an Intra-Image Spatial-Channel Semantic Mining (ISCM) module that learns important spatial and channel information. Third, we design a Visual Feature Coding based on Transformer (VFCT) module that enhances contextual information by pooling different visual features. Compared with the baseline model, FFTran improves mean Average Precision (mAP) by 2.9% on VOC2007 and 5.1% on COCO2014, highlighting its superior performance in multi-label image classification tasks.
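For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of how mean Average Precision is commonly computed for multi-label classification: rank the images by score for each class, average the precision at each rank where a positive image appears, then average over classes. The function names and the toy data layout are assumptions made for illustration, and this is the plain non-interpolated definition. The abstract does not state the exact evaluation protocol the authors used (the VOC2007 benchmark, for example, traditionally uses an 11-point interpolated variant), so this should not be read as their evaluation code.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP for one class.

    scores[i] is the predicted confidence that image i contains the class;
    labels[i] is 1 if it really does, else 0.
    """
    # Rank image indices by descending confidence for this class.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Precision measured at the rank of each positive image.
            precision_sum += hits / rank
    # A class with no positive images contributes 0.0 by convention here.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    score_matrix: List[List[float]], label_matrix: List[List[int]]
) -> float:
    """mAP: the unweighted mean of per-class AP.

    Both matrices are laid out as [image][class].
    """
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        class_scores = [row[c] for row in score_matrix]
        class_labels = [row[c] for row in label_matrix]
        aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / num_classes
```

Calling `mean_average_precision(scores, labels)` with one row per image and one column per label returns a value between 0 and 1; papers usually report it multiplied by 100.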




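Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months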

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas such as biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.