Sulan Zhang , Zhenwen Liao , Jianeng Li , Lihua Hu , Jifu Zhang
Pattern Recognition, Volume 172, Article 112439. Published 2025-09-12. DOI: 10.1016/j.patcog.2025.112439. https://www.sciencedirect.com/science/article/pii/S0031320325011008
A progressive attention network with transformer for multi-label image recognition
Recent research typically improves the performance of multi-label image recognition by constructing higher-order pairwise label correlations. However, these methods cannot effectively learn multi-scale features, which makes it difficult to distinguish small-scale objects. Moreover, most current attention-based methods that capture local salient features may ignore many useful non-salient features. To address these issues, we propose a Transformer-based Progressive Attention Network (TPANet) for multi-label image recognition. Specifically, we first design a new adaptive multi-scale feature attention (AMSA) module to learn cross-scale features from multi-level features. Then, to extract a variety of useful object features, we use the transformer encoder to construct a semantic spatial attention (ESA) module and also propose a context-aware feature enhanced (CAFE) module. The ESA module discovers complete object regions and captures discriminative features, while the CAFE module leverages object-local features to enhance pixel-level global features. The proposed TPANet model can generate more accurate object labels on three popular benchmark datasets (MS-COCO 2014, Pascal VOC 2007, and Visual Genome) and is competitive with state-of-the-art models such as SST and FL-Tran.
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.