{"title":"2S-DFN: Dual-semantic Decoding Fusion Networks for Fine-grained Image Recognition","authors":"Pufen Zhang, Peng Shi, Song Zhang","doi":"10.1109/icme55011.2023.00012","DOIUrl":null,"url":null,"abstract":"In previous fine-grained image recognition (FGIR) methods, the single global or local semantic fusion view may not be comprehensive to reveal the semantic associations between image and text. Besides, the encoding fusion strategy cannot fuse the semantics finely because the low-order text semantic dependence and the irrelevant semantic concepts are fused. To address these issues, a novel Dual-Semantic Decoding Fusion Networks (2S-DFN) is proposed for FGIR. Specifically, a multilayer text semantic encoder is first constructed to extract the higher-order semantics dependence among text. To obtain sufficient semantic association, two decoding semantic fusion streams are symmetrically designed from the global and local perspectives. Moreover, by decoding way to implant text features to semantic fusion layer as well as cascading it deeply, two streams fuse the semantics of text and image finely. Extensive experiments demonstrate that the effectiveness of the proposed method and 2S-DFN attains the state-of-the-art results on two benchmark datasets.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"12 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icme55011.2023.00012","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
In previous fine-grained image recognition (FGIR) methods, a single global or local semantic fusion view may not be comprehensive enough to reveal the semantic associations between image and text. Moreover, an encoding fusion strategy cannot fuse semantics finely, because it fuses low-order text semantic dependencies together with irrelevant semantic concepts. To address these issues, a novel Dual-Semantic Decoding Fusion Network (2S-DFN) is proposed for FGIR. Specifically, a multilayer text semantic encoder is first constructed to extract higher-order semantic dependencies within the text. To obtain sufficient semantic associations, two decoding semantic fusion streams are symmetrically designed from the global and local perspectives. Moreover, by implanting text features into the semantic fusion layers in a decoding manner and cascading these layers deeply, the two streams fuse the semantics of text and image finely. Extensive experiments demonstrate the effectiveness of the proposed method, and 2S-DFN attains state-of-the-art results on two benchmark datasets.
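The abstract does not give implementation details, but the "decoding" fusion it describes is commonly realized as decoder-style cross-attention, where image features act as queries over the encoded text features. The minimal NumPy sketch below illustrates that idea under assumed shapes (one global image token, 49 local patch tokens, 12 text tokens, dimension 64); all names and sizes are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Decoder-style fusion: image tokens (queries) attend over text tokens."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)       # (n_img, n_text)
    return softmax(scores, axis=-1) @ values     # (n_img, d)

rng = np.random.default_rng(0)
text = rng.standard_normal((12, 64))         # higher-order text semantics (assumed encoder output)
global_img = rng.standard_normal((1, 64))    # global image feature
local_img = rng.standard_normal((49, 64))    # local patch features

# Two symmetric decoding fusion streams, one global and one local;
# the residual connection keeps the original image semantics.
global_fused = global_img + cross_attention(global_img, text, text)
local_fused = local_img + cross_attention(local_img, text, text)

print(global_fused.shape, local_fused.shape)  # (1, 64) (49, 64)
```

In a full model this fusion layer would be cascaded several times per stream, as the abstract's "cascading it deeply" suggests, with learned projections for queries, keys, and values rather than the raw features used here.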