{"title":"Exploring coherence from heterogeneous representations for OCR image captioning","authors":"Yao Zhang, Zijie Song, Zhenzhen Hu","doi":"10.1007/s00530-024-01470-1","DOIUrl":null,"url":null,"abstract":"<p>Text-based image captioning is an important task, aiming to generate descriptions based on reading and reasoning the scene texts in images. Text-based image contains both textual and visual information, which is difficult to be described comprehensively. Recent works fail to adequately model the relationship between features of different modalities and fine-grained alignment. Due to the multimodal characteristics of scene texts, the representations of text usually come from multiple encoders of visual and textual, leading to heterogeneous features. Though lots of works have paid attention to fuse features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence in scene text has not been fully exploited. In this paper, we propose Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and devote it to text-based image captioning. The HAM is designed to capture the coherence between different modalities of OCR tokens and provide context-aware scene text representations to generate accurate image captions. To the best of our knowledge, we are the first to apply the heterogeneous attention mechanism to explore the coherence in OCR tokens for text-based image captioning. By calculating the heterogeneous similarity, we interactively enhance the alignment between visual and textual information in OCR. We conduct the experiments on the TextCaps dataset. Under the same setting, the results show that our model achieves competitive performances compared with the advanced methods and ablation study demonstrates that our framework enhances the original model in all metrics.</p>","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00530-024-01470-1","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
Text-based image captioning is an important task that aims to generate descriptions by reading and reasoning about the scene texts in images. Text-based images contain both textual and visual information, which is difficult to describe comprehensively. Recent works fail to adequately model the relationship between features of different modalities and their fine-grained alignment. Due to the multimodal nature of scene texts, text representations usually come from multiple visual and textual encoders, leading to heterogeneous features. Although many works have focused on fusing features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence in scene text has not been fully exploited. In this paper, we propose the Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and apply it to text-based image captioning. The HAM is designed to capture the coherence between different modalities of OCR tokens and to provide context-aware scene text representations for generating accurate image captions. To the best of our knowledge, we are the first to apply a heterogeneous attention mechanism to explore the coherence in OCR tokens for text-based image captioning. By calculating the heterogeneous similarity, we interactively enhance the alignment between the visual and textual information of OCR tokens. We conduct experiments on the TextCaps dataset. Under the same setting, the results show that our model achieves competitive performance compared with advanced methods, and the ablation study demonstrates that our framework improves the original model on all metrics.
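The abstract does not specify HAM's internal architecture, so the following is only a minimal sketch of the general idea it describes: each OCR token carries a visual feature and a textual feature, the two are projected into a shared space, a heterogeneous (cross-modal) similarity matrix is computed between them, and attention in both directions yields context-aware OCR token representations. All layer names, dimensions, and the fusion scheme below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of heterogeneous cross-modal attention over OCR tokens.
# Assumes each OCR token has a visual feature (e.g. from a detector) and a
# textual feature (e.g. a word embedding); everything here is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousAttention(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, hid_dim: int = 512):
        super().__init__()
        # Project both modalities into a shared space before measuring similarity.
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.out_proj = nn.Linear(2 * hid_dim, hid_dim)
        self.scale = hid_dim ** 0.5

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        """
        vis_feats: (batch, num_ocr, vis_dim)  visual appearance of OCR tokens
        txt_feats: (batch, num_ocr, txt_dim)  textual embedding of OCR tokens
        returns:   (batch, num_ocr, hid_dim)  context-aware OCR representations
        """
        v = self.vis_proj(vis_feats)  # (B, N, H)
        t = self.txt_proj(txt_feats)  # (B, N, H)

        # Heterogeneous similarity: textual tokens attend over visual tokens
        # and vice versa, interactively enhancing cross-modal alignment.
        sim = torch.matmul(t, v.transpose(1, 2)) / self.scale              # (B, N, N)
        t_to_v = torch.matmul(F.softmax(sim, dim=-1), v)                   # text -> vision
        v_to_t = torch.matmul(F.softmax(sim.transpose(1, 2), dim=-1), t)   # vision -> text

        # Fuse both directions (with residual connections) into one enhanced
        # representation per OCR token.
        fused = self.out_proj(torch.cat([t_to_v + t, v_to_t + v], dim=-1))
        return fused


if __name__ == "__main__":
    # Toy shapes: 4 images, 20 OCR tokens each, 2048-d visual / 300-d textual features.
    ham = HeterogeneousAttention(vis_dim=2048, txt_dim=300)
    out = ham(torch.randn(4, 20, 2048), torch.randn(4, 20, 300))
    print(out.shape)  # torch.Size([4, 20, 512])
```

In a captioning pipeline such as the one the abstract implies, the fused OCR token representations would replace (or augment) the original per-modality features fed to the caption decoder; the residual-plus-concatenation fusion here is just one plausible choice.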