Exploring coherence from heterogeneous representations for OCR image captioning

Impact Factor 3.5 · CAS Tier 3 (Computer Science) · JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS
Yao Zhang, Zijie Song, Zhenzhen Hu
{"title":"Exploring coherence from heterogeneous representations for OCR image captioning","authors":"Yao Zhang, Zijie Song, Zhenzhen Hu","doi":"10.1007/s00530-024-01470-1","DOIUrl":null,"url":null,"abstract":"<p>Text-based image captioning is an important task, aiming to generate descriptions based on reading and reasoning the scene texts in images. Text-based image contains both textual and visual information, which is difficult to be described comprehensively. Recent works fail to adequately model the relationship between features of different modalities and fine-grained alignment. Due to the multimodal characteristics of scene texts, the representations of text usually come from multiple encoders of visual and textual, leading to heterogeneous features. Though lots of works have paid attention to fuse features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence in scene text has not been fully exploited. In this paper, we propose Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and devote it to text-based image captioning. The HAM is designed to capture the coherence between different modalities of OCR tokens and provide context-aware scene text representations to generate accurate image captions. To the best of our knowledge, we are the first to apply the heterogeneous attention mechanism to explore the coherence in OCR tokens for text-based image captioning. By calculating the heterogeneous similarity, we interactively enhance the alignment between visual and textual information in OCR. We conduct the experiments on the TextCaps dataset. Under the same setting, the results show that our model achieves competitive performances compared with the advanced methods and ablation study demonstrates that our framework enhances the original model in all metrics.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"49 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Multimedia Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00530-024-01470-1","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Text-based image captioning is an important task that aims to generate descriptions by reading and reasoning over the scene texts in images. A text-based image contains both textual and visual information, which is difficult to describe comprehensively. Recent works fail to adequately model the relationship between features of different modalities and their fine-grained alignment. Due to the multimodal characteristics of scene texts, text representations usually come from multiple visual and textual encoders, leading to heterogeneous features. Although many works have focused on fusing features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence in scene text has not been fully exploited. In this paper, we propose a Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and apply it to text-based image captioning. The HAM is designed to capture the coherence between different modalities of OCR tokens and to provide context-aware scene text representations for generating accurate image captions. To the best of our knowledge, we are the first to apply a heterogeneous attention mechanism to explore the coherence of OCR tokens for text-based image captioning. By calculating the heterogeneous similarity, we interactively enhance the alignment between visual and textual information in OCR. We conduct experiments on the TextCaps dataset. Under the same setting, the results show that our model achieves competitive performance compared with advanced methods, and the ablation study demonstrates that our framework improves the original model on all metrics.
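The abstract describes aligning heterogeneous OCR token features (visual region features and textual embeddings) through a cross-modal similarity. The paper's actual architecture is not reproduced on this page; the following is a minimal, hedged PyTorch sketch of what such a heterogeneous attention module could look like. The class name, feature dimensions, and additive fusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of cross-modal attention between heterogeneous OCR token features.
# All names and dimensions here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeterogeneousAttention(nn.Module):
    """Aligns visual and textual OCR token features via scaled dot-product similarity."""

    def __init__(self, d_visual: int, d_textual: int, d_model: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(d_visual, d_model)   # project visual OCR features
        self.proj_t = nn.Linear(d_textual, d_model)  # project textual OCR features
        self.scale = d_model ** 0.5

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, n_tokens, d_visual); txt_feats: (batch, n_tokens, d_textual)
        v = self.proj_v(vis_feats)
        t = self.proj_t(txt_feats)
        # Pairwise heterogeneous similarity between visual and textual tokens.
        sim = torch.bmm(v, t.transpose(1, 2)) / self.scale        # (batch, n, n)
        # Visual tokens attend to textual tokens, and vice versa.
        v_enh = v + torch.bmm(F.softmax(sim, dim=-1), t)
        t_enh = t + torch.bmm(F.softmax(sim.transpose(1, 2), dim=-1), v)
        # Fuse the mutually enhanced representations into one OCR token embedding.
        return v_enh + t_enh


if __name__ == "__main__":
    ham = HeterogeneousAttention(d_visual=2048, d_textual=768)
    vis = torch.randn(2, 10, 2048)   # e.g. region features of detected OCR tokens
    txt = torch.randn(2, 10, 768)    # e.g. word-embedding features of the same tokens
    print(ham(vis, txt).shape)       # torch.Size([2, 10, 512])
```

The point mirrored here is that the similarity matrix couples the two modalities directly, so each fused OCR token embedding is context-aware with respect to both its visual appearance and its transcription.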


Source Journal

Multimedia Systems (Engineering & Technology — Computer Science: Theory & Methods)
CiteScore: 5.40
Self-citation rate: 7.70%
Articles per year: 148
Review time: 4.5 months
Journal description: This journal details innovative research ideas, emerging technologies, state-of-the-art methods and tools in all aspects of multimedia computing, communication, storage, and applications. It features theoretical, experimental, and survey articles.