Towards better text image machine translation with multimodal codebook and multi-stage training

Zhibin Lan, Jiawei Yu, Shiyu Liu, Junfeng Yao, Degen Huang, Jinsong Su

Neural Networks, Volume 189, Article 107599. Published 2025-05-23. DOI: 10.1016/j.neunet.2025.107599. Available at https://www.sciencedirect.com/science/article/pii/S0893608025004794
Abstract
Text image machine translation (TIMT), a widely used machine translation task, aims to translate the source texts embedded in an image into a target language. However, studies in this area face two challenges: (1) dominant models are constructed in a cascaded manner and thus suffer from error propagation from optical character recognition (OCR), and (2) the field lacks publicly available large-scale datasets. To deal with these issues, we propose a TIMT model based on a multimodal codebook. In addition to a text encoder, an image encoder, and a text decoder, our model is equipped with a multimodal codebook that effectively associates images with relevant texts, thus providing useful supplementary information for translation. In particular, we present a multi-stage training framework that fully exploits various datasets to train our model effectively. Concretely, we first conduct preliminary training of the text encoder and decoder on bilingual texts. Subsequently, via an additional code-conditioned mask translation task, we use the bilingual texts to continue training the text encoder, multimodal codebook, and decoder. Afterwards, by further introducing an image-text alignment task and adversarial training, we train the whole model except for the text decoder on the OCR dataset. Finally, using all of the above training tasks except text translation, we fine-tune the whole model on a TIMT dataset. In addition, we manually annotate a Chinese-English TIMT dataset, named OCRMT30K, and extend it to a Chinese-German TIMT dataset with an automatic translation tool. To the best of our knowledge, it is the first publicly available manually annotated TIMT dataset, which will facilitate future studies of this task. To investigate the effectiveness of our model, we conduct extensive experiments on Chinese-English and Chinese-German TIMT tasks. Experimental results and in-depth analyses strongly demonstrate the effectiveness of our model. We release our code and dataset at https://github.com/DeepLearnXMU/mc_tit.
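The abstract does not spell out how the multimodal codebook associates image features with text-related codes. A common way to realize such a component is a VQ-VAE-style learnable code table queried by nearest-neighbour lookup, with a straight-through estimator so the encoder still receives gradients. The sketch below is an illustration under that assumption only; the class name, sizes, and lookup scheme are ours, not the paper's verified implementation.

```python
import torch
import torch.nn as nn

class MultimodalCodebook(nn.Module):
    """Hypothetical VQ-style codebook: maps encoder features (image or
    text) to their nearest learned code vectors, which could then serve
    as supplementary context for the translation decoder. Sizes are
    illustrative, not the paper's settings."""

    def __init__(self, num_codes: int = 1024, dim: int = 512):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, features: torch.Tensor):
        # features: (batch, seq_len, dim) from the image or text encoder.
        flat = features.reshape(-1, features.size(-1))             # (batch*seq, dim)
        dists = torch.cdist(flat, self.codes.weight)               # distance to every code
        indices = dists.argmin(dim=-1).view(features.shape[:-1])   # nearest code ids
        quantized = self.codes(indices)                            # (batch, seq_len, dim)
        # Straight-through estimator: forward pass uses the code vectors,
        # backward pass copies gradients to the encoder features.
        quantized = features + (quantized - features).detach()
        return quantized, indices

# Example: quantize a batch of 7x7 image-patch features.
codebook = MultimodalCodebook()
image_feats = torch.randn(2, 49, 512)
quantized, code_ids = codebook(image_feats)  # shapes (2, 49, 512) and (2, 49)
```

In a code-conditioned mask translation stage like the one described above, retrieved code ids of this kind could condition the decoder; how the paper actually wires the codebook into training is specified in the full text, not here.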
Journal Overview
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.