On English-Chinese Neural Machine Translation leveraging Transformer model
Subrota Kumar Mondal, Yijun Chen, Yuning Cheng, Hong-Ning Dai, Syed B. Alam, H.M. Dipu Kabir
Natural Language Processing Journal, Volume 12, Article 100166. Published 2025-06-23. DOI: 10.1016/j.nlp.2025.100166
https://www.sciencedirect.com/science/article/pii/S2949719125000421
Citations: 0
Abstract
In today’s era of globalization, cross-cultural communication has become increasingly frequent, and photo translation (photo, image, or scene-text translation) technology has become an important tool. With this technology, people can recognize and translate text in other languages without manual input or translation, which has practical value in fields such as tourism, business, education, and research; photo translation has thus become an indispensable convenience in everyday life and work. To this end, this paper aims to achieve high-accuracy English-to-Chinese photo translation, which can be divided into three stages: text detection, text recognition, and text translation (i.e., machine translation). Text detection and recognition face challenges such as occluded text, handwritten text, scene text, text with complex layouts, and distorted text, among others; in this paper, however, we limit our analysis to the translation phase. For the detection and recognition phases, we use current state-of-the-art methods: the DBNet (Liao et al., 2020) model for detection and the ABINet (Fang et al., 2021) model for recognition. For translation, we use the Transformer model with modifications aimed at improving translation accuracy. The modifications concern two aspects: data preprocessing and the optimizer. In data preprocessing, we use the BPE (Byte Pair Encoding) algorithm instead of basic word-level tokenization. In this context, BPE divides words into smaller subwords, which mitigates the rare-word problem to some extent and provides better word vectors for training the language model.
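The abstract names BPE but does not show it. As an illustration only (not the authors' implementation), the following is a minimal sketch of BPE merge learning in the style of Sennrich et al.: the most frequent adjacent symbol pair is repeatedly fused into a new subword. The toy corpus, the `</w>` end-of-word marker, and function names such as `learn_bpe` are illustrative assumptions.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse each adjacent occurrence of `pair` into a single symbol."""
    a, b = pair
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # fused subword
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Greedily learn up to `num_merges` merge rules from a character-split vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges

# toy corpus: words pre-split into characters, </w> marks word ends
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
vocab, merges = learn_bpe(corpus, 10)
```

On this toy corpus the first learned merge is `("e", "s")`, since "es" occurs in both "newest" and "widest"; rare whole words are then represented as sequences of such learned subwords rather than as out-of-vocabulary tokens.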
For the optimizer, we use the Lion optimizer proposed by Google instead of the widely used Adam optimizer. Lion reduces the loss more quickly than Adam at small batch sizes; with a batch size of 256 it achieves the lowest test loss of 0.392842 (−1.072171) and the highest BLEU-4 score of 0.381281 (+0.24063). This helps reduce the consumption of training resources and improves the sustainability of deep learning.
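For readers unfamiliar with Lion (Chen et al., 2023, "Symbolic Discovery of Optimization Algorithms"): unlike Adam, it takes fixed-magnitude steps in the direction of the sign of an interpolation between the momentum and the current gradient. Below is a minimal scalar sketch of that update rule, not the paper's training code; the toy quadratic objective and the `lion_step` name are illustrative assumptions.

```python
def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update on a scalar parameter.

    Step direction is sign(beta1 * m + (1 - beta1) * grad); the momentum
    is then refreshed with a second, slower coefficient beta2, and `wd`
    applies decoupled weight decay.
    """
    c = beta1 * m + (1.0 - beta1) * grad      # interpolated direction
    sign = (c > 0) - (c < 0)                  # sign(c) in {-1, 0, 1}
    theta = theta - lr * (sign + wd * theta)  # fixed-magnitude step
    m = beta2 * m + (1.0 - beta2) * grad      # slow momentum update
    return theta, m

# toy run: minimize f(x) = x**2 starting from x = 3.0
theta, m = 3.0, 0.0
for _ in range(5000):
    grad = 2.0 * theta                        # df/dx
    theta, m = lion_step(theta, grad, m, lr=1e-3)
```

Because the step magnitude is always `lr` (plus decay), Lion tracks only one moment estimate instead of Adam's two, which is one reason it is attractive when training resources are constrained.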