IMTLM-Net: improved multi-task transformer based on localization mechanism network for handwritten English text recognition

IF 5; Zone 2 (Computer Science); Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Complex & Intelligent Systems, Pub Date: 2025-01-04, DOI: 10.1007/s40747-024-01713-8

Qianfeng Zhang, Feng Liu, Wanru Song


Citations: 0

Abstract

Intelligent technology is now widely used in education. For example, Optical Character Recognition (OCR) can support smart education scenarios such as online homework correction and teaching data analysis. A fundamental yet challenging task is to accurately recognize images of handwritten English text as editable text. The difficulty arises because handwriting varies with individual writing habits and often contains smearing and overlapping, which makes it hard to align the image with the real text. In addition, the scarcity of handwritten text data further lowers the recognition rate. To address these issues, this paper first extends the existing dataset and introduces hyphenated data annotation, providing data support for improving the robustness and discriminative ability of the model. Second, it proposes a novel framework for handwritten English text recognition, the Improved Multi-task Transformer based on Localization Mechanism Network (IMTLM-Net). IMTLM-Net consists of an encoding module and a decoding module. The encoding module introduces a dual-stream mechanism that processes text and images simultaneously: a Vision Transformer (VIT) encodes the images, and a Permutation Language Model (PLM) is designed for word arrangement. The decoding module employs two Multi-Head Attention (MHA) units, which focus on text sequences and image sequences, respectively. Moreover, a localization mechanism (LM) is applied to enhance the extraction of font structure features from image data, which in turn improves the model's ability to capture complex details. Extensive experiments demonstrate that the proposed method achieves state-of-the-art results in handwritten text recognition.
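The abstract gives no implementation details, so the following is only a minimal sketch of two ideas it names: permutation language modeling and a decoder that attends first to text and then to image tokens. Everything here is an assumption made for illustration, not the authors' code: the function names (`permutation_mask`, `attend`, `decode_position`) are hypothetical, a single attention head stands in for each MHA unit, the residual sum between the two attention steps is a guess, and the localization mechanism is not modeled at all.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def permutation_mask(order: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    `order` lists sequence positions in the order they are predicted, so a
    position may only see positions that come strictly earlier in `order`.
    (Assumed formulation of a permutation language model mask.)
    """
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


def attend(query: Vector, keys: Matrix, values: Matrix,
           allowed: List[bool]) -> Vector:
    """Single-head scaled dot-product attention over the allowed keys."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key, ok in zip(keys, allowed) if ok
    ]
    kept = [v for v, ok in zip(values, allowed) if ok]
    if not kept:  # nothing visible yet: return a zero context vector
        return [0.0] * len(values[0])
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, kept)) / total
        for d in range(len(kept[0]))
    ]


def decode_position(i: int, queries: Matrix, text_emb: Matrix,
                    image_tokens: Matrix,
                    mask: List[List[bool]]) -> Vector:
    """Text attention (masked) followed by image attention (unmasked)."""
    text_ctx = attend(queries[i], text_emb, text_emb, mask[i])
    # Assumed residual connection between the two attention steps.
    fused = [q + c for q, c in zip(queries[i], text_ctx)]
    all_visible = [True] * len(image_tokens)
    return attend(fused, image_tokens, image_tokens, all_visible)


if __name__ == "__main__":
    # Predict position 2 first, then 0, then 1.
    for row in permutation_mask([2, 0, 1]):
        print(row)
```

With the order `[0, 1, 2]` the mask reduces to the usual left-to-right causal mask; other orders expose each position to a different subset of the sequence, which is the property a permutation language model relies on.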




Source journal

Complex & Intelligent Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 9.60

Self-citation rate: 10.30%

Articles published: 297

Journal description: Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.