Ayush Roy, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu

Pattern Recognition Letters, Volume 196, Pages 100-108. Published 2025-06-04. DOI: 10.1016/j.patrec.2025.05.026
Split-net: Dual transformer encoder with splitting scene text image for script identification
Script identification is vital for understanding scene and video images. It is challenging due to high variation in physical appearance, typeface design, complex backgrounds, distortion, and significant overlap in the characteristics of different scripts. Unlike existing models, which process the scene text image as a whole, we propose to split the image into upper and lower halves to capture the intricate differences in stroke and style across scripts. Motivated by the success of transformers, a modified script-style-aware Mobile-Vision Transformer (M-ViT) is explored for encoding the visual features of the images. To enrich the features of the transformer blocks, a novel Edge Enhanced Style Aware Channel Attention Module (EESA-CAM) is integrated with M-ViT. Furthermore, the model fuses the features of the dual encoders (extracting features from the upper and lower halves of the images) by a dynamic weighted average procedure that uses the gradient information of the encoders as the weights. In experiments on three standard datasets, MLe2e, CVSI2015, and SIW-13, the proposed model yields superior performance compared to state-of-the-art models.
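The split-and-fuse idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: `toy_encoder` stands in for the M-ViT encoder, and the scalar gradient magnitudes passed to `fuse` are placeholders for the encoder gradient information the paper uses as fusion weights.

```python
import numpy as np

def split_halves(img):
    """Split an HxW image into upper and lower halves, as Split-net proposes."""
    h = img.shape[0] // 2
    return img[:h], img[h:]

def toy_encoder(x):
    """Placeholder for the paper's M-ViT encoder: mean-pool over rows."""
    return x.mean(axis=0)

def fuse(f_up, f_low, g_up, g_low):
    """Dynamic weighted average: gradient magnitudes act as fusion weights."""
    w_up = g_up / (g_up + g_low)
    w_low = g_low / (g_up + g_low)
    return w_up * f_up + w_low * f_low

img = np.arange(16.0).reshape(4, 4)
upper, lower = split_halves(img)
# Hypothetical gradient magnitudes; in the paper these come from the encoders.
fused = fuse(toy_encoder(upper), toy_encoder(lower), g_up=1.0, g_low=3.0)
print(fused)  # weighted blend of the two half-image feature vectors
```

The fusion weights sum to one, so the result stays on the same scale as the individual encoder features regardless of how the gradient magnitudes are distributed.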
Journal introduction:
Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.