Scene Text Image Super-Resolution Via Semantic Distillation and Text Perceptual Loss

IF 8.4 · Zone 1, Computer Science · Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2024-12-24 DOI:10.1109/TMM.2024.3521759

Cairong Zhao;Rui Shu;Shuyang Feng;Liang Zhu;Xuekuan Wang

Citations: 0

Abstract

Text Super-Resolution (SR) technology aims to recover lost information in low-resolution text images. Since the introduction of TextZoom, the first dataset for text super-resolution in real scenes, a growing number of scene text super-resolution models have been built on it. Although these methods achieve excellent performance, they do not consider how to make full and efficient use of semantic information. To address this, a Semantic-aware Trident Network (STNet) for scene text image super-resolution is proposed. Specifically, the pre-trained text recognition model ASTER (Attentional Scene Text Recognizer) is used to assist this process in two ways. First, a novel basic block, the Semantic-aware Trident Block (STB), is designed to build STNet; it incorporates an added branch for semantic distillation that learns semantic information from the pre-trained recognition model. Second, we extend our model with adversarial training and propose a new text perceptual loss based on ASTER to further enhance semantic information in SR images. Extensive experiments on the TextZoom dataset show that, compared with directly recognizing bicubic images, the proposed STNet boosts the recognition accuracy of ASTER, MORAN (Multi-Object Rectified Attention Network), and CRNN (Convolutional Recurrent Neural Network) by 17.4%, 18.2%, and 24.3%, respectively, a larger gain than several existing state-of-the-art (SOTA) SR network models achieve. In addition, experiments in real scenes (on the ICDAR 2015 dataset) and in restricted scenarios (defense against adversarial attacks) validate that adding semantic information enables the proposed method to achieve promising cross-dataset performance. Because the proposed method is trained on cropped images, applying it to real-world scenarios requires two preliminary steps: text in natural images is first localized with scene text detection methods, and cropped text images are then obtained from the detected text positions.
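The abstract names two ways a pre-trained recognizer feeds semantic information into the SR network: a distillation branch and a text perceptual loss. The following is a minimal sketch of how such loss terms could be combined, not the authors' implementation. The function names, the L1 and L2 distances, the loss weights, and the `toy_recognizer` stand-in for ASTER are all assumptions made for illustration; the adversarial term is omitted.

```python
import numpy as np


def toy_recognizer(img: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a frozen pre-trained recognizer such as ASTER.

    Returns one feature per image column (mean intensity), mimicking a
    feature sequence read along the text direction.
    """
    return img.mean(axis=0)


def pixel_loss(sr: np.ndarray, hr: np.ndarray) -> float:
    """Mean squared error between the SR output and the HR ground truth."""
    return float(np.mean((sr - hr) ** 2))


def semantic_distillation_loss(student_feat: np.ndarray,
                               teacher_feat: np.ndarray) -> float:
    """L1 distance between the SR network's semantic-branch features and
    the recognizer's features (the choice of L1 is an assumption)."""
    return float(np.mean(np.abs(student_feat - teacher_feat)))


def text_perceptual_loss(recognizer, sr: np.ndarray, hr: np.ndarray) -> float:
    """Squared distance between recognizer features of the SR and HR images
    (the choice of L2 is an assumption)."""
    return float(np.mean((recognizer(sr) - recognizer(hr)) ** 2))


def total_loss(sr, hr, student_feat, teacher_feat, recognizer,
               w_distill: float = 0.1, w_percep: float = 0.1) -> float:
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (pixel_loss(sr, hr)
            + w_distill * semantic_distillation_loss(student_feat, teacher_feat)
            + w_percep * text_perceptual_loss(recognizer, sr, hr))


if __name__ == "__main__":
    hr = np.ones((4, 8))    # toy high-resolution text crop
    sr = np.zeros((4, 8))   # toy super-resolved output
    loss = total_loss(sr, hr, toy_recognizer(sr), toy_recognizer(hr),
                      toy_recognizer)
    print(f"total loss: {loss:.3f}")
```

In this sketch the recognizer is treated as fixed: it only supplies target features, so the semantic terms penalize SR outputs whose recognizer features drift from those of the HR image, even when the pixel error is small.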

Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors, ensuring comprehensive coverage of multimedia research.