Multi-Granularity Prediction for Scene Text Recognition

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision. Pub Date: 2022-09-08. DOI: 10.48550/arXiv.2209.03592

P. Wang, Cheng Da, C. Yao

Pages: 339-355

Citations: 19

Abstract

Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e., subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character-level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks. Code will be released soon.
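The abstract says that character, BPE, and WordPiece representations share the output space, but it does not say how the three predictions are combined into one result. The following is only a minimal sketch of one plausible combination rule, not the authors' implementation: it assumes each granularity yields a decoded token sequence with per-token probabilities, scores a sequence by the product of those probabilities, and keeps the highest-scoring one. The class `GranularityPrediction`, the function `fuse`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class GranularityPrediction:
    """Decoded output of one prediction head (all values are illustrative)."""
    granularity: str    # "char", "bpe", or "wordpiece"
    tokens: List[str]   # decoded pieces, tokenizer markers already stripped
    probs: List[float]  # probability of the chosen piece at each step

    @property
    def text(self) -> str:
        # Concatenating the pieces gives the recognized word.
        return "".join(self.tokens)

    @property
    def confidence(self) -> float:
        # Assumed scoring rule: product of the per-token probabilities.
        return prod(self.probs)


def fuse(predictions: List[GranularityPrediction]) -> GranularityPrediction:
    """Return the prediction whose sequence confidence is highest."""
    return max(predictions, key=lambda p: p.confidence)


# Hypothetical outputs for one image of the word "table".
# The character head confuses "l" with "1"; the subword heads do not.
predictions = [
    GranularityPrediction("char", ["t", "a", "b", "1", "e"], [0.9, 0.9, 0.9, 0.5, 0.9]),
    GranularityPrediction("bpe", ["tab", "le"], [0.85, 0.9]),
    GranularityPrediction("wordpiece", ["table"], [0.7]),
]

best = fuse(predictions)
print(best.granularity, best.text)  # prints "bpe table"
```

In this toy case the subword heads act as an implicit linguistic prior: a single low-confidence character drags the character-level score down, so a subword-level reading is selected without consulting a separate language model.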

