2LSPE: 2D Learnable Sinusoidal Positional Encoding using Transformer for Scene Text Recognition

Z. Raisi, Mohamed A. Naiel, Georges Younes, Steven Wardell, J. Zelek
DOI: 10.1109/CRV52889.2021.00024
Published in: 2021 18th Conference on Robots and Vision (CRV), May 2021
Citations: 6

Abstract

Positional Encoding (PE) plays a vital role in a Transformer’s ability to capture the order of sequential information, allowing it to overcome the permutation equivariance property. Recent state-of-the-art Transformer-based scene text recognition methods have leveraged the advantages of the 2D form of PE with fixed sinusoidal frequencies, also known as 2SPE, to better encode the 2D spatial dependencies of characters in a scene text image. These 2SPE-based Transformer frameworks have outperformed Recurrent Neural Network (RNN)-based methods, mostly on recognizing text of arbitrary shapes; however, they are not tailored to the type of data and classification task at hand. In this paper, we extend a recent Learnable Sinusoidal frequencies PE (LSPE) from 1D to 2D, which we hereafter refer to as 2LSPE, and study how to adaptively choose the sinusoidal frequencies from the input training data. Moreover, we show how to apply the proposed Transformer architecture for scene text recognition. We compare our method against 11 state-of-the-art methods and show that it outperforms them in over 50% of the standard tests and is no worse than the second-best performer, whereas we outperform all other methods on irregular-text datasets (i.e., non-horizontal or vertical layouts). Experimental results demonstrate that the proposed method offers higher word recognition accuracy (WRA) than two recent Transformer-based methods and eleven state-of-the-art RNN-based techniques on four challenging irregular-text recognition datasets, all while maintaining the highest WRA values on the regular-text datasets.
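The abstract above distinguishes fixed-frequency 2D sinusoidal PE (2SPE) from the proposed variant in which the sinusoidal frequencies are learned from the training data (2LSPE). The following NumPy sketch illustrates that distinction only; it is not the authors' implementation, and the channel layout (half the channels for the x axis, half for the y axis) and the default 1/10000^(2i/d) frequency schedule are assumptions carried over from the original 1D Transformer encoding.

```python
import numpy as np

def sinusoidal_pe_2d(height, width, d_model, freqs_x=None, freqs_y=None):
    """Build a 2D sinusoidal positional encoding of shape (height, width, d_model).

    Half of the channels encode the x (column) position and half the y (row)
    position, each as interleaved sin/cos pairs. `freqs_x` / `freqs_y` hold the
    per-channel angular frequencies; when None, the fixed 1/10000^(2i/d)
    schedule of the original Transformer is used (the 2SPE case). In a
    learnable variant such as 2LSPE, these vectors would instead be trainable
    parameters updated by backpropagation.
    """
    d_half = d_model // 2        # channels devoted to each spatial axis
    d_quarter = d_half // 2      # sin/cos pairs per axis
    if freqs_x is None:
        freqs_x = 1.0 / (10000 ** (np.arange(d_quarter) * 2.0 / d_half))
    if freqs_y is None:
        freqs_y = 1.0 / (10000 ** (np.arange(d_quarter) * 2.0 / d_half))

    xs, ys = np.arange(width), np.arange(height)
    ang_x = xs[:, None] * freqs_x[None, :]   # (width,  d_quarter)
    ang_y = ys[:, None] * freqs_y[None, :]   # (height, d_quarter)

    pe = np.zeros((height, width, d_model))
    pe[:, :, 0:d_half:2] = np.sin(ang_x)[None, :, :]      # x channels, sin
    pe[:, :, 1:d_half:2] = np.cos(ang_x)[None, :, :]      # x channels, cos
    pe[:, :, d_half::2] = np.sin(ang_y)[:, None, :]       # y channels, sin
    pe[:, :, d_half + 1::2] = np.cos(ang_y)[:, None, :]   # y channels, cos
    return pe

# Fixed-frequency (2SPE-like) encoding for an 8x32 feature map, 64 channels.
pe = sinusoidal_pe_2d(8, 32, 64)
print(pe.shape)  # (8, 32, 64)
```

In training, the learnable variant would initialize `freqs_x` and `freqs_y` as parameters (e.g., at the fixed schedule) and let gradient descent adapt them to the character layout statistics of the text images, which is the core idea the paper explores.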