2LSPE: 2D Learnable Sinusoidal Positional Encoding using Transformer for Scene Text Recognition

Z. Raisi, Mohamed A. Naiel, Georges Younes, Steven Wardell, J. Zelek
DOI: 10.1109/CRV52889.2021.00024
Published in: 2021 18th Conference on Robots and Vision (CRV), May 2021
Citations: 6

Abstract

Positional Encoding (PE) plays a vital role in a Transformer’s ability to capture the order of sequential information, allowing it to overcome the permutation equivariance property. Recent state-of-the-art Transformer-based scene text recognition methods have leveraged the advantages of the 2D form of PE with fixed sinusoidal frequencies, also known as 2SPE, to better encode the 2D spatial dependencies of characters in a scene text image. These 2SPE-based Transformer frameworks have outperformed Recurrent Neural Network (RNN)-based methods, mostly on recognizing text of arbitrary shapes; however, they are not tailored to the type of data and classification task at hand. In this paper, we extend a recent Learnable Sinusoidal frequencies PE (LSPE) from 1D to 2D, which we hereafter refer to as 2LSPE, and study how to adaptively choose the sinusoidal frequencies from the input training data. Moreover, we show how to apply the proposed Transformer architecture for scene text recognition. We compare our method against 11 state-of-the-art methods and show that it outperforms them in over 50% of the standard tests and is no worse than the second-best performer, whereas we outperform all other methods on irregular-text datasets (i.e., non-horizontal or vertical layouts). Experimental results demonstrate that the proposed method offers higher word recognition accuracy (WRA) than two recent Transformer-based methods and eleven state-of-the-art RNN-based techniques on four challenging irregular-text recognition datasets, all while maintaining the highest WRA values on the regular-text datasets.
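The abstract above distinguishes fixed-frequency 2D sinusoidal PE (2SPE) from the proposed variant in which the sinusoidal frequencies are learned from the training data (2LSPE). The following NumPy sketch illustrates that distinction only; it is not the authors' implementation, and the channel layout (half the channels for the x axis, half for the y axis) and the default 1/10000^(2i/d) frequency schedule are assumptions carried over from the original 1D Transformer encoding.

```python
import numpy as np

def sinusoidal_pe_2d(height, width, d_model, freqs_x=None, freqs_y=None):
    """Build a 2D sinusoidal positional encoding of shape (height, width, d_model).

    Half of the channels encode the x (column) position and half the y (row)
    position, each as interleaved sin/cos pairs. `freqs_x` / `freqs_y` hold the
    per-channel angular frequencies; when None, the fixed 1/10000^(2i/d)
    schedule of the original Transformer is used (the 2SPE case). In a
    learnable variant such as 2LSPE, these vectors would instead be trainable
    parameters updated by backpropagation.
    """
    d_half = d_model // 2        # channels devoted to each spatial axis
    d_quarter = d_half // 2      # sin/cos pairs per axis
    if freqs_x is None:
        freqs_x = 1.0 / (10000 ** (np.arange(d_quarter) * 2.0 / d_half))
    if freqs_y is None:
        freqs_y = 1.0 / (10000 ** (np.arange(d_quarter) * 2.0 / d_half))

    xs, ys = np.arange(width), np.arange(height)
    ang_x = xs[:, None] * freqs_x[None, :]   # (width,  d_quarter)
    ang_y = ys[:, None] * freqs_y[None, :]   # (height, d_quarter)

    pe = np.zeros((height, width, d_model))
    pe[:, :, 0:d_half:2] = np.sin(ang_x)[None, :, :]      # x channels, sin
    pe[:, :, 1:d_half:2] = np.cos(ang_x)[None, :, :]      # x channels, cos
    pe[:, :, d_half::2] = np.sin(ang_y)[:, None, :]       # y channels, sin
    pe[:, :, d_half + 1::2] = np.cos(ang_y)[:, None, :]   # y channels, cos
    return pe

# Fixed-frequency (2SPE-like) encoding for an 8x32 feature map, 64 channels.
pe = sinusoidal_pe_2d(8, 32, 64)
print(pe.shape)  # (8, 32, 64)
```

In training, the learnable variant would initialize `freqs_x` and `freqs_y` as parameters (e.g., at the fixed schedule) and let gradient descent adapt them to the character layout statistics of the text images, which is the core idea the paper explores.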