Ruijie Yan, Liangrui Peng, Shanyu Xiao, Gang Yao, Jaesik Min
{"title":"MEAN:用于场景文本识别的多元素注意网络","authors":"Ruijie Yan, Liangrui Peng, Shanyu Xiao, Gang Yao, Jaesik Min","doi":"10.1109/ICPR48806.2021.9413166","DOIUrl":null,"url":null,"abstract":"Scene text recognition is a challenging problem due to the wide variances in contents, styles, orientations, and image quality of text instances in natural scene images. To learn the intrinsic representation of scene texts, a novel multi-element attention (MEA) mechanism is proposed to exploit geometric structures from local to global levels in feature maps extracted from a scene text image. The MEA mechanism is a generalized form of self-attention technique. The elements in feature maps are taken as the nodes of an undirected graph, and three kinds of adjacency matrices are designed to aggregate information at local, neighborhood and global levels before calculating the attention weights. A multi-element attention network (MEAN) is implemented, which includes a CNN for feature extraction, an encoder with MEA mechanism and a decoder for predicting text codes. Orientational positional encoding is added to feature maps output by the CNN, and a feature vector sequence transformed from the feature maps is used as the input of the encoder. Experimental results show that MEAN has achieved state-of-the-art or competitive performance on seven public English scene text datasets (IIITSk, SVT, IC03, IC13, IC15, SVTP, and CUTE). 
Further experiments have been conducted on a selected subset of the RCTW Chinese scene text dataset, demonstrating that MEAN can handle horizontal, vertical, and irregular scene text samples.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"24 1","pages":"1-8"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"MEAN: Multi - Element Attention Network for Scene Text Recognition\",\"authors\":\"Ruijie Yan, Liangrui Peng, Shanyu Xiao, Gang Yao, Jaesik Min\",\"doi\":\"10.1109/ICPR48806.2021.9413166\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Scene text recognition is a challenging problem due to the wide variances in contents, styles, orientations, and image quality of text instances in natural scene images. To learn the intrinsic representation of scene texts, a novel multi-element attention (MEA) mechanism is proposed to exploit geometric structures from local to global levels in feature maps extracted from a scene text image. The MEA mechanism is a generalized form of self-attention technique. The elements in feature maps are taken as the nodes of an undirected graph, and three kinds of adjacency matrices are designed to aggregate information at local, neighborhood and global levels before calculating the attention weights. A multi-element attention network (MEAN) is implemented, which includes a CNN for feature extraction, an encoder with MEA mechanism and a decoder for predicting text codes. Orientational positional encoding is added to feature maps output by the CNN, and a feature vector sequence transformed from the feature maps is used as the input of the encoder. Experimental results show that MEAN has achieved state-of-the-art or competitive performance on seven public English scene text datasets (IIITSk, SVT, IC03, IC13, IC15, SVTP, and CUTE). 
Further experiments have been conducted on a selected subset of the RCTW Chinese scene text dataset, demonstrating that MEAN can handle horizontal, vertical, and irregular scene text samples.\",\"PeriodicalId\":6783,\"journal\":{\"name\":\"2020 25th International Conference on Pattern Recognition (ICPR)\",\"volume\":\"24 1\",\"pages\":\"1-8\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-01-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 25th International Conference on Pattern Recognition (ICPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPR48806.2021.9413166\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 25th International Conference on Pattern Recognition (ICPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPR48806.2021.9413166","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MEAN: Multi-Element Attention Network for Scene Text Recognition
Scene text recognition is a challenging problem due to the wide variance in content, style, orientation, and image quality of text instances in natural scene images. To learn the intrinsic representation of scene texts, a novel multi-element attention (MEA) mechanism is proposed to exploit geometric structures from local to global levels in feature maps extracted from a scene text image. The MEA mechanism is a generalized form of the self-attention technique. The elements in feature maps are taken as the nodes of an undirected graph, and three kinds of adjacency matrices are designed to aggregate information at the local, neighborhood, and global levels before calculating the attention weights. A multi-element attention network (MEAN) is implemented, which includes a CNN for feature extraction, an encoder with the MEA mechanism, and a decoder for predicting text codes. Orientational positional encoding is added to the feature maps output by the CNN, and a feature vector sequence transformed from the feature maps is used as the input of the encoder. Experimental results show that MEAN has achieved state-of-the-art or competitive performance on seven public English scene text datasets (IIIT5K, SVT, IC03, IC13, IC15, SVTP, and CUTE). Further experiments have been conducted on a selected subset of the RCTW Chinese scene text dataset, demonstrating that MEAN can handle horizontal, vertical, and irregular scene text samples.
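The core idea described above, treating feature-map elements as nodes of an undirected graph and restricting self-attention through level-specific adjacency matrices, can be illustrated with a minimal sketch. This is not the authors' implementation: the `local_adjacency` construction (Chebyshev-distance neighborhood), the masking value, and all function names are assumptions made for illustration; the paper's exact adjacency definitions and attention parameterization may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x, adj):
    """x: (N, d) feature vectors; adj: (N, N) binary adjacency (1 = may attend).

    Scaled dot-product self-attention where the adjacency matrix zeroes out
    attention between non-adjacent graph nodes before the softmax.
    """
    d = x.shape[1]
    scores = (x @ x.T) / np.sqrt(d)           # pairwise similarities
    scores = np.where(adj > 0, scores, -1e9)  # mask non-adjacent elements
    return softmax(scores, axis=-1) @ x       # aggregate allowed neighbors

def local_adjacency(h, w, radius=1):
    """Hypothetical 'local-level' adjacency for an h*w feature map: connect
    each element to spatial neighbors within `radius` (Chebyshev distance)."""
    n = h * w
    adj = np.zeros((n, n))
    for i in range(n):
        ri, ci = divmod(i, w)
        for j in range(n):
            rj, cj = divmod(j, w)
            if max(abs(ri - rj), abs(ci - cj)) <= radius:
                adj[i, j] = 1
    return adj

# Example: a 4x4 feature map with 8-dim features. The global level reduces
# to ordinary self-attention (all-ones adjacency); the local level restricts
# attention to a 3x3 spatial window.
h, w, d = 4, 4, 8
x = np.random.default_rng(0).normal(size=(h * w, d))
out_local = masked_self_attention(x, local_adjacency(h, w))
out_global = masked_self_attention(x, np.ones((h * w, h * w)))
```

With a global (all-ones) adjacency this degenerates to plain self-attention, which matches the abstract's claim that MEA is a generalized form of the self-attention technique; the local and neighborhood matrices simply sparsify which elements may exchange information.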