Yeon-Gyu Kim, Hyunsung Kim, Minseok Kang, Hyug-Jae Lee, Rokkyu Lee, Gunhan Park
2021 IEEE International Conference on Image Processing (ICIP), published 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506779
Analysis of the Novel Transformer Module Combination for Scene Text Recognition
Various methods for scene text recognition (STR) are proposed every year. These methods have dramatically improved performance in STR; however, the field has not kept pace with general-purpose research in image recognition, detection, speech recognition, and text analysis. In this paper, we evaluate the performance of several deep learning schemes for the encoder part of the Transformer in STR. First, we replace the baseline feed-forward network (FFN) module of the encoder with a squeeze-and-excitation (SE)-FFN or a cross-stage-partial (CSP)-FFN. Second, the overall encoder architecture is replaced with a local dense synthesizer attention (LDSA) or Conformer structure. The Conformer encoder achieves the best test accuracy across a variety of experiments, and SE-FFN and CSP-FFN also show competitive performance when the number of parameters is taken into account. Visualizing the attention maps produced by the different encoder combinations allows for a qualitative comparison of their behavior.
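To make the SE-FFN idea concrete, the sketch below shows one plausible way to gate a Transformer FFN block with a squeeze-and-excitation step, in plain NumPy. This is an illustration of the general SE-over-FFN pattern, not the paper's implementation: the placement of the gate after the FFN output, the ReLU activations, and the bottleneck reduction ratio are all assumptions made for the example.

```python
import numpy as np

def se_ffn(x, w1, b1, w2, b2, w_sq, w_ex):
    """Transformer feed-forward block followed by a squeeze-and-excitation gate.

    x: (seq_len, d_model) token features.
    w1/b1, w2/b2: FFN weights (d_model -> d_ff -> d_model).
    w_sq, w_ex: SE bottleneck weights (d_model -> d_model // r -> d_model),
                where r is an assumed reduction ratio.
    """
    # Standard Transformer FFN: linear -> ReLU -> linear.
    h = np.maximum(x @ w1 + b1, 0.0)
    y = h @ w2 + b2                          # (seq_len, d_model)

    # Squeeze: global average over the sequence dimension.
    s = y.mean(axis=0)                       # (d_model,)

    # Excitation: bottleneck MLP with a sigmoid gate per channel.
    g = 1.0 / (1.0 + np.exp(-(np.maximum(s @ w_sq, 0.0) @ w_ex)))

    # Re-scale each channel of the FFN output (broadcast over seq_len).
    return y * g

# Example usage with random weights (shapes only; not trained parameters).
rng = np.random.default_rng(0)
d_model, d_ff, r = 8, 16, 2
x = rng.normal(size=(5, d_model))
out = se_ffn(
    x,
    rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
    rng.normal(size=(d_ff, d_model)), np.zeros(d_model),
    rng.normal(size=(d_model, d_model // r)),
    rng.normal(size=(d_model // r, d_model)),
)
print(out.shape)  # (5, 8): same shape as the input, channels re-weighted
```

The gate `g` lies in (0, 1) per channel, so the SE step can only attenuate or pass through FFN channels; this channel re-weighting is the mechanism the abstract credits with competitive accuracy at a modest parameter cost.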