Yeon-Gyu Kim, Hyunsung Kim, Minseok Kang, Hyug-Jae Lee, Rokkyu Lee, Gunhan Park
2021 IEEE International Conference on Image Processing (ICIP), published 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506779
Analysis of the Novel Transformer Module Combination for Scene Text Recognition
Various methods for scene text recognition (STR) are proposed every year. These methods have dramatically improved performance in STR; however, the field has not kept pace with general-purpose research in image recognition, detection, speech recognition, and text analysis. In this paper, we evaluate the performance of several deep learning schemes for the encoder part of the Transformer in STR. First, we replace the baseline feed-forward network (FFN) module of the encoder with a squeeze-and-excitation (SE)-FFN or a cross-stage-partial (CSP)-FFN. Second, the overall encoder architecture is replaced with a local dense synthesizer attention (LDSA) or Conformer structure. The Conformer encoder achieves the best test accuracy across a variety of experiments, and SE-FFN and CSP-FFN also show competitive performance when the number of parameters is taken into account. Visualizing the attention maps produced by the different encoder combinations allows for a qualitative comparison of their behavior.
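To make the SE-FFN idea concrete, the sketch below shows one plausible way to gate a Transformer FFN block with a squeeze-and-excitation step, in plain NumPy. This is an illustration of the general SE-over-FFN pattern, not the paper's implementation: the placement of the gate after the FFN output, the ReLU activations, and the bottleneck reduction ratio are all assumptions made for the example.

```python
import numpy as np

def se_ffn(x, w1, b1, w2, b2, w_sq, w_ex):
    """Transformer feed-forward block followed by a squeeze-and-excitation gate.

    x: (seq_len, d_model) token features.
    w1/b1, w2/b2: FFN weights (d_model -> d_ff -> d_model).
    w_sq, w_ex: SE bottleneck weights (d_model -> d_model // r -> d_model),
                where r is an assumed reduction ratio.
    """
    # Standard Transformer FFN: linear -> ReLU -> linear.
    h = np.maximum(x @ w1 + b1, 0.0)
    y = h @ w2 + b2                          # (seq_len, d_model)

    # Squeeze: global average over the sequence dimension.
    s = y.mean(axis=0)                       # (d_model,)

    # Excitation: bottleneck MLP with a sigmoid gate per channel.
    g = 1.0 / (1.0 + np.exp(-(np.maximum(s @ w_sq, 0.0) @ w_ex)))

    # Re-scale each channel of the FFN output (broadcast over seq_len).
    return y * g

# Example usage with random weights (shapes only; not trained parameters).
rng = np.random.default_rng(0)
d_model, d_ff, r = 8, 16, 2
x = rng.normal(size=(5, d_model))
out = se_ffn(
    x,
    rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
    rng.normal(size=(d_ff, d_model)), np.zeros(d_model),
    rng.normal(size=(d_model, d_model // r)),
    rng.normal(size=(d_model // r, d_model)),
)
print(out.shape)  # (5, 8): same shape as the input, channels re-weighted
```

The gate `g` lies in (0, 1) per channel, so the SE step can only attenuate or pass through FFN channels; this channel re-weighting is the mechanism the abstract credits with competitive accuracy at a modest parameter cost.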