Few-shot learning for word-level scene text script identification
Authors: Veronica Naosekpam, Nilkanta Sahu
Journal: Computational Intelligence (impact factor 1.8; JCR Q3, Computer Science, Artificial Intelligence)
Publication date: 2023-11-21
DOI: 10.1111/coin.12612
URL: https://onlinelibrary.wiley.com/doi/10.1111/coin.12612
Citation count: 0
Abstract
Script identification of text in scene images has attracted considerable attention recently. However, existing techniques focus primarily on scripts for which data are abundant, such as English, European, or East Asian scripts. Although these methods are robust on high-resource data, how well they work on low-resource scripts has yet to be established. In India, for example, there is a disparity among the text scripts used across the country's population. To bridge the research gap in resource-constrained script identification, we present a few-shot learning network called TextScriptFSLNet. This network does not require large amounts of training data, yet achieves state-of-the-art performance on benchmark datasets. Our proposed method follows a $C$-way $K$-shot paradigm, splitting the training set into a support set and a query set. The support set is used to learn representative knowledge of each class and to create its prototype. We use a 2-layer convolutional neural network fused with multi-kernel spatial attention, together with an averaging technique, to generate the prototype of each class. The spatial attention helps capture the important information in an image and enriches the feature representation. To the best of our knowledge, the proposed work is the first of its kind in the scene text understanding domain. Additionally, we created a dataset called Indic-FSL2023 comprising 10 of the 22 officially recognized Indian scripts. The proposed method achieves the highest accuracy among the tested methods on the newly created Indic-FSL2023. Experiments are also conducted on MLe2e to demonstrate the method's versatility. Furthermore, we show how the proposed model performs under illumination changes and blur in scene text script images.
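The $C$-way $K$-shot procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' TextScriptFSLNet: the attention-fused CNN is replaced here by hand-written 2-D feature vectors, the class names (`script_a`, `script_b`, `script_c`) and all function names are hypothetical, and the nearest-prototype rule with squared Euclidean distance is an assumption, since the abstract states only that prototypes are obtained by averaging.

```python
import random
from collections import defaultdict


def sample_episode(labels, c_way, k_shot, n_query, rng):
    """Pick C classes, then K support and n_query query indices per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = rng.sample(sorted(by_class), c_way)
    support, query = [], []
    for cls in classes:
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, cls) for i in picked[:k_shot]]
        query += [(i, cls) for i in picked[k_shot:]]
    return support, query


def class_prototypes(features, support):
    """Average the support feature vectors of each class into one prototype."""
    grouped = defaultdict(list)
    for index, cls in support:
        grouped[cls].append(features[index])
    return {
        cls: [sum(column) / len(vectors) for column in zip(*vectors)]
        for cls, vectors in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance (assumed metric, not stated in the abstract)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(feature, prototypes):
    """Assign the class whose prototype is nearest to the query feature."""
    return min(prototypes, key=lambda cls: squared_distance(feature, prototypes[cls]))


# Toy stand-in for CNN embeddings: three well-separated hypothetical classes.
features = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],      # script_a
    [10.0, 0.0], [10.2, 0.1], [9.9, 0.3], [10.1, 0.2],   # script_b
    [0.0, 10.0], [0.2, 10.1], [0.1, 9.8], [0.3, 10.2],   # script_c
]
labels = ["script_a"] * 4 + ["script_b"] * 4 + ["script_c"] * 4

rng = random.Random(0)
support, query = sample_episode(labels, c_way=3, k_shot=2, n_query=2, rng=rng)
prototypes = class_prototypes(features, support)
correct = sum(classify(features[i], prototypes) == cls for i, cls in query)
accuracy = correct / len(query)
print(f"3-way 2-shot episode accuracy: {accuracy:.2f}")
```

In a trained few-shot model the feature vectors would come from the embedding network, and the episode would be resampled many times during training. The toy clusters here are deliberately far apart so that the averaging and nearest-prototype steps are easy to follow.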
About the journal:
This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.