Weakly Supervised Text Attention Network for Generating Text Proposals in Scene Images

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Pub Date: 2017-11-01, DOI: 10.1109/ICDAR.2017.61

Li Rong, En MengYi, Liang Jianqiang, Zhang Haibin


Citations: 4

Abstract

Detection and recognition of textual information in scene images are useful but challenging tasks. Numerous methods have been proposed to solve the problem. Recently, the best results have been attained by methods based on deep neural networks. Training such networks requires large amounts of data annotated at the bounding-box or pixel level, and producing that data takes a great deal of labor, which is expensive and time-consuming. In this paper we explore the use of a weakly supervised deep neural network for generating text proposals in natural scene images. The network accepts multi-scale inputs and is trained to perform whole-image binary classification, that is, to tell whether an image contains text or not. After training, the network has learned powerful discriminative features that are capable of distinguishing text from other objects. To locate the text, a text confidence score map is generated from the feature maps of the top two convolutional layers by extracting a class activation map. The value of each pixel in the score map denotes the confidence that the pixel belongs to text. Applying a threshold converts the score map into a binary mask map, whose foreground regions are probable text areas. Maximally Stable Extremal Regions (MSERs) are then extracted from these probable text areas and aggregated into groups. Processing these groups yields the text proposals. Experimental results show that, without using any bounding-box or pixel-level annotation, the algorithm achieves a recall rate comparable to that of some fully supervised methods on the ICDAR 2013 focused text dataset and the ICDAR 2015 incidental text dataset.
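The score-map and mask steps described in the abstract can be illustrated with a minimal sketch. It assumes the usual class activation map formulation (a weighted sum of feature maps using the classifier weights for the "text" class), min-max normalization, and a fixed threshold. The abstract does not give the exact way the top two convolutional layers are combined, the normalization, or the threshold value, so the function names, the toy data, and the value 0.5 below are all hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def class_activation_map(feature_maps: List[Map2D], class_weights: List[float]) -> Map2D:
    """Weighted sum of feature maps: cam[y][x] = sum_k w_k * f_k[y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalize(score_map: Map2D) -> Map2D:
    """Min-max rescale to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in score_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in score_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in score_map]


def binary_mask(score_map: Map2D, threshold: float) -> List[List[int]]:
    """Mark pixels whose confidence reaches the threshold as foreground (1)."""
    return [[1 if v >= threshold else 0 for v in row] for row in score_map]


# Toy input: two 2x2 feature maps and hypothetical "text" class weights.
feature_maps = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
class_weights = [1.0, 2.0]

cam = class_activation_map(feature_maps, class_weights)
scores = normalize(cam)
mask = binary_mask(scores, threshold=0.5)  # threshold value is an assumption
print(mask)
```

In a full pipeline the score map would be upsampled to the input image size before thresholding, and the foreground of the mask would then restrict where MSERs are extracted; neither step is shown here.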

