Universal Text Scanner Solution
Narayana Darapaneni, Arjun Makar, Sumathi Gunasekaran, Trapti Kalra, Bhanu Jain, Anwesh Reddy Padur, Divakar Joshi, M. Jain
2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), published 2020-11-26
DOI: 10.1109/ICIIS51140.2020.9342646 (https://doi.org/10.1109/ICIIS51140.2020.9342646)
Citations: 1
Abstract
Receipt data extraction and digitization remains difficult because receipts vary widely: they are often crumpled or soiled, and the scanned images are generally of low quality. The major problems industry faces in this domain are: (i) the lack of generalization in standard OCR solutions and in custom pipelines built on open-source APIs such as Tesseract; (ii) the high cost, yet low accuracy, of commercially available solutions; and (iii) the requirement for organizations to supply large volumes of hand-annotated images for training. In this paper we explain a strategy to overcome these limitations and to build a holistic pipeline for text detection and extraction that is deployable in the real world. We survey traditional methods as well as recent CNN-based architectures, and then explain how the novel Connectionist Text Proposal Network (CTPN) architecture can be applied to the specific task of text detection in scanned, text-heavy images. We also compare the CTPN results against those of a trained state-of-the-art SSD on a sample dataset; the comparison indicates that CTPN is the more suitable algorithm for this use case.
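The abstract does not say which metric was used to compare CTPN with SSD. As a rough illustration only, the sketch below shows one common way to score a text detector against hand-annotated boxes: match predicted boxes to ground-truth boxes by intersection over union (IoU) and report precision and recall. The box format, the 0.5 threshold, the greedy matching, and the names `iou` and `precision_recall` are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box.

    A prediction counts as a true positive when its best IoU reaches the
    (assumed) threshold; each ground-truth box can be matched only once.
    """
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical data: two annotated text lines on a receipt.
truth = [(0, 0, 100, 20), (0, 30, 100, 50)]
line_level = [(0, 0, 100, 20), (0, 30, 100, 50)]  # one box per text line
merged = [(0, 0, 100, 50)]                        # both lines in one box

print(precision_recall(line_level, truth))
print(precision_recall(merged, truth))
```

In this made-up example, the detector that merges two text lines into one box scores poorly even though it covers all of the text, which is the kind of difference a line-level comparison on text-heavy images is meant to expose.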