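Text recognition in bilingual machine printed image documents - Challenges and survey: A review on principal and crucial concerns of text extraction in bilingual printed images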

2016 10th International Conference on Intelligent Systems and Control (ISCO) Pub Date : 1900-01-01 DOI:10.1109/ISCO.2016.7727069

Shalini Puri, S. Singh


Citations: 4

Abstract


Text recognition in bilingual machine printed image documents — Challenges and survey: A review on principal and crucial concerns of text extraction in bilingual printed images

In today's digital world, accurate text identification and recognition has become a key area of document image analysis and processing. Textual data is identified and extracted from images that range from simple to complex and that vary in language: monolingual, bilingual, trilingual, or multilingual scripts. This paper focuses on the challenges and complex issues of text recognition in bilingual machine-printed image documents. It identifies the major factors that become bottlenecks for correct and accurate recognition. It then presents a hierarchical structure depicting three Classification Schemes (CS) A, B, and C for bilingual printed image documents, where A, B, and C relate to content form, image mining, and language or script determination, respectively. Some weaknesses in how OCR works are also discussed. To analyze existing algorithms and methods, a survey is presented that focuses on their critical issues and proposed solutions, along with the constraints and errors found during text processing. The survey brings out the shortcomings and limitations of the different methods. The specifications and factors found in these techniques are presented as their characteristics and compared against one another to distinguish them. It is observed that most existing methods are based on classification schemes CS A-A1, C-C1, and C2, and are designed for script identification on 300 dpi grayscale images using an SVM classifier.

