Page-level script identification from multi-script handwritten documents
P. Singh, S. Dalal, R. Sarkar, M. Nasipuri
Proceedings of the 2015 Third International Conference on Computer, Communication, Control and Information Technology (C3IT)
Published: 2015-03-16
DOI: 10.1109/C3IT.2015.7060113 (https://doi.org/10.1109/C3IT.2015.7060113)
Citations: 21
Abstract
Script identification has long been a precursor to many Optical Character Recognition (OCR) processes in multi-lingual document environments. It has numerous applications in document image analysis, such as document sorting, indexing, retrieval, and translation. In this paper, we develop a page-level script identification technique for handwritten documents using texture features extracted from the document pages with the Gray Level Co-occurrence Matrix (GLCM). The proposed technique has been evaluated on four scripts, namely Bangla, Devnagari, Telugu, and Roman, using multiple classifiers. Judged by identification accuracy, the Multi Layer Perceptron (MLP) classifier performs best. The experimental results demonstrate the effectiveness of GLCM features in identifying handwritten scripts. Experiments are conducted on a total of 120 document pages, and the overall accuracy of the system is found to be 91.48%. Although the system is evaluated on a limited dataset, the result can be considered satisfactory given the complexity of the scripts.
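The abstract does not say which GLCM statistics, pixel offsets, or gray-level quantization the authors used, so the following is only a minimal sketch of how GLCM texture features can be computed for a page image. The function names (`glcm`, `texture_features`, `page_feature_vector`), the four distance-1 offsets, and the three statistics (contrast, energy, homogeneity) are all assumptions chosen for illustration, not the paper's configuration. The image is assumed to be already quantized to integer gray levels in `range(levels)`.

```python
from typing import Dict, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # rows of gray levels, each in range(levels)

# Assumed offsets (dx, dy): distance-1 neighbours at 0, 45, 90 and 135 degrees.
OFFSETS: List[Tuple[int, int]] = [(1, 0), (1, -1), (0, -1), (-1, -1)]


def glcm(image: Image, dx: int, dy: int, levels: int) -> List[List[float]]:
    """Return the normalized, symmetric co-occurrence matrix for one offset."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                i, j = image[y][x], image[ny][nx]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pair in both directions
    total = sum(sum(row) for row in counts)
    if total == 0:  # image too small for this offset: no pixel pairs
        return [[0.0] * levels for _ in range(levels)]
    return [[c / total for c in row] for row in counts]


def texture_features(p: List[List[float]]) -> Dict[str, float]:
    """Compute three common GLCM statistics from a normalized matrix."""
    n = len(p)
    cells = [(i, j) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(p[i][j] * (i - j) ** 2 for i, j in cells),
        "energy": sum(p[i][j] ** 2 for i, j in cells),
        "homogeneity": sum(p[i][j] / (1 + (i - j) ** 2) for i, j in cells),
    }


def page_feature_vector(image: Image, levels: int) -> List[float]:
    """Concatenate the statistics over all assumed offsets into one vector."""
    vector: List[float] = []
    for dx, dy in OFFSETS:
        stats = texture_features(glcm(image, dx, dy, levels))
        vector.extend([stats["contrast"], stats["energy"], stats["homogeneity"]])
    return vector


if __name__ == "__main__":
    # Toy 4x4 "page" with vertical stripes, quantized to two gray levels.
    stripes = [[0, 1, 0, 1] for _ in range(4)]
    print(page_feature_vector(stripes, levels=2))
```

In a pipeline like the one the abstract describes, one such vector per page would be passed, together with its script label, to a classifier such as an MLP. That classification step is omitted here because the abstract gives no details of the network or its training.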