{"title":"从手写文档图像中识别印度文字——一种不受约束的块级方法","authors":"S. Obaidullah, N. Das, C. Halder, K. Roy","doi":"10.1109/ReTIS.2015.7232880","DOIUrl":null,"url":null,"abstract":"In a multi-script country like India, prior identification of script from document images is an essential step before choosing appropriate script specific OCR. The problem becomes more complex and challenging in case of HSI (Handwritten Script Identification). An automatic HSI technique for document images of six popular Indic scripts namely Bangla, Devanagari, Malayalam, Oriya, Roman and Urdu is proposed in this paper. A Block-level approach is followed for the same and initially 34-dimensional feature vector is constructed applying transform based (BRT, BDCT, BFFT and BDT), textural and statistical techniques. Finally using a GAS (Greedy Attribute Selection) method 20 attributes are selected for learning process. Total 600 unconstrained document image blocks of size 512×512 each, are prepared with equal distribution of each script type. The whole dataset is divided into 2:1 ratio for training and testing. Extensive experimentation is carried out for Six-scripts, Tetra-scripts, Tri-scripts and Bi-scripts combinations. Experimental result shows promising and comparable performance.","PeriodicalId":161306,"journal":{"name":"2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"Indic script identification from handwritten document images — An unconstrained block-level approach\",\"authors\":\"S. Obaidullah, N. Das, C. Halder, K. Roy\",\"doi\":\"10.1109/ReTIS.2015.7232880\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In a multi-script country like India, prior identification of script from document images is an essential step before choosing appropriate script specific OCR. The problem becomes more complex and challenging in case of HSI (Handwritten Script Identification). An automatic HSI technique for document images of six popular Indic scripts namely Bangla, Devanagari, Malayalam, Oriya, Roman and Urdu is proposed in this paper. A Block-level approach is followed for the same and initially 34-dimensional feature vector is constructed applying transform based (BRT, BDCT, BFFT and BDT), textural and statistical techniques. Finally using a GAS (Greedy Attribute Selection) method 20 attributes are selected for learning process. Total 600 unconstrained document image blocks of size 512×512 each, are prepared with equal distribution of each script type. The whole dataset is divided into 2:1 ratio for training and testing. Extensive experimentation is carried out for Six-scripts, Tetra-scripts, Tri-scripts and Bi-scripts combinations. Experimental result shows promising and comparable performance.\",\"PeriodicalId\":161306,\"journal\":{\"name\":\"2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS)\",\"volume\":\"71 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-07-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ReTIS.2015.7232880\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ReTIS.2015.7232880","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Indic script identification from handwritten document images — An unconstrained block-level approach
In a multi-script country like India, prior identification of script from document images is an essential step before choosing appropriate script specific OCR. The problem becomes more complex and challenging in case of HSI (Handwritten Script Identification). An automatic HSI technique for document images of six popular Indic scripts namely Bangla, Devanagari, Malayalam, Oriya, Roman and Urdu is proposed in this paper. A Block-level approach is followed for the same and initially 34-dimensional feature vector is constructed applying transform based (BRT, BDCT, BFFT and BDT), textural and statistical techniques. Finally using a GAS (Greedy Attribute Selection) method 20 attributes are selected for learning process. Total 600 unconstrained document image blocks of size 512×512 each, are prepared with equal distribution of each script type. The whole dataset is divided into 2:1 ratio for training and testing. Extensive experimentation is carried out for Six-scripts, Tetra-scripts, Tri-scripts and Bi-scripts combinations. Experimental result shows promising and comparable performance.