Maninder Singh Nehra, N. Nain, Mushtaq Ahmed, P. Choudhary, Deepa Modi
Amalgamated Approach for Devanagari Script Corpus for OCR & Demographic Purpose and XML for Linguistic Annotation
2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), published 2017-12-01
DOI: 10.1109/SITIS.2017.50 (https://doi.org/10.1109/SITIS.2017.50)
Citations: 1
Abstract
In this paper, we present the compilation of a Hindi handwritten text image corpus and its linguistic perspective in the fields of OCR and information retrieval from handwritten documents. Entering a single character in Devanagari script is somewhat complicated: because of modifiers, one character often requires a combination of several components. A mixed approach is proposed and demonstrated for building a Hindi corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would simplify the manual data-processing tasks currently involved in data collection, such as input forms for AADHAR, driving licenses, and railway reservations. This would increase the participation of the Hindi language community in understanding and benefiting from government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata for benchmarking, and we use Zipf's law to analyze the distribution and behavior of words in the corpus.
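The abstract does not describe how the Zipf's law analysis was carried out. As a minimal sketch of one common way to check it, the Python below ranks words by frequency and fits a least-squares line to log(count) against log(rank); under Zipf's law the slope is expected to be close to -1 for a large corpus. The function names `rank_frequency` and `zipf_slope` and the toy sentence are assumptions made for illustration, not the authors' code or data.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties by the word itself for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_slope(table):
    """Least-squares slope of log(count) versus log(rank)."""
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy transcription, whitespace-tokenized; not taken from the corpus.
sample = "यह एक वाक्य है और यह एक और वाक्य है यह".split()
table = rank_frequency(sample)
slope = zipf_slope(table)

for rank, word, count in table:
    print(rank, word, count)
print("fitted slope:", round(slope, 3))
```

On a sample this small the fitted slope only shows the mechanics; a meaningful comparison with Zipf's law needs the full digital transcription of the corpus.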