Chamila Liyanage, Thilini Nadungodage, R. Weerasinghe
2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer), published 2015-08-01. DOI: 10.1109/ICTER.2015.7377678 (https://doi.org/10.1109/ICTER.2015.7377678)
Developing a commercial grade Tamil OCR for recognizing font and size independent text
Optical Character Recognition (OCR) for Indic scripts such as Tamil and Sinhala has lagged behind OCR for languages written in the Latin script. Several past attempts to build commercial-grade OCR for these languages failed because the resulting systems did not generalize well. This paper describes a set of training regimes for Tamil, built on the Tesseract engine, that enabled us to develop a robust Tamil OCR system. We describe our training regime in detail; it yields a performance improvement of 12.5% over the default Tamil module shipped with Tesseract on a set of ancient Tamil documents drawn from a real project to digitize important Tamil manuscripts of Sri Lanka.
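The abstract does not state which metric underlies the 12.5% figure, or whether it is an absolute or a relative gain. As a minimal sketch only, assuming character-level accuracy derived from edit distance (a common choice for OCR evaluation, not confirmed as the authors' method), two models could be compared against a ground-truth transcription as follows. The function names `edit_distance`, `char_accuracy`, and `improvement_points` are hypothetical, and the comparison counts Unicode code points, so a Tamil consonant and its vowel sign count as separate characters.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def improvement_points(ref: str, baseline_out: str, trained_out: str) -> float:
    """Absolute accuracy gain, in percentage points, of trained over baseline."""
    return 100.0 * (char_accuracy(ref, trained_out) - char_accuracy(ref, baseline_out))
```

Here `ref` would be a manually verified transcription of a page, `baseline_out` the text produced by the default Tamil module, and `trained_out` the text produced by the retrained model. Averaging the result over a document set gives one number comparable in spirit to the reported improvement, under the assumptions stated above.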