Creation of data resources and design of an evaluation test bed for Devanagari script recognition

Proceedings. Seventeenth Workshop on Parallel and Distributed Simulation. Pub Date: 2003-03-10. DOI: 10.1109/RIDE.2003.1249846

S. Setlur, Suryaprakash Kompalli, V. Ramanaprasad, V. Govindaraju


Citations: 19

Abstract

The Indian subcontinent has a large number of languages, dialects, and scripts, with Devanagari being the primary and most widely used script. To date, much of the Devanagari optical character recognition (OCR) research has been restricted to a handful of groups. As a result, techniques have not yet been widely disseminated or independently evaluated, and automated evaluation tools are not currently available for lack of a standard representation of ground-truth and result data. A key reason for the absence of sustained research efforts in off-line Devanagari OCR appears to be the paucity of data resources. Ground-truthed data for words and characters, on-line dictionaries, corpora of text documents, and reliable, standardized statistical analyses and evaluation tools are currently lacking. The creation of such data resources will therefore provide a much-needed fillip to researchers working on Devanagari OCR. This paper describes a National Science Foundation-sponsored project under the International Digital Libraries program to create data resources that will facilitate the development of Devanagari OCR technology and provide a standardized test bed and evaluation tools for Devanagari script recognition.


