An annotated Urdu corpus of handwritten text image and benchmarking of corpus

2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) Pub Date : 2014-05-26 DOI:10.1109/MIPRO.2014.6859743

P. Choudhary, N. Nain


Citations: 6

Abstract

Linguistic research on a language requires a large corpus that captures the features of that language, such as grammatical information, writing style, and syntax. A corpus provides a platform for investigating a natural language. Compared with other languages, very little research has been done on Urdu because of its segmentation difficulties and complex character shapes. Very little editable printed text is available in Urdu; most of the data exists only in graphical or image form. To increase Natural Language Processing research on Urdu, there is a need for a large database that covers a range of variation in annotated Urdu handwritten as well as printed text. In this work we propose a large database of Urdu text comprising 1000 handwritten text images written by 500 different writers. Each image would contain four to six lines of Urdu text with 60-80 words per line, giving an estimated total of around 0.35 million words. Words would be selected from six different categories so that the maximum number of distinct words can be included. The corpus would be annotated for both line and word segmentation, where a word may be an individual character or component. The corpus would serve as a benchmark for the quantitative analysis of handwritten text recognition techniques for Urdu, such as text line extraction, word segmentation, and character recognition, and for linguistic research on part of speech, writer identification, dictionaries, etc.
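The word-count estimate in the abstract can be checked with a short calculation. The sketch below is not from the paper: the function name and the choice of the range midpoints (five lines per image, 70 words per line) are assumptions made here to reproduce the quoted figure of about 0.35 million words.

```python
# Rough check of the corpus size quoted in the abstract.
# The figures (1000 images, 4-6 lines, 60-80 words per line) come from the
# abstract; taking the midpoint of each range is an assumption of this sketch.

def estimate_word_count(images: int, lines_per_image: int, words_per_line: int) -> int:
    """Return the total word count for a corpus of uniformly sized images."""
    return images * lines_per_image * words_per_line

IMAGES = 1000
LINES_PER_IMAGE = (4, 6)    # stated range of lines per image
WORDS_PER_LINE = (60, 80)   # stated range of words per line

# Lower and upper bounds implied by the stated ranges.
low = estimate_word_count(IMAGES, LINES_PER_IMAGE[0], WORDS_PER_LINE[0])
high = estimate_word_count(IMAGES, LINES_PER_IMAGE[1], WORDS_PER_LINE[1])

# Midpoint estimate (assumed here): 5 lines per image, 70 words per line.
mid = estimate_word_count(IMAGES, sum(LINES_PER_IMAGE) // 2, sum(WORDS_PER_LINE) // 2)

print(f"lower bound: {low}")
print(f"upper bound: {high}")
print(f"midpoint:    {mid}")
```

Under these assumptions the stated ranges imply between 240,000 and 480,000 words, and the midpoint of 350,000 matches the abstract's estimate.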


