Improving Projection Profile for Segmenting Characters from Javanese Manuscripts

Aditya W. Mahastama, Lucia D. Krisnawati
Proceedings of the 1st International Conference on Intermedia Arts and Creative Technology
DOI: 10.5220/0008526900770082 (https://doi.org/10.5220/0008526900770082)
Citations: 3

Abstract

The inclusion of non-Latin scripts in the Unicode character set has opened up the possibility of applying Optical Character Recognition (OCR) to manuscripts written in non-alphabetic scripts. Javanese is one of the Southeast Asian languages with a vast collection of manuscripts. Unfortunately, these manuscripts are prone to damage due to lack of maintenance, so digitising them through OCR has become the most obvious option. This research focuses on the segmentation stage of our OCR project, which implements Projection-Profile Cutting (PPC). The rationale is that PPC is well known for its low computational cost. As the object of segmentation, we sampled 72 scanned pages from Serat Mangkunegara IV, Wulang Maca, and Kitab Rum. Our preliminary evaluation showed that PPC on its own yields unsatisfactory results. Hence, we refined it by applying a statistical analysis to separate lines of characters that lie too close together. The proposed algorithm produces 19,112 segments. We evaluated the system output at two levels: line segmentation and character segmentation. The refinement of PPC increased line segmentation accuracy by 32.84%. To evaluate character segmentation, we collaborated with the Javanese Wikipedia Community, which verified the segments manually in 4 batches. Only 15,386 segments were verified, of which 73.59% (11,322) are correctly segmented, 22.5% (3,464) are over-segmented, 1.3% (206) are under-segmented, and the rest were not labelled as any of the three.
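To make the idea of projection-profile cutting concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a binarised page stored as a NumPy array in which ink pixels are 1. The helper names (projection-based `cut_runs`, `split_tall_bands`), the factor `k`, and the toy page are all hypothetical. In particular, the abstract does not specify the statistical analysis used for the refinement, so the median-height rule below is only an illustrative stand-in for splitting lines that lie too close together.

```python
import numpy as np


def cut_runs(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, of runs where profile > threshold."""
    mask = profile > threshold
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]


def split_tall_bands(profile, bands, k=1.5):
    """Hypothetical refinement (an assumption, not the paper's method):
    a band taller than k times the median band height is treated as two
    merged lines and cut at its weakest row in the middle half."""
    if not bands:
        return []
    limit = k * np.median([e - s for s, e in bands])
    out = []
    for s, e in bands:
        height = e - s
        if height > limit and height >= 4:
            lo = s + height // 4
            hi = e - height // 4
            cut = lo + int(np.argmin(profile[lo:hi]))
            out.extend([(s, cut), (cut, e)])
        else:
            out.append((s, e))
    return out


def segment_lines(binary):
    """Plain PPC on the row profile, followed by the illustrative refinement."""
    row_profile = binary.sum(axis=1)  # ink pixels per row
    return split_tall_bands(row_profile, cut_runs(row_profile))


def segment_characters(binary, band):
    """Plain PPC on the column profile of one text line."""
    s, e = band
    return cut_runs(binary[s:e].sum(axis=0))  # ink pixels per column


# Toy page: two well separated lines, then two lines that touch at row 12.
page = np.zeros((17, 10), dtype=int)
page[1:4, 1:4] = 1    # line 1, first character
page[1:4, 5:9] = 1    # line 1, second character
page[5:8, 1:9] = 1    # line 2
page[9:16, 1:9] = 1   # lines 3 and 4 merged into one band
page[12, :] = 0
page[12, 4] = 1       # a single touching pixel keeps the band connected

lines = segment_lines(page)
chars = segment_characters(page, lines[0])
```

On this toy page, plain PPC returns three bands because the touching pixel leaves no empty row between the last two lines; the illustrative refinement then cuts the oversized band at its weakest row. Each profile is a single pass over the image, which is consistent with the low computational cost the abstract attributes to PPC.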