Problems and Approaches for Oriental Document Analysis

J. H. Kim
IEEE International Conference on Document Analysis and Recognition, 1997-08-18
DOI: 10.1109/ICDAR.1997.10004 (https://doi.org/10.1109/ICDAR.1997.10004)
Abstract

Machine understanding of hand-filled documents in China, Japan and Korea requires not only general document analysis techniques but also the ability to handle the peculiarities of the Oriental languages. As expected, handwritten Chinese character recognition is the major task. In addition, Japanese Kana, Korean Hangul, the Roman alphabet and numerals are targets of recognition. The main difficulties of Oriental character recognition originate from the large character sets. A practical system should be able to handle at least 5,000 classes out of possibly more than 50,000. For Hangul, 11,172 classes can be formed in theory. The difficulty also depends closely on writing style. Oriental script is generally classified into regular, fluent and cursive styles. Cursive style is the most severely deformed and, therefore, the most difficult to recognize. Regular style writing is often handled successfully by feature matching and statistical analysis, while fluent style is under active investigation using stroke analysis and structural matching. Cursive style recognition is seldom found even in research papers. Since Chinese and Hangul characters are intrinsically hierarchical, hierarchical analysis has often been applied. A Hangul character, which corresponds to a syllable, is formed by arranging 2 to 5 basic graphemes, drawn from 24 classes, in two dimensions. Recognizing component graphemes is, we believe, the viable approach to handling the large set of Hangul classes. Therefore, segmentation into graphemes, which is another difficult task, is the key to hierarchical recognition. For robust recognition of fluent to cursive styles, the following research directions are suggested.
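
The hierarchical structure of Hangul described in the abstract can be illustrated with the way Unicode lays out precomposed Hangul syllables. The sketch below is not the paper's method. It only shows that a syllable class factors into a small number of component positions, which is why recognizing components is attractive compared with recognizing every syllable class directly. It assumes the Unicode convention of 19 initial, 21 medial and 28 final slots (27 final consonants plus "no final"), whose product is the commonly cited count of 11,172 modern syllables. Unicode treats compound letters as single units, so it yields 2 or 3 components per syllable rather than the 2 to 5 basic graphemes from 24 classes that the abstract counts. The function and constant names are illustrative.

```python
# Illustrative sketch (assumption: Unicode's arithmetic layout of
# precomposed Hangul syllables, U+AC00 onward; not the paper's algorithm).

HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable
NUM_INITIAL = 19       # initial consonant slots
NUM_MEDIAL = 21        # vowel slots
NUM_FINAL = 28         # 27 final consonants plus "no final consonant"

# Number of syllable classes that can be formed from these slots.
SYLLABLE_COUNT = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (initial, medial, final) slot indices for one Hangul syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(index, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def compose(initial: int, medial: int, final: int = 0) -> str:
    """Inverse of decompose: build the syllable from its slot indices."""
    if not (0 <= initial < NUM_INITIAL
            and 0 <= medial < NUM_MEDIAL
            and 0 <= final < NUM_FINAL):
        raise ValueError("slot index out of range")
    index = (initial * NUM_MEDIAL + medial) * NUM_FINAL + final
    return chr(HANGUL_BASE + index)


if __name__ == "__main__":
    print(SYLLABLE_COUNT)
    print(decompose("한"))
    print(compose(*decompose("한")))
```

A classifier built along these lines would need to distinguish at most 19, 21 and 28 classes per position instead of one class per syllable, provided the components can be segmented reliably, which is the difficulty the abstract points to.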