Problems and Approaches for Oriental Document Analysis

IEEE International Conference on Document Analysis and Recognition. Pub Date: 1997-08-18. DOI: 10.1109/ICDAR.1997.10004

J. H. Kim



Abstract

Machine understanding of hand-filled documents in China, Japan and Korea requires not only general document analysis solutions but also the ability to handle the peculiarities of the Oriental languages. As expected, handwritten Chinese character recognition is the major task. In addition, Japanese Kana, Korean Hangul, the Roman alphabet and numerals are targets of recognition. The main difficulties of Oriental character recognition originate from the large character sets. A practical system should be able to handle at least 5000 classes out of possibly more than 50000. For Hangul, 11720 classes can be formed in theory. The difficulty also depends closely on writing style. Oriental script is generally classified into regular, fluent and cursive styles. Cursive style is the most heavily deformed and therefore the most difficult to recognize. Regular style writing is often handled successfully by feature matching and statistical analysis, while fluent style is now under active investigation using stroke analysis and structural matching. Cursive style recognition is seldom found even in research papers. Since Chinese and Hangul characters are intrinsically hierarchical, hierarchical analysis has often been applied. A Hangul character, which corresponds to a syllable, is formed from 2 to 5 basic graphemes, drawn from 24 classes and arranged in two dimensions. Recognizing the component graphemes is, we believe, the viable approach to handling the large set of Hangul classes. Therefore, segmentation into graphemes, which is another difficult task, is the key to hierarchical recognition. For robust recognition of fluent to cursive style, the following research directions are suggested.
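The hierarchical structure of Hangul described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it uses the Unicode precomposed-syllable layout (19 initial, 21 medial and 28 final positions, the last including "no final"), which groups compound graphemes differently from the abstract's 24 basic grapheme classes. Under this Unicode scheme the syllable count is 19 × 21 × 28 = 11172, which differs from the figure of 11720 given in the abstract. The function and constant names are illustrative assumptions.

```python
# Unicode Hangul syllable block: U+AC00 .. U+D7A3 (assumed scheme, not the paper's).
HANGUL_BASE = 0xAC00
NUM_INITIAL, NUM_MEDIAL, NUM_FINAL = 19, 21, 28  # final index 0 means "no final"
NUM_SYLLABLES = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL


def decompose_syllable(ch: str) -> tuple[int, int, int]:
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < NUM_SYLLABLES:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(index, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def compose_syllable(initial: int, medial: int, final: int = 0) -> str:
    """Inverse of decompose_syllable: rebuild the syllable from component indices."""
    return chr(HANGUL_BASE + (initial * NUM_MEDIAL + medial) * NUM_FINAL + final)


if __name__ == "__main__":
    print(NUM_SYLLABLES)
    print(decompose_syllable("한"))
    print(compose_syllable(*decompose_syllable("한")))
```

The point of the sketch is the size of the output space: a recognizer that predicts three component labels chooses among 19 + 21 + 28 = 68 component classes in total, rather than among more than eleven thousand whole-syllable classes, provided the components can be segmented first.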
