Recognizing Vietnamese Online Handwritten Separated Characters

2008 International Conference on Advanced Language Processing and Web Information Technology. Pub Date: 2008-07-23. DOI: 10.1109/ALPIT.2008.58

Duy Khuong Nguyen, T. D. Bui

Citations: 10

Abstract

The Vietnamese alphabet is based on the Latin alphabet, with the addition of nine accent marks or diacritics: four of them create additional sounds, and the other five indicate the tone of each word. Because Vietnamese is a tonal language that uses tone to distinguish words, recognizing diacritics is an important part of recognizing Vietnamese words. However, in written form, diacritics are much smaller than the characters, which makes them very hard to recognize. Previous work on Vietnamese character recognition often pre-processes the input with a graph-based approach that tries to separate the main characters from their diacritics by determining connected regions at the pixel level. This approach, however, works well only where the input contains only characters with separable diacritics, for example scanned images of printed documents. In this paper we propose a robust method to recognize online Vietnamese characters with diacritics. Using the cosine transform with appropriate sampling algorithms, we represent multiple strokes of a character together in a single set of features. This set of features is then used as the input to a well-designed machine-learning-based system. We have tested our system on a combination of Vietnamese characters with diacritics and Section 1c (isolated characters) of the Unipen data set, and have obtained very competitive results.
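The abstract does not spell out the sampling scheme or the exact form of the cosine transform, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that strokes are joined in writing order, resampled at equal arc-length steps, normalized to a unit bounding box, and described by the first few unnormalized DCT-II coefficients of the x and y sequences. The function names (`resample`, `dct2`, `stroke_features`) and the parameters `n_points` and `n_coeffs` are hypothetical.

```python
import numpy as np


def resample(points, n):
    """Resample a polyline to n points equally spaced along its arc length."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])  # cumulative arc length
    if s[-1] == 0.0:  # degenerate input (a single dot): repeat the point
        return np.repeat(points[:1], n, axis=0)
    t = np.linspace(0.0, s[-1], n)
    return np.column_stack([np.interp(t, s, points[:, 0]),
                            np.interp(t, s, points[:, 1])])


def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 1/2) * k)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]  # one row per output coefficient
    return np.cos(np.pi / N * (n + 0.5) * k) @ x


def stroke_features(strokes, n_points=32, n_coeffs=8):
    """Map all strokes of one character to a single fixed-length vector.

    Assumption (not from the paper): strokes are concatenated in writing
    order, so pen-up jumps become ordinary segments of one trajectory.
    """
    pts = np.vstack([np.asarray(s, dtype=float) for s in strokes])
    # Normalize position and size using the bounding box of the whole character.
    pts = pts - pts.min(axis=0)
    extent = pts.max()
    if extent > 0:
        pts = pts / extent
    pts = resample(pts, n_points)
    # Low-order coefficients capture the coarse shape of the trajectory.
    return np.concatenate([dct2(pts[:, 0])[:n_coeffs],
                           dct2(pts[:, 1])[:n_coeffs]])


# Hypothetical example: a square-ish base letter plus a short accent stroke.
base = [(0, 0), (1, 0), (1, 1), (0, 1)]
accent = [(0.4, 1.3), (0.6, 1.5)]
plain = stroke_features([base])
accented = stroke_features([base, accent])
```

Because the diacritic stroke is part of the same resampled trajectory, it changes the feature vector of the whole character instead of having to be segmented out first. A classifier of any kind could then be trained on these fixed-length vectors.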
