High Performance Chinese/English Mixed OCR with Character Level Language Identification

Kai Wang, Jianming Jin, Qingren Wang
Published in: 2009 10th International Conference on Document Analysis and Recognition, 2009-07-26. DOI: 10.1109/ICDAR.2009.14 (https://doi.org/10.1109/ICDAR.2009.14)
Citations: 9

Abstract

Several high-performance OCR products exist for Chinese and for English separately. However, no single OCR technique suits both languages at once, because the two scripts differ so much. Meanwhile, the number of Chinese/English mixed documents is growing rapidly with globalization, so mixed-document processing is an important problem. The key problem is how to split a mixed document into a Chinese part and an English part, so that a different OCR technique can be applied to each part. To further improve the performance of the previous system, a novel Chinese/English split algorithm based on global information is proposed, and a rule for language identification is derived from the Bayesian formula. Experiments show that the system error rate drops from 1.52% to 0.87% on magazine samples and from 1.32% to 0.75% on book samples; more than 2/5 of the errors are eliminated, which provides experimental support for our research work.
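The abstract does not give the form of the Bayesian rule or the features it uses, so the following is only a minimal sketch of a generic Bayes decision rule for labelling a single character as Chinese or English. Everything specific in it is an assumption made for illustration: the single feature (the width/height ratio of a character box), the Gaussian class models, the numeric parameters, and the names `MODELS`, `posterior`, `identify` and `relative_reduction`. The class priors stand in loosely for page-level ("global") statistics. The last helper only rechecks the "more than 2/5" claim from the reported error rates.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical class models for one feature: the width/height ratio of a
# character bounding box. All numbers are illustrative, not from the paper.
MODELS = {
    "chinese": {"prior": 0.7, "mean": 1.00, "std": 0.10},
    "english": {"prior": 0.3, "mean": 0.55, "std": 0.15},
}


def posterior(aspect_ratio, models=MODELS):
    """Bayes formula: P(lang | x) = P(x | lang) * P(lang) / P(x)."""
    joint = {
        lang: m["prior"] * gaussian_pdf(aspect_ratio, m["mean"], m["std"])
        for lang, m in models.items()
    }
    evidence = sum(joint.values())
    return {lang: p / evidence for lang, p in joint.items()}


def identify(aspect_ratio, models=MODELS):
    """Pick the language with the largest posterior probability."""
    post = posterior(aspect_ratio, models)
    return max(post, key=post.get)


def relative_reduction(before, after):
    """Fraction of errors removed when the rate drops from before to after."""
    return (before - after) / before


if __name__ == "__main__":
    for ratio in (1.0, 0.5):
        print(ratio, identify(ratio), posterior(ratio))
    # Reported error rates (percent) from the abstract.
    print("magazine:", round(relative_reduction(1.52, 0.87), 3))
    print("book:", round(relative_reduction(1.32, 0.75), 3))
```

With the reported rates, both relative reductions come out above 0.4, which is consistent with the statement that more than 2/5 of the errors are eliminated.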