A study on utilizing OCR technology in building text database

S. Hahn, J. Lee, J. H. Kim
Venue: Proceedings. Tenth International Workshop on Database and Expert Systems Applications. DEXA 99
Published: 1999-09-01
DOI: 10.1109/DEXA.1999.795250 (https://doi.org/10.1109/DEXA.1999.795250)
Citations: 10

Abstract

Optical character recognition (OCR) may be the most plausible method for building databases from printed documents. This paper describes the points to consider when selecting an OCR system for building a database. Based on our experiments with four commercial OCR systems, we chose the one with the highest recognition rate to build an OCR text database. The character recognition rate was 90.5% over 970 abstracts of conference proceedings in Korean. This recognition rate is still insufficient for practical use. To make practical use of OCR texts that have approximately 10% character-level errors, we need to investigate whether automatic indexing yields acceptable retrieval performance. In addition, it is necessary to evaluate which indexing method performs better. Experimental results show that 2-gram indexing provides retrieval efficiency similar to that of morpheme-based indexing for the Korean OCR text database. In addition, the results retrieved from the indexed OCR texts are similar to those selected by experts.
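
The abstract does not describe how the 2-gram index was built or how results were ranked, so the following is only a minimal sketch of the general idea, under stated assumptions. It indexes overlapping character 2-grams of each whitespace-separated token and ranks documents by the number of shared 2-grams. The function names (`char_bigrams`, `build_index`, `search`), the sample documents, and the overlap-count scoring are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def char_bigrams(text):
    """Return overlapping character 2-grams for each whitespace-separated token.

    Single-character tokens are kept as-is so they remain searchable.
    """
    grams = []
    for token in text.split():
        if len(token) == 1:
            grams.append(token)
        grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def build_index(docs):
    """Build an inverted index mapping each 2-gram to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query 2-grams they contain.

    Overlap counting is an assumed scoring rule chosen for illustration.
    """
    scores = defaultdict(int)
    for gram in set(char_bigrams(query)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data: "ocr_error" simulates one misrecognized syllable
# (인식 -> 인삭) in otherwise identical text.
docs = {
    "clean": "문자 인식 시스템",
    "ocr_error": "문자 인삭 시스템",
    "other": "데이터베이스 구축",
}
index = build_index(docs)
results = search(index, "인식 시스템")
print(results)  # [('clean', 3), ('ocr_error', 2)]
```

In this toy case the document with the simulated OCR error is still retrieved, because the undamaged 2-grams of the neighboring token continue to match. This is one intuition for why 2-gram indexing can tolerate character-level errors.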