A study on utilizing OCR technology in building text database

Proceedings. Tenth International Workshop on Database and Expert Systems Applications. DEXA 99
Pub Date: 1999-09-01
DOI: 10.1109/DEXA.1999.795250

S. Hahn, J. Lee, J. H. Kim


Citations: 10

Abstract


Optical character recognition (OCR) may be the most plausible method for building databases from printed documents. This paper describes the points to consider when selecting an OCR system for building a database. Based on our experiments with four commercial OCR systems, we chose the one with the highest recognition rate to build an OCR text database. The character recognition rate was 90.5% over 970 abstracts of conference proceedings in Korean, which is still insufficient for practical use. To make practical use of OCR texts that contain approximately 10% character-level errors, we need to investigate whether automatic indexing yields acceptable retrieval performance, and to evaluate which indexing method performs better. Experimental results show that 2-gram indexing provides retrieval efficiency similar to that of morpheme-based indexing for the Korean OCR text database. In addition, the results retrieved from the indexed OCR texts are similar to those selected by experts.
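
To illustrate why 2-gram indexing can tolerate character-level OCR errors, here is a minimal Python sketch of a character-bigram inverted index. It is not the authors' implementation: the function names, the sample documents, the simulated OCR error, and the overlap-count ranking are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def char_bigrams(text):
    """Return overlapping character 2-grams for each whitespace-separated token."""
    # Normalize to NFC so each Hangul syllable is a single code point.
    text = unicodedata.normalize("NFC", text)
    grams = []
    for token in text.split():
        if len(token) == 1:
            grams.append(token)  # keep single-character tokens as-is
        else:
            grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def build_index(docs):
    """Map each 2-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query 2-grams they contain."""
    scores = defaultdict(int)
    for gram in set(char_bigrams(query)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data (not from the paper).
docs = {
    "clean": "문자인식 데이터베이스",
    "ocr": "문자인식 데이터배이스",  # simulated OCR error: one syllable misread
    "other": "정보검색 시스템",
}
index = build_index(docs)
results = search(index, "데이터베이스")
print(results)
```

In this toy example the document with one misrecognized syllable still shares three of the five query 2-grams, so it is retrieved with a lower score instead of being missed outright, as it would be under exact whole-word matching.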
