A study on utilizing OCR technology in building text database
Authors: S. Hahn, J. Lee, J. H. Kim
Published in: Proceedings. Tenth International Workshop on Database and Expert Systems Applications. DEXA 99, 1999-09-01
DOI: https://doi.org/10.1109/DEXA.1999.795250
Optical character recognition (OCR) may be the most plausible method for building databases from printed documents. This paper describes the points to consider when selecting an OCR system for building a database. Based on our experiments with four commercial OCR systems, we chose the one with the highest recognition rate to build an OCR text database. The character recognition rate was 90.5% over 970 abstracts of conference proceedings in Korean, which is still insufficient for practical use. To use OCR text with approximately 10% character-level errors in practice, we need to investigate whether automatic indexing yields acceptable retrieval performance and to evaluate which indexing method performs better. Experimental results show that 2-gram indexing provides retrieval efficiency similar to that of morpheme-based indexing for the Korean OCR text database. In addition, the results retrieved from the indexed OCR texts are similar to those selected by experts.
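The abstract does not say how the 2-gram indexing was implemented, so the following is only a minimal sketch of character 2-gram indexing under stated assumptions: tokens are split on whitespace, each token is broken into overlapping character pairs, and documents are ranked by the number of distinct query 2-grams they contain. The function names (`char_bigrams`, `build_index`, `search`), the sample documents, and the simulated OCR error are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def char_bigrams(text):
    """Return the overlapping character 2-grams of each whitespace-delimited token."""
    grams = []
    for token in text.split():
        if len(token) == 1:
            # Assumption: keep single-character tokens as they are.
            grams.append(token)
        grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def build_index(docs):
    """Build an inverted index that maps each 2-gram to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query 2-grams they contain."""
    scores = defaultdict(int)
    for gram in set(char_bigrams(query)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data; "ocr_error" simulates one misrecognized character.
docs = {
    "clean": "문자 인식 시스템",
    "ocr_error": "문자 언식 시스템",
    "other": "정보 검색",
}
index = build_index(docs)
results = search(index, "인식 시스템")
print(results)
```

In this toy data the clean document matches all three query 2-grams, while the document with the simulated error still matches two of them, because a single wrong character only damages the 2-grams that overlap it. This is one plausible reason why 2-gram indexing can tolerate character-level OCR errors; it is an illustration, not a reproduction of the paper's experiment.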