Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

IF 2.2; Zone 3, Computer Science; JCR Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

ACM Journal on Computing and Cultural Heritage. Pub Date: 2023-06-30. DOI: https://dl.acm.org/doi/10.1145/3606705

Mariana Dias, Carla Teixeira Lopes


Citations: 0

Abstract

Linked data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are not machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach formulates a multi-objective problem that minimizes the Levenshtein edit distance and maximizes the number of correctly identified words, and it uses a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods’ parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where text recognition without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays’ covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
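The two objectives named in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: the function names are hypothetical, the definition of a "correctly identified word" (a multiset match on whitespace-separated tokens) is an assumption, and NSGA-II itself is not implemented. It only shows the Levenshtein distance, a word-match count, and the non-dominated (Pareto) test on which non-dominated sorting relies.

```python
from collections import Counter
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_words(reference: str, recognized: str) -> int:
    """Assumed definition: number of whitespace-separated tokens of the
    OCR output that also occur in the reference, counted as a multiset."""
    overlap = Counter(reference.split()) & Counter(recognized.split())
    return sum(overlap.values())


def objectives(reference: str, recognized: str) -> Tuple[int, int]:
    """Both objectives expressed as values to minimize: the edit distance,
    and the negated count of correctly identified words."""
    return (levenshtein(reference, recognized),
            -correct_words(reference, recognized))


def dominates(p: Tuple[int, int], q: Tuple[int, int]) -> bool:
    """p dominates q if it is no worse in every objective and
    strictly better in at least one (all objectives minimized)."""
    return (all(x <= y for x, y in zip(p, q))
            and any(x < y for x, y in zip(p, q)))


def pareto_front(points: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

In a tuning loop, each candidate parameter set for a pre-processing method would be scored by running OCR on the processed image and passing the output to `objectives`; a multi-objective optimizer then keeps the candidates whose scores survive `pareto_front`.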


Source journal

ACM Journal on Computing and Cultural Heritage (Arts and Humanities-Conservation)

CiteScore: 4.60

Self-citation rate: 8.30%

Journal description: ACM Journal on Computing and Cultural Heritage (JOCCH) publishes papers of significant and lasting value in all areas relating to the use of information and communication technologies (ICT) in support of Cultural Heritage. The journal encourages the submission of manuscripts that demonstrate innovative use of technology for the discovery, analysis, interpretation and presentation of cultural material, as well as manuscripts that illustrate applications in the Cultural Heritage sector that challenge the computational technologies and suggest new research opportunities in computer science.