Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

Impact factor 2.1 | Zone 3, Computer Science | JCR Q3, Computer Science, Interdisciplinary Applications
Mariana Dias, Carla Teixeira Lopes
Journal: ACM Journal on Computing and Cultural Heritage
Publication date: 2023-06-30
Publication type: Journal Article
DOI: https://dl.acm.org/doi/10.1145/3606705
Citation count: 0

Abstract

Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods’ parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays’ covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
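
To make the two objectives named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the Levenshtein edit distance between an OCR output and a ground-truth transcription, counts correctly identified words, and keeps the candidates that no other candidate dominates, which is the comparison that non-dominated sorting in NSGA-II relies on. The function names, the whitespace tokenization, and the multiset word-matching rule are assumptions made for illustration. The genetic algorithm itself (selection, crossover, mutation) is not shown.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def words_correct(ocr_text: str, truth: str) -> int:
    """Assumed metric: size of the multiset overlap of whitespace-separated words."""
    overlap = Counter(ocr_text.split()) & Counter(truth.split())
    return sum(overlap.values())


Score = Tuple[int, int]  # (edit distance to minimize, correct words to maximize)


def dominates(p: Score, q: Score) -> bool:
    """True if p is no worse than q on both objectives and strictly better on one."""
    no_worse = p[0] <= q[0] and p[1] >= q[1]
    strictly_better = p[0] < q[0] or p[1] > q[1]
    return no_worse and strictly_better


def pareto_front(scores: Sequence[Score]) -> List[int]:
    """Indices of the scores that no other score dominates."""
    return [i for i, p in enumerate(scores)
            if not any(dominates(q, p) for q in scores)]


if __name__ == "__main__":
    truth = "Teatro Nacional 1950"
    # Hypothetical OCR outputs, e.g. from different pre-processing parameter sets.
    candidates = ["Teatr0 Nacional 1950", "Teatro Nacional 1950", "Teatro Nac 1950"]
    scores = [(levenshtein(c, truth), words_correct(c, truth)) for c in candidates]
    print(scores, pareto_front(scores))
```

In a tuning loop, each candidate would correspond to one parameter setting of a pre-processing method, and the front would hold the settings that trade edit distance against word count without being beaten on both.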

Source journal
ACM Journal on Computing and Cultural Heritage (Arts and Humanities: Conservation)
CiteScore: 4.60
Self-citation rate: 8.30%
Articles published: 90
About the journal: ACM Journal on Computing and Cultural Heritage (JOCCH) publishes papers of significant and lasting value in all areas relating to the use of information and communication technologies (ICT) in support of Cultural Heritage. The journal encourages the submission of manuscripts that demonstrate innovative use of technology for the discovery, analysis, interpretation and presentation of cultural material, as well as manuscripts that illustrate applications in the Cultural Heritage sector that challenge the computational technologies and suggest new research opportunities in computer science.