An Immuno-Inspired Approach Towards Post-Processing of OCR Errors

Puberun Boruah
Published in: 2023 Fifth International Conference on Electrical, Computer and Communication Technologies (ICECCT)
Publication date: 2023-02-22
DOI: 10.1109/ICECCT56650.2023.10179692 (https://doi.org/10.1109/ICECCT56650.2023.10179692)

Abstract

Errors are part and parcel of computer vision applications such as Optical Character Recognition (OCR). Unfortunately, the noise these errors introduce only proliferates through the later stages of Natural Language Processing pipelines. Most reported work on post-processing OCR text relies on lexical approaches, feature-based machine learning models, merging of multiple OCR outputs, or other language models. This paper proposes an isolated-word approach to detecting OCR errors that relies on the principles of the Artificial Immune System (AIS). OCR error detection is treated as a classification problem in which OCR errors are pathogens and correct words are host cells. The Negative Selection Algorithm is used to classify each new token as an OCR error (pathogen) or a good term (host cell). A series of experiments illustrates that it is possible to construct such a system to help identify OCR errors independently of the language.
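The abstract does not say how tokens are represented, which matching rule is used, or how detectors are generated, so the following is only a minimal sketch of the general Negative Selection idea under assumed choices. It assumes character trigrams as the token representation, exact trigram equality as the matching rule, and random candidate detectors. The names `ngrams`, `censor`, `random_candidates`, and `is_ocr_error`, the alphabet, and the example words are all hypothetical and not taken from the paper.

```python
import random
import string

N = 3  # n-gram length; an assumed parameter, not taken from the paper
# Assumed alphabet; "^" and "$" are used as word-boundary markers.
ALPHABET = string.ascii_lowercase + string.digits + "^$"


def ngrams(token, n=N):
    """Return the set of character n-grams of a lowercased, boundary-padded token."""
    padded = "^" + token.lower() + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def censor(candidates, self_words, n=N):
    """Negative selection step: discard every candidate that matches a self word.

    A candidate "matches" a word here if it equals one of the word's n-grams
    (an assumed matching rule). The survivors are the detectors.
    """
    self_grams = set()
    for word in self_words:
        self_grams |= ngrams(word, n)
    return {c for c in candidates if c not in self_grams}


def random_candidates(count, rng, n=N):
    """Draw random n-gram candidate detectors over ALPHABET."""
    return {"".join(rng.choice(ALPHABET) for _ in range(n)) for _ in range(count)}


def is_ocr_error(token, detectors, n=N):
    """Classify a token as non-self (suspected OCR error) if any detector matches it."""
    return not detectors.isdisjoint(ngrams(token, n))


# Hypothetical self set ("host cells"): words known to be correct.
self_words = ["the", "modern", "language", "error"]
detectors = censor(random_candidates(5000, random.Random(0)), self_words)

# By construction, no detector can fire on a word from the self set.
print(any(is_ocr_error(w, detectors) for w in self_words))
```

Because detectors are censored against the self set, a word from that set can never be flagged. Whether a given corrupted token is caught depends on how well the surviving detectors cover the non-self space. In this sketch a correct word missing from the self set can also be flagged, so the self set needs to be reasonably representative. Nothing in the sketch depends on a particular language beyond the choice of alphabet and self words.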