An Immuno-Inspired Approach Towards Post-Processing of OCR Errors
Puberun Boruah
2023 Fifth International Conference on Electrical, Computer and Communication Technologies (ICECCT), 2023-02-22
DOI: 10.1109/ICECCT56650.2023.10179692
Errors are part and parcel of Computer Vision applications like Optical Character Recognition (OCR). Unfortunately, the noise these errors produce propagates through the later stages of Natural Language Processing pipelines. Among the reported works on post-processing of OCR text, most involve lexical approaches, feature-based machine learning models, merging of OCR outputs, or other language models. This paper proposes an isolated-word approach to detecting OCR errors that relies on the principles of the Artificial Immune System (AIS). OCR error detection is treated as a classification problem in which OCR errors are pathogens and correct words are host cells. The Negative Selection Algorithm is used to classify any new token as an OCR error (pathogen) or a good term (host cell). A series of experiments illustrates that it is possible to construct such a system to help identify OCR errors independently of the language.
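The abstract does not say how tokens or detectors are represented, so the following is only a minimal sketch of negative selection applied to isolated words, not the paper's implementation. It assumes character bigrams as the matching unit, and the names `ngrams`, `train_detectors`, and `is_ocr_error` are hypothetical. Classic negative selection samples candidate detectors at random; here the candidates are enumerated exhaustively so the result is deterministic.

```python
import string
from itertools import product


def ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with ^ and $."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def train_detectors(self_words, alphabet, n=2):
    """Negative selection: discard every candidate that matches a self word.

    Assumption: a candidate "matches" self if it occurs as an n-gram in any
    known-correct word. The surviving candidates are the detectors.
    """
    self_grams = set()
    for word in self_words:
        self_grams |= ngrams(word, n)
    symbols = alphabet + "^$"
    candidates = {"".join(p) for p in product(symbols, repeat=n)}
    return candidates - self_grams


def is_ocr_error(token, detectors, n=2):
    """Flag a token as a pathogen if any detector matches one of its n-grams."""
    return not ngrams(token, n).isdisjoint(detectors)


# Toy "self" set of correct words (host cells); illustrative data only.
self_words = ["the", "quick", "brown", "fox", "hello", "world"]
alphabet = string.ascii_lowercase + string.digits
detectors = train_detectors(self_words, alphabet)

for token in ["hello", "he1lo", "tbe"]:
    label = "pathogen" if is_ocr_error(token, detectors) else "host cell"
    print(token, "->", label)
```

Because the detectors are built only from the alphabet and a list of correct words, nothing in the sketch is tied to a particular language. Its obvious weakness is that a correct word containing a bigram absent from the self set is also flagged, so in practice the self set would need to be far larger than this toy list.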