Axel Jean-Caurant, Nouredine Tamani, V. Courboulay, J. Burie
2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), published 2017-11-01. DOI: 10.1109/ICDAR.2017.197 (https://doi.org/10.1109/ICDAR.2017.197)
Lexicographical-Based Order for Post-OCR Correction of Named Entities
We are in an era of information access in which huge amounts of text are extracted from scanned documents and made available digitally for search. However, old or poorly scanned documents are often badly recognized, which leads not only to imperfect Optical Character Recognition (OCR) output but also to poor indexing and to information that cannot be retrieved. To address these issues, we introduce a lexicographical-based approach to post-OCR correction of named entities. By lexicographically combining a contextual similarity and an edit distance, the approach builds a graph connecting similar named entities in order to automatically correct the corresponding OCR-processed text. We evaluated the approach on a generated dataset. The first results show that, despite the high level of text degradation, the approach corrected more than a third of the named entities without any external knowledge.
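
The abstract does not give the details of the method, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that contextual similarity can be approximated by the Jaccard overlap of the words seen around each entity, that the lexicographic order ranks candidates by edit distance first and by contextual similarity second, and that each group of linked entities is corrected to its most frequent spelling. The function names, the thresholds `max_dist` and `min_sim`, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def context_similarity(ctx_a: set, ctx_b: set) -> float:
    """Jaccard overlap of context-word sets (an assumed stand-in measure)."""
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0


def build_graph(contexts: dict, max_dist: int = 2, min_sim: float = 0.0) -> dict:
    """Link each entity to its best candidate under a lexicographic order."""
    graph = defaultdict(set)
    names = sorted(contexts)
    for name in names:
        candidates = []
        for other in names:
            if other == name:
                continue
            dist = edit_distance(name, other)
            sim = context_similarity(contexts[name], contexts[other])
            if dist <= max_dist and sim > min_sim:
                # Assumed order: smaller distance first, then higher similarity.
                candidates.append(((dist, -sim), other))
        if candidates:
            best = min(candidates)[1]
            graph[name].add(best)
            graph[best].add(name)
    return graph


def correct(counts: Counter, graph: dict) -> dict:
    """Map each entity to the most frequent form in its connected component."""
    mapping, seen = {}, set()
    for start in counts:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        canonical = max(component, key=lambda n: (counts[n], n))
        for node in component:
            mapping[node] = canonical
    return mapping


# Hypothetical sample: one OCR-damaged variant and one unrelated entity.
contexts = {
    "Rochelle": {"port", "harbour", "city"},
    "Rochel1e": {"port", "city"},
    "Rochefort": {"arsenal", "river"},
}
counts = Counter({"Rochelle": 5, "Rochel1e": 1, "Rochefort": 3})
mapping = correct(counts, build_graph(contexts))
print(mapping)
```

In this sketch the lexicographic combination is just a tuple comparison: candidates are compared on edit distance, and contextual similarity only decides between candidates at the same distance. Swapping the two tuple components would give the opposite priority, and the abstract does not say which one the paper uses.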