Optical Character Recognition and text cleaning in the indigenous South African languages

Journal impact factor: 0.4; JCR quartile: Q4 (Linguistics)
D. Prinsloo, Elsabé Taljard, Michelle Goosen
Stellenbosch Papers in Linguistics Plus (SPiL Plus), published 2022-01-01. DOI: 10.5842/64-1-867 (https://doi.org/10.5842/64-1-867)

Abstract

This article follows up on unpublished presentations by the authors on text and corpus cleaning strategies for the African languages. We provide a comparative description of the cleaning of web-sourced and text-based material for corpus compilation, paying specific attention to text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, “web-sourced material” refers to digital data sourced from the internet, whereas “text-based material” refers to hard-copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, and then evaluate three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus that the texts make up. Text corpora intended for lexicographic purposes, for example, can tolerate a higher level of ‘noise’ than those used to compile spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.
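
The granularity argument can be illustrated with a minimal Python sketch. It is not the authors' method: the noise heuristics in `is_suspect`, the application names, and the tolerance values in `TOLERANCE` are all assumptions made up for illustration, and the sample strings are invented rather than taken from any corpus. The idea shown is only that the same text may be clean enough for one application and too noisy for another.

```python
import unicodedata

# Punctuation stripped from token edges before inspection (assumed set).
EDGE_PUNCT = ".,;:!?\"'()[]“”‘’"
# Characters tolerated inside a word besides letters and digits (assumed set).
INNER_OK = "-'’"


def is_suspect(token: str) -> bool:
    """Flag a token that looks like scanning noise under simple, assumed heuristics."""
    # NFC keeps letters with diacritics (e.g. š) as single precomposed characters.
    core = unicodedata.normalize("NFC", token).strip(EDGE_PUNCT)
    if not core:
        return False  # pure punctuation is not counted as noise here
    has_digit = any(c.isdigit() for c in core)
    has_alpha = any(c.isalpha() for c in core)
    if has_digit and has_alpha:
        return True  # digit inside a word, e.g. "s0ma"
    for c in core:
        is_mark = unicodedata.category(c).startswith("M")
        if not (c.isalnum() or c in INNER_OK or is_mark):
            return True  # stray symbol inside a word, e.g. "so#a"
    return False


def noise_ratio(text: str) -> float:
    """Share of whitespace-separated tokens flagged as suspect."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(is_suspect(t) for t in tokens) / len(tokens)


# Hypothetical tolerances: a spelling checker is assumed to need far cleaner
# input than a lexicographic corpus. The numbers are illustrative only.
TOLERANCE = {"lexicography": 0.05, "spellchecker": 0.005}


def clean_enough(text: str, application: str) -> bool:
    """True if the estimated noise is within the tolerance for the application."""
    return noise_ratio(text) <= TOLERANCE[application]


if __name__ == "__main__":
    # Invented sample: 98 clean tokens and 2 noisy ones.
    sample = " ".join(["soma"] * 98 + ["s0ma"] * 2)
    print(noise_ratio(sample))
    print(clean_enough(sample, "lexicography"))
    print(clean_enough(sample, "spellchecker"))
```

In this sketch the invented sample passes the looser lexicographic tolerance and fails the stricter spelling-checker one. A real pipeline would replace the heuristics with language-specific checks for the scanning errors actually observed.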