Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach

Journal of Information Engineering and Applications, Pub Date: 2020-09-01, DOI: 10.7176/jiea/10-4-02

Tsegaye Kassa


Citations: 1

Abstract

Spell checking is the process of finding misspelled words and, where possible, correcting them. Most modern commercial spell checkers take a straightforward approach to finding misspellings: a word is considered erroneous when it is not found in the dictionary. However, this approach cannot check whether a word is correct in its context; a valid dictionary word used in the wrong context is called a real-word spelling error. To address this, state-of-the-art approaches use context features from fixed-size n-grams (i.e., trigrams), which reduces the model's effectiveness because the feature set is limited. In this paper, we address this limitation by adopting sentence-level n-gram features for real-word spelling error detection and correction. In this technique, all possible word n-grams are used to teach the proposed model the properties of the target language, which enhances its effectiveness. The only corpus required to train the proposed model is an unsupervised corpus (i.e., raw text), which makes the model adaptable to any natural language. For demonstration purposes, we apply it to under-resourced languages such as Amharic, Afaan Oromo, and Tigrigna. The model was evaluated in terms of recall, precision, and F-measure, and compared with the fixed n-gram context feature approach from the literature to assess whether the proposed technique performs as well. The experimental results indicate that the proposed model with sentence-level n-gram context features achieves better results: for real-word error detection and correction, it achieves average F-measures of 90.03%, 85.95%, and 84.24% for Amharic, Afaan Oromo, and Tigrigna, respectively.

Keywords: sentence-level n-gram, real-word spelling error, spell checker, unsupervised corpus-based spell checker

DOI: 10.7176/JIEA/10-4-02

Publication date: September 30th, 2020
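
The idea of using every word n-gram in a sentence as context, rather than a fixed trigram window, can be illustrated with a small sketch. The abstract does not specify the paper's actual model, so everything below is an assumption made for illustration: the function names, the scoring rule (a length-weighted count of n-grams already seen in the raw corpus), candidate generation by Levenshtein distance, and the toy English corpus standing in for Amharic, Afaan Oromo, or Tigrigna text.

```python
from collections import Counter


def sentence_ngrams(tokens):
    """Return every contiguous word n-gram, for n = 1 up to the sentence length."""
    return [tuple(tokens[i:i + n])
            for n in range(1, len(tokens) + 1)
            for i in range(len(tokens) - n + 1)]


def train(corpus_sentences):
    """Count all sentence-level n-grams in raw, unlabeled text."""
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(sentence_ngrams(sentence.lower().split()))
    return counts


def score(tokens, counts):
    """Hypothetical scoring rule: each n-gram seen in training adds its length,
    so longer matching contexts weigh more than short ones."""
    return sum(len(gram) for gram in sentence_ngrams(tokens) if gram in counts)


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct(sentence, counts, max_dist=1):
    """Try replacing one word with a similar vocabulary word; keep the
    replacement only if it strictly improves the sentence score."""
    tokens = sentence.lower().split()
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    best, best_score = tokens, score(tokens, counts)
    for i, word in enumerate(tokens):
        for cand in vocab:
            if cand != word and edit_distance(word, cand) <= max_dist:
                trial = tokens[:i] + [cand] + tokens[i + 1:]
                trial_score = score(trial, counts)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
    return " ".join(best)


# Toy raw corpus (assumed for illustration only).
corpus = [
    "she drove the car to work",
    "the cat sat on the mat",
    "he parked the car outside",
]
model = train(corpus)
# "cat" is a valid word here, but it does not fit this context.
print(correct("she drove the cat to work", model))
```

A sentence is flagged as containing a real-word error when some single-word substitution raises its score, and the best-scoring substitution serves as the correction. A trigram-only variant would restrict `sentence_ngrams` to n ≤ 3, which is the fixed-size baseline the abstract compares against.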
