A statistical and rule-based spelling and grammar checker for Indonesian text

2017 International Conference on Data and Software Engineering (ICoDSE), Pub Date: 2017-11-01, DOI: 10.1109/ICODSE.2017.8285846

Asanilta Fahda, A. Purwarianti


Citations: 14

Abstract

Spelling and grammar checkers are widely used tools that help detect and correct various writing errors. However, there are currently no proofreading systems capable of checking both spelling and grammar errors in Indonesian text. This paper proposes an Indonesian spelling and grammar checker prototype that combines rules and statistical methods. The rule matcher module currently uses 38 rules that detect, correct, and explain common errors in punctuation, word choice, and spelling. The spelling checker module looks up every word in a dictionary trie to find misspellings and uses Damerau-Levenshtein distance neighbors as correction candidates. Morphological analysis is also applied to certain word forms. A bigram/co-occurrence Hidden Markov Model is used to rank and select the candidates. The grammar checker uses a trigram language model over tokens, POS tags, or phrase chunks to identify sentences with incorrect structures. In experiments, the co-occurrence HMM with an emission probability weight coefficient of 0.95 was selected as the most suitable model for the spelling checker. For the grammar checker, the phrase chunk model that normalizes by chunk length and uses a threshold score of −0.4 gave the best results. The document evaluation of this system showed an overall accuracy of 83.18%. The prototype is implemented as a web application.
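The candidate-generation step described in the abstract (dictionary trie lookup, then Damerau-Levenshtein neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trie` class, the `osa_distance` and `candidates` functions, the toy lexicon, and the `max_dist` parameter are all assumptions made for illustration. The distance used here is the restricted variant (optimal string alignment), which counts an adjacent transposition as one edit, and the search is a brute-force scan over the lexicon rather than a pruned trie walk. The HMM ranking step is not covered.

```python
from typing import Iterable, Iterator, List

_END = None  # sentinel key marking the end of a stored word


class Trie:
    """Minimal dictionary trie built from nested dicts (illustrative only)."""

    def __init__(self, words: Iterable[str] = ()) -> None:
        self.root: dict = {}
        for word in words:
            self.insert(word)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True

    def __contains__(self, word: str) -> bool:
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return _END in node

    def words(self) -> Iterator[str]:
        """Yield every stored word by walking the trie depth-first."""
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            for key, child in node.items():
                if key is _END:
                    yield prefix
                else:
                    stack.append((prefix + key, child))


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "kn" <-> "nk".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def candidates(word: str, trie: Trie, max_dist: int = 1) -> List[str]:
    """Return lexicon words within max_dist edits; empty if word is known."""
    if word in trie:
        return []
    scored = [(osa_distance(word, w), w) for w in trie.words()]
    return [w for dist, w in sorted(scored) if dist <= max_dist]


# Hypothetical toy lexicon, not taken from the paper.
toy_trie = Trie(["makan", "makin", "makna", "akan", "mana"])
print(candidates("maakn", toy_trie))
```

In a full system, the candidates returned here would then be ranked in context, which the abstract attributes to the bigram/co-occurrence HMM.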


