Indonesian Part of Speech Tagging Using Hidden Markov Model – Ngram & Viterbi

2019 4th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE) Pub Date : 2019-11-01 DOI:10.1109/ICITISEE48480.2019.9003989

D. E. Cahyani, Michael Juan Vindiyanto


Citations: 6

Abstract

Part-of-speech (POS) tagging is the process of labelling each word in a sentence with its word class. One difficulty in POS tagging is ambiguity: some words are spelled the same but take a different POS tag depending on the context of the sentence. This study addresses the problem with a Hidden Markov Model (HMM) N-gram algorithm combined with the Viterbi algorithm. It describes the development of an Indonesian POS tagging system that uses HMM N-gram models (bigram and trigram) with the Viterbi algorithm, and it compares the results of the HMM bigram and HMM trigram models. A manually labelled Indonesian corpus, the Indonesian Manually Tagged Corpus, serves as the knowledge source for the system. The corpus is first processed with the HMM N-gram algorithm to obtain the rules. The Viterbi algorithm then applies these rules to the data to determine the POS tag with the highest probability. The HMM bigram-Viterbi algorithm achieves the highest accuracy, 77.56%, while the best accuracy of the HMM trigram-Viterbi algorithm is 61.67%. The results show that the HMM N-gram-Viterbi approach can resolve tag ambiguity and that the HMM bigram model is more accurate than the HMM trigram model.
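The pipeline described in the abstract (estimate N-gram statistics from a tagged corpus, then decode with Viterbi) can be illustrated with a minimal bigram sketch. This is not the authors' implementation: the three-sentence corpus, the tag names, the add-one smoothing, and the function names train_bigram_hmm, log_prob and viterbi are all assumptions made for illustration. The word "bisa" is used because it is ambiguous in Indonesian (modal "can" versus the noun "venom").

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (NOT the Indonesian Manually Tagged Corpus).
# Tag names are illustrative. "bisa" appears as modal (MD) and noun (NN).
TRAIN = [
    [("saya", "PRP"), ("bisa", "MD"), ("makan", "VB")],
    [("ular", "NN"), ("itu", "DT"), ("punya", "VB"), ("bisa", "NN")],
    [("dia", "PRP"), ("bisa", "MD"), ("tidur", "VB")],
]


def train_bigram_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
        trans[prev]["</s>"] += 1  # sentence-end marker
    return trans, emit


def log_prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log probability (smoothing is an assumption here)."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))


def viterbi(words, trans, emit):
    """Return the most probable tag sequence under the bigram HMM."""
    tags = list(emit)
    n_tags = len(tags) + 1  # all tags plus the end marker
    n_words = len({w for c in emit.values() for w in c}) + 1  # plus unknown
    # best[i][t] = (log prob of best path ending in tag t at word i, backpointer)
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)
    # Pick the best final tag, including the transition to the end marker.
    last = max(
        tags,
        key=lambda t: best[-1][t][0] + log_prob(trans[t], "</s>", n_tags),
    )
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])  # follow backpointers
    return path[::-1]


trans, emit = train_bigram_hmm(TRAIN)
print(viterbi(["saya", "bisa", "tidur"], trans, emit))
print(viterbi(["ular", "itu", "punya", "bisa"], trans, emit))
```

In this sketch the same surface form "bisa" receives different tags in the two test sentences because the transition scores from the preceding tag differ. A trigram variant would condition each transition on the two previous tags instead of one, which needs more training data to estimate reliably.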

