A Study on Diacritic Restoration Problem in Vietnamese Text using Deep Learning based Models
Quang-Linh Tran, Gia-Huy Lam, Van-Binh Duong, Trong-Hop Do
2021 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT), published 2021-07-17
DOI: 10.1109/COMNETSAT53002.2021.9530818 (https://doi.org/10.1109/COMNETSAT53002.2021.9530818)
Citations: 1
Abstract
Diacritic restoration is a challenging problem in natural language processing (NLP). With diacritic restoration, one can type text faster and more easily. Diacritic restoration also helps make use of diacritic-missing texts, which are normally discarded in many NLP applications. This paper deals with the diacritic restoration problem for Vietnamese text. Three state-of-the-art deep learning models, Gated Recurrent Unit (GRU), Bidirectional Long Short-Term Memory (Bi-LSTM), and Bidirectional Gated Recurrent Unit (Bi-GRU), were examined for the problem, and the last one turned out to be the best among them. Besides the choice of deep learning model, this paper found that word tokenization, which is the final pre-processing step applied to the data before feeding it to the models, also influences the final accuracy. Of the two word tokenization methods examined, morpheme-based tokenization and phrase-based tokenization, the former yields better results regardless of the deep learning model applied. The experimental results show that the combination of morpheme-based tokenization and Bi-GRU achieves the best diacritic restoration performance, with a BLEU score of 88.06%.
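To make the task concrete, the following is a minimal sketch of how a (diacritic-free input, diacritic-bearing target) training pair might be built from Vietnamese text. It is not the authors' pipeline: the function names `strip_diacritics`, `syllable_tokens`, and `make_pair` are hypothetical, and the sketch assumes that morpheme-based tokenization can be approximated by splitting on whitespace into syllables, which the abstract does not specify.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, leaving plain base letters."""
    # The letters đ/Đ have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode category "Mn").
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def syllable_tokens(text: str) -> list[str]:
    """Assumption: approximate morpheme-level tokens by whitespace syllables."""
    return text.split()


def make_pair(sentence: str) -> tuple[list[str], list[str]]:
    """Return (source tokens without diacritics, target tokens with them)."""
    target = syllable_tokens(sentence)
    source = [strip_diacritics(tok) for tok in target]
    return source, target


if __name__ == "__main__":
    src, tgt = make_pair("Tiếng Việt rất đẹp")
    print(src)
    print(tgt)
```

Under this assumption the source and target sequences always have the same length, so a sequence model such as a Bi-GRU could be trained to map each stripped syllable to its restored form. A phrase-based tokenizer would instead need a word segmenter, which is outside this sketch.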