Investigation of Automatic Part-of-Speech Tagging using CRF, HMM and LSTM on Misspelled and Edited Texts

Farhad Aydinov, Igbal Huseynov, Sofiya Sayadzada, S. Rustamov
Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference, published 2022-12-17
DOI: https://doi.org/10.1145/3582099.3582103

Abstract

Part-of-speech tagging is the process of assigning each word in a given text to its appropriate part of speech in order to reduce the ambiguity that may arise from the contextual usage of words. In this paper, the problem of word sense disambiguation in the Azerbaijani language is addressed by applying part-of-speech tagging to two different data corpora, misspelled text and edited (clean) text, using three machine learning algorithms: Hidden Markov Model, Long Short-Term Memory, and Conditional Random Fields. The paper presents a comparative analysis of the outcomes of these algorithms and their accuracy scores. The misspelled dataset for the experiments was provided by Unibank from its chatbot dialogues, while the clean textual data was retrieved from books and newspapers in Azerbaijani. The experiments showed that the Bidirectional LSTM has the highest accuracy scores for both the edited (98.2%) and noisy (96.2%) data corpora. The suggested models can be used in applications of algorithms that focus on part-of-speech tags and the syntactic structure of Azerbaijani, an agglutinative language belonging to the Turkic language family, thus enabling the research to be extended to other agglutinative languages with similar grammatical structure.
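To make the Hidden Markov Model approach mentioned above more concrete, the following is a minimal sketch of HMM tagging with Viterbi decoding. It is not the authors' implementation: the toy English corpus, the add-one smoothing, and the names `TRAIN`, `train_hmm`, `log_prob`, and `viterbi` are all assumptions made for illustration, and the paper's Azerbaijani data and model settings are not reproduced here.

```python
from collections import Counter, defaultdict
import math

# Toy tagged corpus (hypothetical English examples, NOT the paper's data).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most likely tag sequence for a non-empty list of words."""
    tags = sorted(emit)
    n_words = len(vocab) + 1  # one extra slot for unseen words
    # Each column maps tag -> (best log score, backpointer to previous tag).
    best = [{
        t: (log_prob(trans["<s>"], t, len(tags))
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for w in words[1:]:
        col = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, len(tags)), p)
                for p in tags
            )
            col[t] = (score + log_prob(emit[t], w, n_words), prev)
        best.append(col)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]


trans, emit, vocab = train_hmm(TRAIN)
print(viterbi(["the", "cat", "barks"], trans, emit, vocab))
# ['DET', 'NOUN', 'VERB']
```

The smoothing term is what lets a tagger of this kind assign a tag to a word form it never saw in training, which is the situation misspelled input creates. How the paper's models handle such forms is not stated in the abstract.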