Investigation of Automatic Part-of-Speech Tagging using CRF, HMM and LSTM on Misspelled and Edited Texts

Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference Pub Date : 2022-12-17 DOI:10.1145/3582099.3582103

Farhad Aydinov, Igbal Huseynov, Sofiya Sayadzada, S. Rustamov


Abstract

Part-of-speech tagging is the process of assigning each word in a given text its appropriate part of speech in order to reduce the ambiguity that may arise from the contextual usage of words. In this paper, the problem of word sense disambiguation in the Azerbaijani language is addressed by applying part-of-speech tagging to two different corpora, misspelled text and edited (clean) text, using three machine learning algorithms: Hidden Markov Model (HMM), Long Short-Term Memory (LSTM), and Conditional Random Fields (CRF). The paper presents a comparative analysis of the outcomes of these algorithms and their accuracy scores. The misspelled dataset was provided by Unibank from its chatbot dialogues, while the clean text was retrieved from books and newspapers in Azerbaijani. The experiments showed that the bidirectional LSTM achieved the highest accuracy on both the edited (98.2%) and noisy (96.2%) corpora. The suggested models can be used in applications of algorithms that focus on part-of-speech tags and the syntactic structure of Azerbaijani, an agglutinative language belonging to the Turkic language family, which allows the research to be extended to other agglutinative languages with similar grammatical structure.
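
As an illustration of one of the three model families named in the abstract, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm. It is not the authors' implementation: the toy English sentences, the tag set, the add-one smoothing, and all function names (`train_hmm`, `viterbi`, `log_prob`) are assumptions made for the example, and the paper's corpora are not reproduced here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data (not taken from the paper's corpora).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

START = "<s>"  # pseudo-tag marking the start of a sentence


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] -> count
    emit = defaultdict(Counter)   # emit[tag][word] -> count
    vocab = set()
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)
    return trans, emit, tags, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most likely tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    trans, emit, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans[START], t, n_tags) + log_prob(emit[t], words[0], n_words)
        best[0][t] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )

    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


model = train_hmm(TRAIN)
print(viterbi(["the", "dog", "sleeps"], model))  # ['DET', 'NOUN', 'VERB']
```

Because unseen words receive a small smoothed emission probability under every tag, the transition statistics decide their tag. This is one simple way a tagger can still label misspelled tokens, though the abstract does not state how the paper's models handle them.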
