{"title":"Word embedding-based Part of Speech tagging in Tamil texts","authors":"Sajeetha Thavareesan, S. Mahesan","doi":"10.1109/ICIIS51140.2020.9342640","DOIUrl":null,"url":null,"abstract":"This paper proposes a word embedding-based Part of Speech (POS) tagger for Tamil language. The experiments are conducted with different word embeddings BoW, TF-IDF, Word2vec, fastText and GloVe that are created using UJ-Tamil corpus. Different combinations of eight features with three classifiers linear SVM, Extreme Gradient Boosting and k-Nearest Neighbor are used to build the POS tagger. The results are compared against Viterbi algorithm-based POS tagger. The results show that word embedding can be used for POS tagging with good performance. BoW, TF-IDF and fastText give an impressive performance compared with Word2vec and GloVe. The accuracy of 99% is obtained with word embedding of BoW and TF-IDF with unigrams as well as bigrams and with linear SVM classifier. POS tag of a given word can be identified with 99% of accuracy using word embeddings based POS tagger in Tamil.","PeriodicalId":352858,"journal":{"name":"2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"97","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIIS51140.2020.9342640","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 97
Abstract
This paper proposes a word embedding-based Part of Speech (POS) tagger for the Tamil language. Experiments are conducted with five word representations (BoW, TF-IDF, Word2vec, fastText and GloVe), all created from the UJ-Tamil corpus. The POS taggers are built from different combinations of eight features with three classifiers: linear SVM, Extreme Gradient Boosting and k-Nearest Neighbor. The results are compared against a Viterbi algorithm-based POS tagger and show that word embeddings can be used for POS tagging with good performance. BoW, TF-IDF and fastText perform notably better than Word2vec and GloVe. An accuracy of 99% is obtained with BoW and TF-IDF representations over unigrams as well as bigrams, combined with the linear SVM classifier. A word embedding-based tagger can thus identify the POS tag of a given Tamil word with 99% accuracy.
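As a rough illustration of the general idea in the abstract (TF-IDF weighted unigram and bigram features fed to a classifier), the following is a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not list the eight features, so the two features used here (the word itself and the previous-word/word pair) are assumptions. The toy sentences and the PRON/NOUN/VERB tagset are hypothetical and are not taken from the UJ-Tamil corpus. To stay dependency-free, the classifier is a 1-nearest-neighbour lookup by cosine similarity, standing in for the k-Nearest Neighbor option; it is not the linear SVM that produced the reported 99% accuracy.

```python
import math
from collections import Counter

# Hypothetical toy training data; NOT from the UJ-Tamil corpus.
TRAIN = [
    [("நான்", "PRON"), ("புத்தகம்", "NOUN"), ("படித்தேன்", "VERB")],
    [("அவன்", "PRON"), ("வீடு", "NOUN"), ("வந்தான்", "VERB")],
]


def token_features(words, i):
    """Assumed feature set: word unigram plus (previous word, word) bigram."""
    prev_word = words[i - 1] if i > 0 else "<s>"
    return [f"w={words[i]}", f"prev+w={prev_word}|{words[i]}"]


def tfidf(feats, idf):
    """L2-normalised TF-IDF vector; features unseen in training are dropped."""
    vec = {f: tf * idf[f] for f, tf in feats.items() if f in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {f: v / norm for f, v in vec.items()} if norm else {}


def fit(tagged_sentences):
    """Treat every training token as one 'document' and vectorise it."""
    samples = []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        for i, (_, tag) in enumerate(sent):
            samples.append((Counter(token_features(words, i)), tag))
    n = len(samples)
    # Document frequency: number of training tokens in which a feature occurs.
    df = Counter(f for feats, _ in samples for f in feats)
    # Smoothed inverse document frequency.
    idf = {f: math.log((1 + n) / (1 + c)) + 1.0 for f, c in df.items()}
    vectors = [(tfidf(feats, idf), tag) for feats, tag in samples]
    return idf, vectors


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(v * b.get(f, 0.0) for f, v in a.items())


def tag_sentence(words, idf, vectors):
    """1-nearest-neighbour tagging: copy the tag of the most similar sample."""
    tags = []
    for i in range(len(words)):
        query = tfidf(Counter(token_features(words, i)), idf)
        # If nothing overlaps, every similarity is 0 and max() falls back
        # to the first training sample.
        best = max(vectors, key=lambda item: cosine(query, item[0]))
        tags.append(best[1])
    return tags


idf, vectors = fit(TRAIN)
print(tag_sentence(["அவன்", "புத்தகம்", "படித்தேன்"], idf, vectors))
```

Because every word in the example sentence also occurs in the toy training data, each one is matched to its own training instance. A word with no overlapping features gets an arbitrary tag in this sketch, so a realistic tagger would need richer features (for instance subword information, which may be one reason fastText does well in the reported results) and a far larger training set.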