LFF-POS: A linguistic fusion method to handle out-of-vocabulary words in low-resource part-of-speech tagging
Muhammad Alfian, Umi Laili Yuhana, Daniel Siahaan, Harum Munazharoh, Eric Pardede
MethodsX, Volume 15, Article 103615. Published 2025-09-10. DOI: 10.1016/j.mex.2025.103615
https://www.sciencedirect.com/science/article/pii/S2215016125004595
Abstract
Accurate part-of-speech (POS) tagging is needed to evaluate classroom learning and thereby improve the quality of education. However, tagging accuracy is hampered by limited training data and a high proportion of out-of-vocabulary (OOV) tokens. We present LFF-POS, a linguistic feature fusion method that addresses these limitations for Indonesian. The procedure consists of five sequential steps: (1) tokenizing raw text; (2) extracting three complementary features; (3) merging the resulting vectors; (4) applying self-attention; and (5) training a BiLSTM sequence labeler. By combining the three features, LFF-POS improves tagging accuracy without relying on an external lexicon. Experimental results show that the combined features improve the model's handling of OOV words and yield higher POS tagging accuracy than the baseline and existing methods.
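The first three steps of the procedure (tokenizing, extracting three features, and merging the vectors) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the affix lists, the particular orthographic cues, the hashed character trigrams, and the names `tokenize`, `fuse`, and `CHAR_DIM` are hypothetical and are not taken from the paper. Self-attention and the BiLSTM labeler are left out, since they would require a deep learning library.

```python
import re
import zlib

# Illustrative Indonesian affix lists (an assumption, not the paper's inventory).
PREFIXES = ("meng", "mem", "men", "me", "ber", "ter", "di", "pe", "se", "ke")
SUFFIXES = ("kan", "nya", "an", "i")
CHAR_DIM = 16  # length of the hashed character n-gram vector (assumed)


def tokenize(text):
    """Step 1: split raw text into word tokens (hyphenated words kept whole) and punctuation."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def orthographic_features(token):
    """Surface-shape cues: initial capital, all caps, digit, hyphen, pure punctuation."""
    return [
        float(token[0].isupper()),
        float(token.isupper()),
        float(any(c.isdigit() for c in token)),
        float("-" in token),
        float(not any(c.isalnum() for c in token)),
    ]


def morphological_features(token):
    """One indicator per listed affix; overlapping affixes (e.g. 'mem' and 'me') both fire."""
    word = token.lower()
    # Require at least 3 remaining characters so short words such as "di" do not match.
    prefix_hits = [float(word.startswith(p) and len(word) >= len(p) + 3) for p in PREFIXES]
    suffix_hits = [float(word.endswith(s) and len(word) >= len(s) + 3) for s in SUFFIXES]
    return prefix_hits + suffix_hits


def character_features(token, n=3, dim=CHAR_DIM):
    """Count character n-grams, hashed into a fixed-size vector with a stable hash."""
    padded = f"<{token.lower()}>"
    vec = [0.0] * dim
    for i in range(max(len(padded) - n + 1, 1)):
        gram = padded[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def fuse(token):
    """Steps 2 and 3: extract the three feature groups and concatenate them."""
    return (
        orthographic_features(token)
        + morphological_features(token)
        + character_features(token)
    )


if __name__ == "__main__":
    for tok in tokenize("Siswa membacakan buku-buku di kelas."):
        print(tok, len(fuse(tok)))
```

Because every feature is computed from the token string alone, an unseen word still receives an informative vector, which is the intuition behind handling OOV words without an external lexicon. In a full model, these fused vectors would feed the self-attention layer and then the BiLSTM labeler.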
The model cannot recognize OOV words, which reduces POS tagging accuracy.
This study addresses OOV words by combining orthographic, morphological, and character-level features to improve word representation.
LFF-POS is shown to improve POS tagging performance, most notably raising the OOV F1 score by approximately 14% over the baseline.
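As a rough illustration of an OOV-restricted F1 score, the sketch below scores only tokens absent from the training vocabulary and macro-averages F1 over the tags seen among them. The abstract does not say how the OOV F1 score is defined, so the macro averaging, the lowercased vocabulary lookup, the function name `oov_macro_f1`, and the toy tags are all assumptions.

```python
from collections import defaultdict


def oov_macro_f1(train_vocab, tokens, gold, pred):
    """Macro-averaged F1 over tags, computed only on tokens not in train_vocab."""
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for tok, g, p in zip(tokens, gold, pred):
        if tok.lower() in train_vocab:
            continue  # in-vocabulary token: excluded from the OOV score
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag was wrongly assigned
            fn[g] += 1  # gold tag was missed
    tags = set(tp) | set(fp) | set(fn)
    if not tags:
        return 0.0  # no OOV tokens to score
    scores = [2 * tp[t] / (2 * tp[t] + fp[t] + fn[t]) for t in tags]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data with hypothetical tags; "Siswa" is in-vocabulary and is skipped.
    vocab = {"siswa", "membaca", "buku"}
    tokens = ["Siswa", "mengunduh", "aplikasi", "daring"]
    gold = ["NOUN", "VERB", "NOUN", "ADJ"]
    pred = ["NOUN", "VERB", "NOUN", "NOUN"]
    print(round(oov_macro_f1(vocab, tokens, gold, pred), 4))
```

Reporting this figure next to overall accuracy separates gains on unseen words from gains on words the model already saw in training.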