Evaluating the Morphological and Capitalization Features for Word Embedding-Based POS Tagger in Bahasa Indonesia
L. Manik, Arida Ferti Syafiandini, Hani Febri Mustika, Achmad Fatchuttamam Abka, Y. Rianto
2018 International Conference on Computer, Control, Informatics and its Applications (IC3INA), 2018-11-01
DOI: 10.1109/IC3INA.2018.8629519 (https://doi.org/10.1109/IC3INA.2018.8629519)
Citations: 7
Abstract
In this paper, morphological and capitalization features are employed to improve the current word embedding-based POS tagger for Bahasa Indonesia. The experiments use a simple feedforward neural network with two input layers, one merge layer, and two hidden layers. The first input layer takes word embeddings (CBOW and Skip-gram) as input, while the second input layer takes the morphological and capitalization features. The results show that the selected additional features improve the performance and accuracy of the current word embedding-based POS tagger, although the improvement is modest. Averaged over all word embedding types, the F1 score increases from 93% to 94% and the accuracy increases from 92-93% to 94-95% on a manually tagged corpus of about 250,000 tokens (12,775 unique tokens).
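The two-input architecture described in the abstract can be illustrated with a minimal sketch in plain Python. The abstract does not list the exact features, layer sizes, or activations, so everything specific here is an assumption made for illustration: the affix lists `PREFIXES` and `SUFFIXES`, the two capitalization flags, concatenation as the merge operation, ReLU hidden layers, a softmax output, and the helper names `extra_features`, `build_tagger`, and `forward`. The weights are random and untrained, so the sketch shows only the data flow, not the reported results.

```python
import math
import random

# Hypothetical affix lists: the abstract does not enumerate the features used.
PREFIXES = ("me", "ber", "di", "ter", "pe", "se", "ke")
SUFFIXES = ("kan", "an", "nya", "i")


def extra_features(token):
    """Binary capitalization and affix indicators for one token (second input)."""
    lower = token.lower()
    caps = [
        float(token[:1].isupper()),  # first character is a capital letter
        float(token.isupper()),      # whole token is upper case
    ]
    pre = [float(lower.startswith(p)) for p in PREFIXES]
    suf = [float(lower.endswith(s)) for s in SUFFIXES]
    return caps + pre + suf


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias, activation):
    """Fully connected layer: activation(W x + b)."""
    pre_act = [sum(w * v for w, v in zip(row, x)) + b
               for row, b in zip(weights, bias)]
    return activation(pre_act)


def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_tagger(embed_dim, feat_dim, hidden_dim, n_tags, seed=0):
    """Random (untrained) parameters: two hidden layers plus an output layer."""
    rng = random.Random(seed)
    return [
        init_layer(rng, embed_dim + feat_dim, hidden_dim),
        init_layer(rng, hidden_dim, hidden_dim),
        init_layer(rng, hidden_dim, n_tags),
    ]


def forward(params, embedding, features):
    """Merge both inputs, then apply two hidden layers and the tag output."""
    merged = list(embedding) + list(features)  # merge layer, assumed concatenation
    (w1, b1), (w2, b2), (w3, b3) = params
    hidden1 = dense(merged, w1, b1, relu)
    hidden2 = dense(hidden1, w2, b2, relu)
    return dense(hidden2, w3, b3, softmax)


# Usage: a made-up 4-dimensional vector stands in for a CBOW or Skip-gram embedding.
features = extra_features("Membaca")
tagger = build_tagger(embed_dim=4, feat_dim=len(features), hidden_dim=8, n_tags=5)
tag_probabilities = forward(tagger, [0.1, -0.2, 0.3, 0.0], features)
```

In a real setting the embedding would come from a trained CBOW or Skip-gram model and the parameters would be learned from the tagged corpus; the sketch only makes concrete how the second input adds a small hand-crafted feature vector alongside the embedding before the hidden layers.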