Part-of-Speech (POS) Tagging for Standard Brunei Malay: A Probabilistic and Neural-Based Approach
Authors: Izzati Mohaimin, R. Apong, A. R. Damit
DOI: 10.12720/jait.14.4.830-837 (https://doi.org/10.12720/jait.14.4.830-837)
Publication type: Journal Article. Publication date: 2023-01-01.
Abstract: As the volume of online information has grown, text mining researchers have developed Natural Language Processing (NLP) tools to extract relevant and useful information from textual data such as online news articles. Malay is widely spoken, especially in Southeast Asia, but NLP resources for it, such as Malay corpora and Part-of-Speech (POS) taggers, are scarce. Existing tools are mainly based on the Standard Malay of Malaysia and on Indonesian, and none exist for Bruneian Malay. We address this gap by building a Standard Brunei Malay corpus of over 114,000 lexical tokens, annotated with a tagset of 17 Malay POS tags. We then implement two commonly used POS tagging techniques, Conditional Random Field (CRF) and Bi-directional Long Short-Term Memory (BLSTM), to develop Bruneian POS taggers and compare their performance. Both the CRF and BLSTM models performed well in predicting POS tags on Bruneian texts. However, the CRF models outperformed BLSTM: the CRF using all features achieved an F-Measure of 92.06% on news articles and 90.71% on crime articles. Adding a batch normalization layer to the BLSTM architecture increased its performance by 7.13%. To further improve the BLSTM models, we suggest increasing the training data and experimenting with different hyperparameter settings. The findings also indicate that modelling BLSTM with fastText improved the POS prediction of Bruneian words.
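The abstract mentions a CRF "using all features" but does not list them. As a rough illustration of what per-token features for a CRF POS tagger can look like, here is a minimal sketch. Everything in it is an assumption made for illustration, not the authors' method: the feature names, the three-character affix windows, the helper names `token_features` and `sentence_features`, and the example sentence (a made-up Malay sentence, not taken from the corpus).

```python
from typing import Dict, List


def token_features(sentence: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the token at position i.

    Hypothetical feature set for illustration only; the paper's
    actual CRF features are not given in the abstract.
    """
    word = sentence[i]
    features: Dict[str, object] = {
        "bias": 1.0,
        "word.lower": word.lower(),
        # Short affix windows can hint at Malay morphology (e.g. "men-", "-kan").
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        "is_digit": word.isdigit(),
        # Hyphens often mark reduplicated forms such as "kanak-kanak".
        "has_hyphen": "-" in word,
    }
    # Context features: neighbouring words, or sentence-boundary markers.
    if i > 0:
        features["prev.lower"] = sentence[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(sentence) - 1:
        features["next.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(sentence: List[str]) -> List[Dict[str, object]]:
    """Return one feature dict per token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Made-up example sentence ("Police detained two men.").
example = ["Polis", "menahan", "dua", "lelaki", "."]
feats = sentence_features(example)
```

Feature dictionaries of this shape are the usual input to CRF toolkits that accept per-token dicts (sklearn-crfsuite is one such library), paired with one gold POS tag per token during training.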