{"title":"HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation","authors":"Peerachet Porkaew, P. Boonkwan, T. Supnithi","doi":"10.1109/iSAI-NLP54397.2021.9678190","DOIUrl":null,"url":null,"abstract":"Recently, pretrained language representations like BERT and RoBERTa have drawn more and more attention in NLP. In this work we propose a pretrained language representation for Thai language, which based on RoBERTa architecture. Our monolingual data used in the training are collected from publicly available resources including Wikipedia, OpenSubtitles, news and articles. Although the pretrained model can be fine-tuned for wide area of individual tasks, fine-tuning the model with multiple objectives also yields a surprisingly effective model. We evaluated the performance of our multi-task model on part-of-speech tagging, named entity recognition and clause boundary prediction. Our model achieves the comparable performance to strong single-task baselines. Our code and models are available at https://github.com/lstnlp/hoogberta.","PeriodicalId":339826,"journal":{"name":"2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/iSAI-NLP54397.2021.9678190","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Recently, pretrained language representations such as BERT and RoBERTa have drawn increasing attention in NLP. In this work we propose a pretrained language representation for Thai, based on the RoBERTa architecture. The monolingual training data are collected from publicly available resources, including Wikipedia, OpenSubtitles, news, and articles. Although the pretrained model can be fine-tuned for a wide range of individual tasks, fine-tuning it with multiple objectives also yields a surprisingly effective model. We evaluate the performance of our multi-task model on part-of-speech tagging, named entity recognition, and clause boundary prediction. Our model achieves performance comparable to strong single-task baselines. Our code and models are available at https://github.com/lstnlp/hoogberta.
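The multi-objective fine-tuning described above can be illustrated with a minimal sketch: a shared encoder feeds three token-level classification heads (POS, NER, clause boundaries), and the three losses are summed. This is not the authors' implementation (which is available at the GitHub URL above); the small stand-in encoder, label counts, and all names below are illustrative assumptions only.

```python
# Hedged sketch of multi-task sequence labeling: a shared encoder with three
# token-level heads whose losses are summed during fine-tuning. The tiny
# Transformer encoder stands in for the RoBERTa-style HoogBERTa encoder;
# sizes and label counts are placeholders, not values from the paper.
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size, hidden=256, n_pos=16, n_ner=9, n_cls=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One linear head per task, all sharing the same contextual features.
        self.pos_head = nn.Linear(hidden, n_pos)   # part-of-speech tags
        self.ner_head = nn.Linear(hidden, n_ner)   # named-entity tags
        self.cls_head = nn.Linear(hidden, n_cls)   # clause-boundary labels

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))       # (batch, seq, hidden)
        return self.pos_head(h), self.ner_head(h), self.cls_head(h)

model = MultiTaskTagger(vocab_size=30000)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: random token ids and gold labels for each of the three tasks.
tokens = torch.randint(0, 30000, (2, 12))
pos_gold = torch.randint(0, 16, (2, 12))
ner_gold = torch.randint(0, 9, (2, 12))
cls_gold = torch.randint(0, 3, (2, 12))

pos_logits, ner_logits, cls_logits = model(tokens)
# Multi-objective fine-tuning: optimize the sum of the per-task losses.
loss = (loss_fn(pos_logits.transpose(1, 2), pos_gold)
        + loss_fn(ner_logits.transpose(1, 2), ner_gold)
        + loss_fn(cls_logits.transpose(1, 2), cls_gold))
loss.backward()
```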