{"title":"HoogBERTa:使用泰语预训练语言表示的多任务序列标记","authors":"Peerachet Porkaew, P. Boonkwan, T. Supnithi","doi":"10.1109/iSAI-NLP54397.2021.9678190","DOIUrl":null,"url":null,"abstract":"Recently, pretrained language representations like BERT and RoBERTa have drawn more and more attention in NLP. In this work we propose a pretrained language representation for Thai language, which based on RoBERTa architecture. Our monolingual data used in the training are collected from publicly available resources including Wikipedia, OpenSubtitles, news and articles. Although the pretrained model can be fine-tuned for wide area of individual tasks, fine-tuning the model with multiple objectives also yields a surprisingly effective model. We evaluated the performance of our multi-task model on part-of-speech tagging, named entity recognition and clause boundary prediction. Our model achieves the comparable performance to strong single-task baselines. Our code and models are available at https://github.com/lstnlp/hoogberta.","PeriodicalId":339826,"journal":{"name":"2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation\",\"authors\":\"Peerachet Porkaew, P. Boonkwan, T. Supnithi\",\"doi\":\"10.1109/iSAI-NLP54397.2021.9678190\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, pretrained language representations like BERT and RoBERTa have drawn more and more attention in NLP. In this work we propose a pretrained language representation for Thai language, which based on RoBERTa architecture. Our monolingual data used in the training are collected from publicly available resources including Wikipedia, OpenSubtitles, news and articles. Although the pretrained model can be fine-tuned for wide area of individual tasks, fine-tuning the model with multiple objectives also yields a surprisingly effective model. We evaluated the performance of our multi-task model on part-of-speech tagging, named entity recognition and clause boundary prediction. Our model achieves the comparable performance to strong single-task baselines. 
Our code and models are available at https://github.com/lstnlp/hoogberta.\",\"PeriodicalId\":339826,\"journal\":{\"name\":\"2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/iSAI-NLP54397.2021.9678190\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/iSAI-NLP54397.2021.9678190","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Recently, pretrained language representations such as BERT and RoBERTa have drawn increasing attention in NLP. In this work we propose a pretrained language representation for Thai based on the RoBERTa architecture. The monolingual training data are collected from publicly available resources, including Wikipedia, OpenSubtitles, news, and articles. Although the pretrained model can be fine-tuned for a wide range of individual tasks, fine-tuning it with multiple objectives also yields a surprisingly effective model. We evaluate our multi-task model on part-of-speech tagging, named entity recognition, and clause boundary prediction, where it achieves performance comparable to strong single-task baselines. Our code and models are available at https://github.com/lstnlp/hoogberta.
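The abstract describes the multi-task setup only at a high level: a shared pretrained encoder fine-tuned jointly with one token-level classification head per task (POS, NER, clause boundary). The sketch below illustrates that general pattern; it is not the authors' implementation. The encoder name, label counts, and use of the Hugging Face transformers API are assumptions made for illustration, and the actual HoogBERTa checkpoints and loading code are distributed through the linked repository.

```python
# Sketch of multi-task sequence labeling: one shared pretrained encoder,
# one linear tagging head per task. Encoder name and label counts are
# placeholders, not HoogBERTa's actual configuration.
import torch
import torch.nn as nn
from transformers import AutoModel


class MultiTaskTagger(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base",
                 num_pos=16, num_ner=9, num_clause=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Separate heads share the same contextual representations.
        self.heads = nn.ModuleDict({
            "pos": nn.Linear(hidden, num_pos),
            "ner": nn.Linear(hidden, num_ner),
            "clause": nn.Linear(hidden, num_clause),
        })

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Per-token logits for each task; during multi-objective fine-tuning
        # the per-task cross-entropy losses would be summed.
        return {task: head(states) for task, head in self.heads.items()}
```

In this arrangement the three objectives regularize one another through the shared encoder, which is one plausible reading of why joint fine-tuning can match strong single-task baselines.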