{"title":"基于预训练语言模型的低资源语言长文本分类","authors":"Hailemariam Mehari Yohannes, T. Amagasa","doi":"10.1109/icict58900.2023.00026","DOIUrl":null,"url":null,"abstract":"Text classification is an essential task of Natural Language Processing (NLP) that intends to classify texts into predefined classes. Most recent studies show that transformer-based pre-trained language models such as BERT and RoBERTa have achieved state-of-the-art performance in several downstream NLP tasks. Despite their advantages, these models suffer from one primary drawback of the restricted input size. Because of this limitation, they cannot operate the entire input long texts. This paper presents an approach that utilizes the self-attention mechanism to address the bottleneck of most pre-trained language models of long input texts in the case of Amharic, regarded as a low-resourced language. Specifically, our method carefully investigates the significance of each word in the dataset using a self-attention mechanism. Then identify and select the most relevant words according to their attention scores. Finally, we train our model on the filtered text. Our results show that the approach achieves better performance in terms of accuracy compared to the baseline model.","PeriodicalId":425057,"journal":{"name":"2023 6th International Conference on Information and Computer Technologies (ICICT)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Long Text Classification Using Pre-trained Language Model for a Low-Resource Language\",\"authors\":\"Hailemariam Mehari Yohannes, T. Amagasa\",\"doi\":\"10.1109/icict58900.2023.00026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Text classification is an essential task of Natural Language Processing (NLP) that intends to classify texts into predefined classes. Most recent studies show that transformer-based pre-trained language models such as BERT and RoBERTa have achieved state-of-the-art performance in several downstream NLP tasks. Despite their advantages, these models suffer from one primary drawback of the restricted input size. Because of this limitation, they cannot operate the entire input long texts. This paper presents an approach that utilizes the self-attention mechanism to address the bottleneck of most pre-trained language models of long input texts in the case of Amharic, regarded as a low-resourced language. Specifically, our method carefully investigates the significance of each word in the dataset using a self-attention mechanism. Then identify and select the most relevant words according to their attention scores. Finally, we train our model on the filtered text. 
Our results show that the approach achieves better performance in terms of accuracy compared to the baseline model.\",\"PeriodicalId\":425057,\"journal\":{\"name\":\"2023 6th International Conference on Information and Computer Technologies (ICICT)\",\"volume\":\"47 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 6th International Conference on Information and Computer Technologies (ICICT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/icict58900.2023.00026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 6th International Conference on Information and Computer Technologies (ICICT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icict58900.2023.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Text classification is an essential task in Natural Language Processing (NLP) that aims to assign texts to predefined classes. Recent studies show that transformer-based pre-trained language models such as BERT and RoBERTa achieve state-of-the-art performance on several downstream NLP tasks. Despite their advantages, these models suffer from one primary drawback: a restricted input size. Because of this limitation, they cannot process long input texts in their entirety. This paper presents an approach that uses the self-attention mechanism to address this bottleneck for long input texts in Amharic, a low-resource language. Specifically, our method estimates the significance of each word in the dataset using a self-attention mechanism, then identifies and selects the most relevant words according to their attention scores. Finally, we train our model on the filtered text. Our results show that the approach achieves higher accuracy than the baseline model.
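The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the attention-based filtering idea, not the authors' implementation. The checkpoint name (xlm-roberta-base is used only as a multilingual placeholder), the word-level chunking strategy, the last-layer/head-averaged attention scoring, and the TOP_K value are all illustrative assumptions.

```python
# Sketch: score each word of a long document by the self-attention it receives,
# keep the top-k words, and return a shortened text that fits a 512-token encoder.
# All hyperparameters and the model choice below are assumptions, not the paper's.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "xlm-roberta-base"  # placeholder multilingual encoder (assumption)
MAX_LEN = 512                    # typical BERT/RoBERTa input limit
TOP_K = 400                      # assumed number of words to keep

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
encoder.eval()

@torch.no_grad()
def filter_long_text(text: str, top_k: int = TOP_K) -> str:
    """Keep the top-k words by received self-attention, preserving word order."""
    words = text.split()
    scores = torch.zeros(len(words))

    # Process the document in word-level chunks so each piece fits the encoder
    # (chunking is an assumption; the abstract does not specify how long texts
    # are scored).
    chunk_size = 200
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        enc = tokenizer(chunk, is_split_into_words=True, truncation=True,
                        max_length=MAX_LEN, return_tensors="pt")
        out = encoder(**enc)
        # Average the last layer's attention over heads and query positions:
        # attn[i] ~ how much attention token i receives from the rest of the chunk.
        attn = out.attentions[-1].mean(dim=1).mean(dim=1).squeeze(0)
        for tok_idx, w_id in enumerate(enc.word_ids(batch_index=0)):
            if w_id is not None:
                scores[start + w_id] += attn[tok_idx].item()

    keep = set(torch.topk(scores, k=min(top_k, len(words))).indices.tolist())
    return " ".join(w for i, w in enumerate(words) if i in keep)
```

Under these assumptions, the filtered texts (now short enough for the 512-token limit) would then be used to fine-tune a standard sequence-classification head in place of the original long documents.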