{"title":"Multi-Topic Categorization in a Low-Resource Ewe Language: A Modern Transformer Approach","authors":"V. K. Agbesi, Wenyu Chen, N. Kuadey, G. Maale","doi":"10.1109/icccs55155.2022.9846372","DOIUrl":null,"url":null,"abstract":"The evolution of natural language processing (NLP) recently, paved the way for text categorization. With this mechanism, allocating a large volume of textual data to a category is much easier. This task is more challenging in dealing with multi-topic categorizations in a low-resource language. Transformer-based mechanisms have shown much strength in NLP tasks. However, low-resourced, low-data settings and a lack of benchmark datasets make it difficult to perform any NLP-related task in these extremely low-resource languages with data-points and dataset constraints. In this work, the authors focus on creating a new benchmark dataset for a low-resourced language and performed a multi-topic categorization using this dataset. We further propose an EweBERT model, which is built on the pre-trained transformer model known as Bidirectional Encoder Representations from Transformers (BERT) for multi topic categorization. The EweBERT is used to tokenize and represent the input articles as the initial stage in this system. The output of the EweBERT is then sent into a densely connected neural network, which classifies the articles according to six (6) diverse predefined topics. Experimental results prove that our proposed EweBERT-model records 86.2% accuracy, 85.6% F1-score micro, 85.4% F1-score macro, and F1-score mass of 85.7% compared with 3 benchmarked models.","PeriodicalId":121713,"journal":{"name":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icccs55155.2022.9846372","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Recent advances in natural language processing (NLP) have paved the way for text categorization, making it much easier to allocate large volumes of textual data to categories. The task becomes more challenging when dealing with multi-topic categorization in a low-resource language. Transformer-based mechanisms have shown considerable strength in NLP tasks. However, low-resource, low-data settings and the lack of benchmark datasets make it difficult to perform any NLP-related task in these extremely low-resource languages, given the constraints on data points and datasets. In this work, the authors focus on creating a new benchmark dataset for a low-resource language and perform multi-topic categorization using this dataset. We further propose an EweBERT model, built on the pre-trained transformer model known as Bidirectional Encoder Representations from Transformers (BERT), for multi-topic categorization. EweBERT is used to tokenize and represent the input articles as the initial stage of this system. The output of EweBERT is then fed into a densely connected neural network, which classifies the articles into six (6) diverse predefined topics. Experimental results show that our proposed EweBERT model records 86.2% accuracy, 85.6% micro F1-score, 85.4% macro F1-score, and an F1-score mass of 85.7% compared with three benchmark models.
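The pipeline described in the abstract, a BERT-style encoder that tokenizes and represents each article followed by a densely connected classification head over six predefined topics, can be sketched as follows. This is a minimal illustration only, not the authors' EweBERT: the checkpoint name "bert-base-multilingual-cased", the head architecture, and all hyperparameters are assumptions, since the paper's actual weights and configuration are not given here.

```python
# Hedged sketch of a BERT encoder + dense classification head for six topics.
# Assumptions: multilingual BERT stands in for EweBERT; head sizes are illustrative.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

NUM_TOPICS = 6  # six predefined topics, as stated in the abstract


class TopicClassifier(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        # Pre-trained transformer encoder produces contextual token representations.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Densely connected network mapping the [CLS] representation to topic logits.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, NUM_TOPICS),
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the article representation
        return self.head(cls)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = TopicClassifier()
    batch = tokenizer(["Example Ewe-language article text"], return_tensors="pt",
                      truncation=True, padding=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    predicted_topic = logits.argmax(dim=-1)  # index of the most likely of the 6 topics
    print(predicted_topic)
```

In practice such a head would be fine-tuned end to end with a cross-entropy loss on the labeled articles; the metrics reported in the abstract (accuracy and micro/macro F1) are the standard way to evaluate such a multi-class classifier.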