{"title":"Multi-Topic Categorization in a Low-Resource Ewe Language: A Modern Transformer Approach","authors":"V. K. Agbesi, Wenyu Chen, N. Kuadey, G. Maale","doi":"10.1109/icccs55155.2022.9846372","DOIUrl":null,"url":null,"abstract":"The evolution of natural language processing (NLP) recently, paved the way for text categorization. With this mechanism, allocating a large volume of textual data to a category is much easier. This task is more challenging in dealing with multi-topic categorizations in a low-resource language. Transformer-based mechanisms have shown much strength in NLP tasks. However, low-resourced, low-data settings and a lack of benchmark datasets make it difficult to perform any NLP-related task in these extremely low-resource languages with data-points and dataset constraints. In this work, the authors focus on creating a new benchmark dataset for a low-resourced language and performed a multi-topic categorization using this dataset. We further propose an EweBERT model, which is built on the pre-trained transformer model known as Bidirectional Encoder Representations from Transformers (BERT) for multi topic categorization. The EweBERT is used to tokenize and represent the input articles as the initial stage in this system. The output of the EweBERT is then sent into a densely connected neural network, which classifies the articles according to six (6) diverse predefined topics. Experimental results prove that our proposed EweBERT-model records 86.2% accuracy, 85.6% F1-score micro, 85.4% F1-score macro, and F1-score mass of 85.7% compared with 3 benchmarked models.","PeriodicalId":121713,"journal":{"name":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 7th International Conference on Computer and Communication Systems (ICCCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icccs55155.2022.9846372","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Recent advances in natural language processing (NLP) have paved the way for text categorization, making it much easier to allocate large volumes of textual data to categories. The task becomes more challenging when dealing with multi-topic categorization in a low-resource language. Transformer-based mechanisms have shown considerable strength in NLP tasks. However, low-resource, low-data settings and the lack of benchmark datasets make it difficult to perform any NLP-related task in these extremely low-resource languages, given the constraints on data points and datasets. In this work, the authors focus on creating a new benchmark dataset for a low-resource language and perform multi-topic categorization using this dataset. We further propose an EweBERT model, built on the pre-trained transformer model known as Bidirectional Encoder Representations from Transformers (BERT), for multi-topic categorization. EweBERT is used to tokenize and represent the input articles as the initial stage of this system. The output of EweBERT is then fed into a densely connected neural network, which classifies the articles into six (6) diverse predefined topics. Experimental results show that our proposed EweBERT model records 86.2% accuracy, 85.6% micro F1-score, 85.4% macro F1-score, and an F1-score mass of 85.7% compared with three benchmark models.
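The pipeline described in the abstract, a BERT-style encoder that tokenizes and represents each article followed by a densely connected classification head over six predefined topics, can be sketched as follows. This is a minimal illustration only, not the authors' EweBERT: the checkpoint name "bert-base-multilingual-cased", the head architecture, and all hyperparameters are assumptions, since the paper's actual weights and configuration are not given here.

```python
# Hedged sketch of a BERT encoder + dense classification head for six topics.
# Assumptions: multilingual BERT stands in for EweBERT; head sizes are illustrative.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

NUM_TOPICS = 6  # six predefined topics, as stated in the abstract


class TopicClassifier(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        # Pre-trained transformer encoder produces contextual token representations.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Densely connected network mapping the [CLS] representation to topic logits.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, NUM_TOPICS),
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the article representation
        return self.head(cls)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = TopicClassifier()
    batch = tokenizer(["Example Ewe-language article text"], return_tensors="pt",
                      truncation=True, padding=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    predicted_topic = logits.argmax(dim=-1)  # index of the most likely of the 6 topics
    print(predicted_topic)
```

In practice such a head would be fine-tuned end to end with a cross-entropy loss on the labeled articles; the metrics reported in the abstract (accuracy and micro/macro F1) are the standard way to evaluate such a multi-class classifier.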