Fine-Tuning BERT Models for Multiclass Amharic News Document Categorization

IF 1.7, Zone 4 (Engineering Technology), Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Complexity Pub Date : 2025-01-29 DOI:10.1155/cplx/1884264

Demeke Endalie

Citation count: 0

Abstract

Bidirectional Encoder Representations from Transformers (BERT) models are increasingly used to build natural language processing (NLP) systems, predominantly for English and other European languages. However, because Amharic morphology is complex and models and resources for the language are scarce, BERT is not widely used for Amharic text processing and other NLP applications. This paper describes the fine-tuning of a pretrained BERT model to classify Amharic news documents into news categories. We modified and retrained the model on a custom news document dataset divided into seven key categories. The dataset comprises 2181 distinct Amharic news articles, each consisting of a title, a summary lead, and a full main body. In our experiment, the fine-tuned BERT model achieved 88% accuracy, 88% precision, 87.61% recall, and an 87.59% F1-score. We also compared the fine-tuned model against the baseline models bag-of-words with a multilayer perceptron (MLP), Word2Vec with an MLP, and a fastText classifier, using the identical dataset and preprocessing module. Our model outperformed these baselines in accuracy by 6.3%, 14%, and 8%, respectively. In conclusion, the fine-tuned BERT model shows encouraging results for Amharic news document categorization, surpassing conventional methods. Future research could explore further fine-tuning techniques and larger datasets to enhance performance.
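The four figures reported above (accuracy, precision, recall, F1-score) can be computed for a multiclass classifier as in the minimal sketch below. It assumes macro averaging, meaning the unweighted mean of per-class scores; the abstract does not state which averaging scheme the paper used. The function name `macro_report` and the toy labels are hypothetical and serve only as illustration; they are not the paper's categories or data.

```python
from collections import Counter


def macro_report(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption here; the source abstract does not
    say how its per-class scores were aggregated.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Count true positives, false positives, and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gained a wrong member
            fn[truth] += 1  # true class lost one of its members

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only (hypothetical, not the paper's data).
y_true = ["sport", "sport", "politics", "politics", "health", "health"]
y_pred = ["sport", "politics", "politics", "politics", "health", "sport"]
print(macro_report(y_true, y_pred))
```

With weighted averaging instead, each per-class score would be multiplied by that class's share of the true labels before summing, which can shift the results noticeably when the categories are imbalanced.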

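Source journal

Complexity (multidisciplinary journal; mathematics, interdisciplinary applications)

CiteScore: 5.80

Self-citation rate: 4.30%

Articles published: 595

Review time: >12 weeks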

Journal description: Complexity is a cross-disciplinary journal focusing on the rapidly expanding science of complex adaptive systems. The purpose of the journal is to advance the science of complexity. Articles may deal with such methodological themes as chaos, genetic algorithms, cellular automata, neural networks, and evolutionary game theory. Papers treating applications in any area of natural science or human endeavor are welcome, and especially encouraged are papers integrating conceptual themes and applications that cross traditional disciplinary boundaries. Complexity is not meant to serve as a forum for speculation and vague analogies between words like “chaos,” “self-organization,” and “emergence” that are often used in completely different ways in science and in daily life.