Fine-Tuned Pretrained Transformer for Amharic News Headline Generation

Applied AI Letters | Pub Date: 2024-07-19 | DOI: 10.1002/ail2.98
Mizanu Zelalem Degu, Million Meshesha
{"title":"Fine-Tuned Pretrained Transformer for Amharic News Headline Generation","authors":"Mizanu Zelalem Degu,&nbsp;Million Meshesha","doi":"10.1002/ail2.98","DOIUrl":null,"url":null,"abstract":"<p>Amharic is one of the under-resourced languages, making news headline generation particularly challenging due to the scarcity of high-quality linguistic datasets necessary for training effective natural language processing models. In this study, we fine-tuned the small check point of the T5v1.1 model (t5-small) to perform Amharic news headline generation with an Amharic dataset that is comprised of over 70k news articles along with their headline. Fine-tuning the model involves dataset collection from Amharic news websites, text cleaning, news article size optimization using the TF-IDF algorithm, and tokenization. In addition, a tokenizer model is developed using the byte pair encoding (BPE) algorithm prior to feeding the dataset for feature extraction and summarization. Metrics including Rouge-L, BLEU, and Meteor were used to evaluate the performance of the model and a score of 0.5, 0.24, and 0.71, respectively, was achieved on the test partition of the dataset that contains 7230 instances. The results were good relative to result of the t5 model without fine-tuning, which are 0.1, 0.03, and 0.14, respectively. A postprocessing technique using a rule-based approach was used for further improving summaries generated by the model. The addition of the postprocessing helped the system to achieve Rouge-L, BLEU, and Meteor scores of 0.72, 0.52, and 0.81, respectively. The result value is relatively better than the result achieved by the nonfine-tuned T5v1.1 model and the result of previous studies report on abstractive-based text summarization for Amharic language, which had a 0.27 Rouge-L score. This contributes a valuable insight for practical application and further improvement of the model in the future by increasing the article length, using more training data, using machine learning–based adaptive postprocessing techniques, and fine-tuning other available pretrained models for text summarization.</p>","PeriodicalId":72253,"journal":{"name":"Applied AI letters","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ail2.98","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied AI letters","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/ail2.98","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Amharic is an under-resourced language, which makes news headline generation particularly challenging: the high-quality linguistic datasets needed to train effective natural language processing models are scarce. In this study, we fine-tuned the small checkpoint of the T5v1.1 model (t5-small) for Amharic news headline generation using an Amharic dataset comprising over 70k news articles along with their headlines. Fine-tuning the model involved dataset collection from Amharic news websites, text cleaning, news article size optimization using the TF-IDF algorithm, and tokenization. In addition, a tokenizer model was developed using the byte pair encoding (BPE) algorithm before the dataset was fed in for feature extraction and summarization. Rouge-L, BLEU, and Meteor were used to evaluate the performance of the model, yielding scores of 0.5, 0.24, and 0.71, respectively, on the test partition of the dataset, which contains 7230 instances. These results compare favorably with those of the T5 model without fine-tuning, which are 0.1, 0.03, and 0.14, respectively. A rule-based postprocessing technique was applied to further improve the summaries generated by the model; with postprocessing, the system achieved Rouge-L, BLEU, and Meteor scores of 0.72, 0.52, and 0.81, respectively. These scores exceed both the result of the non-fine-tuned T5v1.1 model and the results reported in previous studies on abstractive text summarization for Amharic, which achieved a Rouge-L score of 0.27. This work contributes valuable insight for practical application and for future improvement of the model by increasing the article length, using more training data, applying machine learning-based adaptive postprocessing techniques, and fine-tuning other available pretrained models for text summarization.
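
The abstract mentions "news article size optimization using the TF-IDF algorithm" but does not spell out the procedure. Below is a minimal sketch of one plausible reading, assuming each sentence is scored by its mean TF-IDF term weight and the highest-scoring sentences are kept, in original order, until a token budget is reached; the function name `shrink_article` and the 512-token budget are illustrative assumptions, not details from the paper.

```python
# A hedged sketch of TF-IDF-based article shortening (the paper does not
# specify its exact method). Sentences are scored by mean TF-IDF weight
# and retained in original order until a token budget is exhausted.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def shrink_article(sentences, max_tokens=512):
    """Keep the most informative sentences within a token budget."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # one row per sentence
    # Mean TF-IDF weight over each sentence's terms (guard against empty rows).
    scores = np.asarray(tfidf.sum(axis=1)).ravel() / (tfidf.getnnz(axis=1) + 1e-9)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    kept, budget = set(), 0
    for i in ranked:
        n_tokens = len(sentences[i].split())
        if budget + n_tokens <= max_tokens:
            kept.add(i)
            budget += n_tokens
    # Reassemble the shortened article in the original sentence order.
    return " ".join(sentences[i] for i in sorted(kept))
```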

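The abstract also reports training a tokenizer with the byte pair encoding (BPE) algorithm before feeding in the dataset. A minimal sketch with the Hugging Face `tokenizers` library follows; the corpus file name, vocabulary size, and special-token set are assumptions, not values reported in the paper.

```python
# Hedged sketch: training a BPE tokenizer on an Amharic news corpus with
# the Hugging Face `tokenizers` library. The file name and vocab size are
# placeholders, not figures from the paper.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,                           # assumed size
    special_tokens=["<pad>", "</s>", "<unk>"],  # T5-style special tokens
)
tokenizer.train(files=["amharic_news_corpus.txt"], trainer=trainer)
tokenizer.save("amharic_bpe_tokenizer.json")
```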

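For the fine-tuning step itself, a hedged sketch using the Hugging Face Transformers `Seq2SeqTrainer` is given below. The data files, column names (`article`, `headline`), and hyperparameters are illustrative; note also that this sketch loads the stock checkpoint tokenizer for brevity, whereas the study trains its own BPE tokenizer.

```python
# Hedged fine-tuning sketch for t5-small (T5v1.1) on article-to-headline
# pairs. Paths, column names, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/t5-v1_1-small"  # the T5v1.1 small checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def preprocess(batch):
    # Tokenize articles as inputs and headlines as labels.
    inputs = tokenizer(batch["article"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["headline"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="amharic-headline-t5",
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```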
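
The rule-based postprocessing that lifts the final scores is not described in the abstract, so the rules below are purely illustrative of the kind of cleanup such a step might perform (whitespace normalization, collapsing immediate word repeats, trimming dangling Ethiopic punctuation); they are not the authors' actual rules.

```python
# Purely illustrative rule-based cleanup for generated headlines; the
# paper's actual rules are not published in the abstract.
import re


def postprocess_headline(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()    # normalize whitespace
    text = re.sub(r"(\S+)( \1)+", r"\1", text)  # collapse immediate word repeats
    return text.rstrip("፡።፣")                   # trim dangling Ethiopic punctuation
```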
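
Finally, the reported Rouge-L, BLEU, and Meteor scores can be reproduced in form with the Hugging Face `evaluate` library, as sketched below; the prediction and reference strings are placeholders, and absolute values depend on the specific metric implementation used.

```python
# Hedged evaluation sketch with the Hugging Face `evaluate` library.
# `predictions`/`references` stand in for model outputs and gold headlines.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

predictions = ["..."]  # generated headlines (placeholder)
references = ["..."]   # gold headlines (placeholder)

print("Rouge-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
# BLEU takes one or more references per prediction.
print("BLEU:", bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"])
print("Meteor:", meteor.compute(predictions=predictions, references=references)["meteor"])
```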