Clickbait Detection for Amharic Language using Deep Learning Techniques

Rajesh Sharma R, Akey Sungheetha, Mesfin Abebe Haile, Arefat Hyeredin Kedir, Rajasekaran A, Charles Babu G
Journal of Machine and Computing. Journal Article. Published 2024-07-05. DOI: 10.53759/7669/jmc202404058 (https://doi.org/10.53759/7669/jmc202404058)

Abstract

Because of the increasing number of Ethiopians actively engaging with the Internet and social media platforms, the incidence of clickbait has become a significant concern. Clickbait, which often uses enticing titles to tempt users into clicking, has become rampant for various reasons, including advertising and revenue generation. However, the Amharic language, spoken by a large population, lacks sufficient NLP resources for addressing this issue. In this study, the authors developed a machine learning model for detecting and classifying clickbait titles in the Amharic language. To facilitate this, the authors prepared the first Amharic clickbait dataset, which contains 53,227 social media posts from well-known sites including Facebook, Twitter, and YouTube. As a baseline, the authors evaluated conventional machine learning methods, namely Random Forest (RF), Logistic Regression (LR), and Support Vector Machines (SVM), with TF-IDF and N-gram feature extraction approaches. Subsequently, the authors investigated the efficacy of two word embedding techniques, word2vec and fastText, with Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) deep learning algorithms. On the test data, the CNN model with fastText word embeddings performed best among the models compared, with 94.27% accuracy and a 94.24% F1 score. The study advances natural language processing for low-resource languages and offers insights into how to counter clickbait content in Amharic.
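
The TF-IDF and N-gram feature extraction used for the baseline can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes whitespace tokenization, word unigrams and bigrams, and the textbook weighting tf * log(N / df), none of which the abstract specifies. The function names `word_ngrams` and `tfidf_vectors` and the English placeholder titles are hypothetical and are not drawn from the dataset.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of length n_min..n_max from a whitespace-tokenized title."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(titles, n_min=1, n_max=2):
    """Build one sparse TF-IDF vector (a dict) per title using tf * log(N / df)."""
    docs = [Counter(word_ngrams(t, n_min, n_max)) for t in titles]
    n_docs = len(docs)

    # Document frequency: in how many titles each n-gram occurs.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())

    vectors = []
    for counts in docs:
        total = sum(counts.values())
        if total == 0:
            # Empty title: no features.
            vectors.append({})
            continue
        vec = {}
        for gram, count in counts.items():
            tf = count / total                 # relative frequency in this title
            idf = math.log(n_docs / df[gram])  # rarer n-grams get larger weights
            vec[gram] = tf * idf
        vectors.append(vec)
    return vectors


# Hypothetical placeholder titles (English stand-ins, not Amharic data).
titles = [
    "you will not believe this",
    "parliament approves new budget",
    "you will love this trick",
]
vectors = tfidf_vectors(titles)
```

Each resulting dictionary maps an n-gram to its weight and could be turned into a fixed-length feature vector for a classifier such as LR, RF, or SVM. Library implementations typically add smoothing and normalization, so their values will differ from this unsmoothed form.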