Transformer Models for Recognizing Abusive Language An investigation and review on Tweeteval and SOLID dataset

Fabeela Ali Rawther, Geevarghese Titus
Published in: 2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT), 2023-04-05. DOI: https://doi.org/10.1109/ICEEICT56924.2023.10157848

Abstract

Social networking communities have become very popular among children and the elderly alike. In this era of social media, comments, opinions, reviews and other communications stream through the most common social media and messaging platforms, such as Twitter, the Meta-owned WhatsApp, Facebook and Instagram, Snapchat, Telegram and YouTube comments. In this paper we review the different methods and models used to identify offensive language on different datasets. Offensive language detection is a tedious task because it is country and language specific, and the corpora used to identify offensiveness and abusiveness do not cover all word usages. We present a comparative study of different methods for detecting whether a text post is offensive or not. The detection of abusive language remains an unsolved and challenging problem for researchers in Natural Language Processing (NLP). Abusive language has become one of the reasons for the increased level of mental instability among people from teenagers to the elderly, and crime via social media has increased considerably compared with earlier years. Studies and surveys show that recognizing the structure and context of the language is the best way to solve this problem to an extent. The paper examines four recent transformer models pretrained and fine-tuned for offensive language detection on the TweetEval dataset, namely DistilBERT, RoBERTa, DistilRoBERTa and DeBERTa. All the models were limited in performance by the size of the training data used, but were optimized by tuning hyperparameters during training. The models are limited to English-language offensive words, and recent work is under way on multilingual tweets in both text and speech processing.
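To make the idea of comparing several classifiers on a binary offensive / not-offensive task concrete, the following is a minimal sketch of how per-model scores could be computed. The abstract does not state which metric the authors used; macro-averaged F1 is assumed here because it is a common choice for imbalanced offensive-language data. The labels, the predictions and the names `model_a` and `model_b` are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score for a single class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy gold labels: 1 = offensive, 0 = not offensive.
gold = [1, 0, 0, 1, 0, 1, 0, 0]

# Hypothetical predictions from two placeholder models (assumption, not paper data).
predictions: Dict[str, Sequence[int]] = {
    "model_a": [1, 0, 0, 1, 0, 0, 0, 0],
    "model_b": [1, 1, 0, 1, 0, 1, 1, 0],
}

scores = {name: macro_f1(gold, pred) for name, pred in predictions.items()}

# Rank the placeholder models from best to worst macro-F1.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: macro-F1 = {score:.3f}")
```

Macro averaging weights the offensive and non-offensive classes equally, so a model that simply predicts the majority class everywhere is penalized; whether this matches the evaluation in the paper is an assumption.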