Gabriel-Razvan Busuioc, Andrei Paraschiv, M. Dascalu
FB-RO-Offense – A Romanian Dataset and Baseline Models for Detecting Offensive Language in Facebook Comments
2022 24th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), published 2022-09-01
DOI: 10.1109/SYNASC57785.2022.00029
Abstract
Over the past decade, social media platforms have become popular worldwide, and some users have seized this opportunity to spread offensive language and hate speech. Platforms that do not apply text filtering are particularly exposed to users who post offensive and abusive language. This paper presents the creation and annotation of FB-RO-Offense, a novel Romanian corpus for offensive language detection. It contains 4,455 organically generated comments from Facebook live broadcasts, annotated both for coarse-grained binary detection and for fine-grained classification based on the degree of offense. We describe the data collection process and the annotation procedure, and we analyze the content of the corpus. Additionally, we report automatic classification results using state-of-the-art models and establish a strong baseline for this new dataset with SVM, BERT-based, and CNN architectures, reaching an F1-score of 0.83 for four-way classification and 0.90 for binary classification.
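To make the two reported settings concrete, the following minimal sketch shows how an F1-score can be computed for a four-way labeling and for the binary task obtained by collapsing the fine-grained labels. Several things here are assumptions rather than details from the paper: the label names (`none`, `mild`, `moderate`, `severe`) are hypothetical, the toy gold and predicted labels are invented, and macro averaging is only one common choice, since the abstract does not state which averaging was used.

```python
from collections import Counter

# Hypothetical label names; the abstract only says the fine-grained scheme
# reflects the degree of offense and does not name the classes.
FINE_LABELS = ["none", "mild", "moderate", "severe"]


def to_binary(label: str) -> str:
    """Collapse a fine-grained label into offensive / not offensive."""
    return "not_offensive" if label == "none" else "offensive"


def f1_per_class(gold, pred, labels):
    """Return {label: F1} from true positives, false positives, false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 (an assumed averaging choice)."""
    scores = f1_per_class(gold, pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data, invented purely for illustration (not taken from the corpus).
gold = ["none", "mild", "severe", "moderate", "none", "severe"]
pred = ["none", "moderate", "severe", "moderate", "mild", "severe"]

four_way = macro_f1(gold, pred, FINE_LABELS)
binary = macro_f1(
    [to_binary(g) for g in gold],
    [to_binary(p) for p in pred],
    ["not_offensive", "offensive"],
)
print(f"four-way macro F1: {four_way:.3f}")
print(f"binary macro F1:   {binary:.3f}")
```

On this toy data the binary score is higher than the four-way score, because confusions between two offensive classes (for example `mild` predicted as `moderate`) stop counting as errors once the labels are collapsed.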