Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text

IF 1.7 3区 计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Shankar Biradar, Sunil Saumya, Arun Chauhan
{"title":"Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text","authors":"Shankar Biradar, Sunil Saumya, Arun Chauhan","doi":"10.1007/s10579-024-09732-0","DOIUrl":null,"url":null,"abstract":"<p>Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations have indicated that these two issues often coexist, with discussions on hate topics frequently being dominated by the fake. Therefore, it has become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, the proposed article introduces a novel data set known as the Faux Hate Multi-Label Data set (FHMLD) comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together both fake and hateful content within a unified framework. Further, the proposed data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate a relation between the presence of fake narratives and its impact on the intensity of the hate, this study presents a statistical analysis using the Chi-square test. The statistical findings indicate that the calculated <span>\\(\\chi ^2\\)</span> value is greater than the value from the standard table, leading to the rejection of the null hypothesis. Additionally, the current study present baseline methods for categorizing multi-class and multi-label data set, utilizing syntactical and semantic features at both word and sentence levels. The experimental results demonstrate that the fastText and SVM based method outperforms others models with an accuracy of 71% and 58% for binary fake–hate and severity prediction respectively.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"38 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09732-0","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations have indicated that these two issues often coexist, with discussions on hate topics frequently being dominated by the fake. Therefore, it has become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, the proposed article introduces a novel data set known as the Faux Hate Multi-Label Data set (FHMLD) comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together both fake and hateful content within a unified framework. Further, the proposed data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate a relation between the presence of fake narratives and its impact on the intensity of the hate, this study presents a statistical analysis using the Chi-square test. The statistical findings indicate that the calculated \(\chi ^2\) value is greater than the value from the standard table, leading to the rejection of the null hypothesis. Additionally, the current study present baseline methods for categorizing multi-class and multi-label data set, utilizing syntactical and semantic features at both word and sentence levels. The experimental results demonstrate that the fastText and SVM based method outperforms others models with an accuracy of 71% and 58% for binary fake–hate and severity prediction respectively.

Abstract Image

虚假仇恨:揭开传播仇恨故事的虚假叙事之网:跨语言印地语-英语代码混合文本中的多标签和多类别数据集
不可否认,社交媒体改变了人们的交流方式,但同时也带来了毋庸置疑的弊端,尤其是虚假和仇恨言论的泛滥。最近的观察表明,这两个问题经常并存,关于仇恨话题的讨论经常被虚假言论所主导。因此,探讨虚假叙事在当代仇恨传播中的作用已成为当务之急。在这一方向上,本文提出了一个名为 "虚假仇恨多标签数据集"(FHMLD)的新数据集,该数据集由 8014 条印地语-英语混合编码文本中的虚假仇恨评论组成。据我们所知,这是首次将虚假内容和仇恨内容整合到一个统一的框架中。此外,所提议的数据集收集自 YouTube 和 Twitter 等不同平台,以减少用户相关偏见。为了研究虚假叙事的存在及其对仇恨强度的影响之间的关系,本研究使用卡方检验法(Chi-square test)进行了统计分析。统计结果表明,计算出的(\chi ^2\)值大于标准表中的值,从而拒绝了零假设。此外,本研究还提出了利用单词和句子层面的句法和语义特征对多类别和多标签数据集进行分类的基线方法。实验结果表明,基于 fastText 和 SVM 的方法在二元假憎预测和严重性预测方面的准确率分别为 71% 和 58%,优于其他模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Language Resources and Evaluation
Language Resources and Evaluation 工程技术-计算机:跨学科应用
CiteScore
6.50
自引率
3.70%
发文量
55
审稿时长
>12 weeks
期刊介绍: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信