RACHNA: Racial hoax code mixed Hindi–English with novel language augmentation

Authors: Shanu SidharthKumar Dhawale, Rahul Ponnusamy, Prasanna Kumar Kumaresan, Sajeetha Thavareesan, Saranya Rajiakodi, Bharathi Raja Chakravarthi
DOI: 10.1016/j.nlp.2025.100183
Journal: Natural Language Processing Journal, Volume 13, Article 100183
Published: 2025-09-16
URL: https://www.sciencedirect.com/science/article/pii/S2949719125000597
Citations: 0
Abstract
Warning: This paper contains derogatory language that may be offensive to some readers. As a type of misinformation, hoaxes seek to spread false information in order to gain popularity on social media. Racial hoaxes are a particularly harmful kind of hoax because they falsely link individuals or groups to crimes or incidents. Detecting them involves the nuanced challenges of identifying false accusations, fabrications, and stereotypes that wrongly attribute negative actions to other social, ethnic, or out-groups. At the same time, social media comments frequently mix several languages, are often written in scripts that are not native to the user, and rarely adhere to strict grammar norms. The lack of annotated racial-hoax data for code-mixed, low-resource settings such as Hindi–English makes this problem even more challenging. To address this, we collected 210,768 sentences and built the HoaxMixPlus corpus, a racial-hoax-annotated, code-mixed collection of 5,105 Hindi–English YouTube comments. We describe how the corpus was built, how each comment was assigned a binary label indicating the presence of a racial hoax, and the resulting inter-annotator agreement; the corpus fills a critical gap in understanding and combating racialized misinformation. We report an analysis of the corpus, benchmark training results on it, and new methodologies, including a dictionary-based approach that correctly identifies code-mixed words as well as novel language-augmentation strategies such as transliteration and language tags. We evaluate several models on this dataset and demonstrate that our augmentation strategies lead to consistent performance gains.
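The abstract mentions a dictionary-based approach to identifying code-mixed words and augmentation via transliteration and language tags, but does not spell out the pipeline. The following is a minimal sketch of one plausible reading of those ideas; the lexicon, the transliteration map, and the `/HI`–`/EN` tag format are all illustrative assumptions, not the paper's actual implementation (a real system would use a large lexicon and a proper transliterator).

```python
# Illustrative sketch (assumption): toy versions of dictionary-based language
# identification and the two augmentation strategies named in the abstract.

# Toy lexicon of romanized Hindi words; in practice this would be a large
# dictionary, which is what makes the approach "dictionary-based".
HINDI_LEXICON = {"kya", "hai", "nahi", "bhai", "yeh", "sach"}

def tag_languages(tokens):
    """Language-tag augmentation: append a tag to each token via lookup."""
    return [f"{tok}/HI" if tok.lower() in HINDI_LEXICON else f"{tok}/EN"
            for tok in tokens]

# Toy romanized-Hindi -> Devanagari map standing in for a real transliteration
# tool; unknown tokens are left unchanged.
TRANSLIT = {"kya": "क्या", "hai": "है", "sach": "सच"}

def transliterate(tokens):
    """Transliteration augmentation: rewrite known Hindi tokens in Devanagari."""
    return [TRANSLIT.get(tok.lower(), tok) for tok in tokens]

comment = "kya yeh sach hai this is fake news".split()
print(tag_languages(comment))   # each token tagged /HI or /EN
print(transliterate(comment))   # romanized Hindi rendered in Devanagari
```

Both functions produce augmented variants of the same comment, which can then be added to the training data alongside the original, the intuition being that explicit language signals help models cope with script and language mixing.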