RACHNA: Racial hoax code mixed Hindi–English with novel language augmentation

Authors: Shanu SidharthKumar Dhawale, Rahul Ponnusamy, Prasanna Kumar Kumaresan, Sajeetha Thavareesan, Saranya Rajiakodi, Bharathi Raja Chakravarthi
DOI: 10.1016/j.nlp.2025.100183
Journal: Natural Language Processing Journal, Volume 13, Article 100183
Published: 2025-09-16
URL: https://www.sciencedirect.com/science/article/pii/S2949719125000597
Citations: 0
Abstract
Warning: This paper contains derogatory language that may be offensive to some readers. As a type of misinformation, hoaxes seek to spread false information in order to gain popularity on social media. Racial hoaxes are a particularly harmful kind of hoax because they falsely link individuals or groups to crimes or incidents. Detecting them involves the nuanced challenges of identifying false accusations, fabrications, and stereotypes that wrongly attribute negative actions to other social, ethnic, or out-groups. At the same time, social media comments frequently mix several languages, are often written in scripts that are not native to the user, and rarely adhere to strict grammar norms. The lack of annotated racial-hoax data for code-mixed, low-resource settings such as Hindi–English makes this problem even more challenging. To address this, we collected 210,768 sentences and built the HoaxMixPlus corpus, a racial-hoax-annotated, code-mixed collection of 5,105 Hindi–English YouTube comments. We describe how the corpus was built, how each comment was assigned a binary label indicating the presence of a racial hoax, and the resulting inter-annotator agreement; the corpus fills a critical gap in understanding and combating racialized misinformation. We report an analysis of the corpus, benchmark training results on it, and new methodologies, including a dictionary-based approach that correctly identifies code-mixed words as well as novel language-augmentation strategies such as transliteration and language tags. We evaluate several models on this dataset and demonstrate that our augmentation strategies lead to consistent performance gains.
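The abstract mentions a dictionary-based approach to identifying code-mixed words and augmentation via transliteration and language tags, but does not spell out the pipeline. The following is a minimal sketch of one plausible reading of those ideas; the lexicon, the transliteration map, and the `/HI`–`/EN` tag format are all illustrative assumptions, not the paper's actual implementation (a real system would use a large lexicon and a proper transliterator).

```python
# Illustrative sketch (assumption): toy versions of dictionary-based language
# identification and the two augmentation strategies named in the abstract.

# Toy lexicon of romanized Hindi words; in practice this would be a large
# dictionary, which is what makes the approach "dictionary-based".
HINDI_LEXICON = {"kya", "hai", "nahi", "bhai", "yeh", "sach"}

def tag_languages(tokens):
    """Language-tag augmentation: append a tag to each token via lookup."""
    return [f"{tok}/HI" if tok.lower() in HINDI_LEXICON else f"{tok}/EN"
            for tok in tokens]

# Toy romanized-Hindi -> Devanagari map standing in for a real transliteration
# tool; unknown tokens are left unchanged.
TRANSLIT = {"kya": "क्या", "hai": "है", "sach": "सच"}

def transliterate(tokens):
    """Transliteration augmentation: rewrite known Hindi tokens in Devanagari."""
    return [TRANSLIT.get(tok.lower(), tok) for tok in tokens]

comment = "kya yeh sach hai this is fake news".split()
print(tag_languages(comment))   # each token tagged /HI or /EN
print(transliterate(comment))   # romanized Hindi rendered in Devanagari
```

Both functions produce augmented variants of the same comment, which can then be added to the training data alongside the original, the intuition being that explicit language signals help models cope with script and language mixing.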