CHUNAV: Analyzing Hindi Hate Speech and Targeted Groups in Indian Election Discourse

IF 1.8 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2024-05-16 DOI:10.1145/3665245

F. Jafri, Kritesh Rauniyar, Surendrabikram Thapa, Mohammad Aman Siddiqui, Matloob Khushi, Usman Naseem

{"title":"CHUNAV: Analyzing Hindi Hate Speech and Targeted Groups in Indian Election Discourse","authors":"F. Jafri, Kritesh Rauniyar, Surendrabikram Thapa, Mohammad Aman Siddiqui, Matloob Khushi, Usman Naseem","doi":"10.1145/3665245","DOIUrl":null,"url":null,"abstract":"\n In the ever-evolving landscape of online discourse and political dialogue, the rise of hate speech poses a significant challenge to maintaining a respectful and inclusive digital environment. The context becomes particularly complex when considering the Hindi language—a low-resource language with limited available data. To address this pressing concern, we introduce the\n CHUNAV\n dataset—a collection of 11,457 Hindi tweets gathered during assembly elections in various states.\n CHUNAV\n is purpose-built for hate speech categorization and the identification of target groups. The dataset is a valuable resource for exploring hate speech within the distinctive socio-political context of Indian elections. The tweets within\n CHUNAV\n have been meticulously categorized into “Hate” and “Non-Hate” labels, and further subdivided to pinpoint the specific targets of hate speech, including “Individual”, “Organization”, and “Community” labels (as shown in Figure 1). Furthermore, this paper presents multiple benchmark models for hate speech detection, along with an innovative ensemble and oversampling-based method. The paper also delves into the results of topic modeling, all aimed at effectively addressing hate speech and target identification in the Hindi language. This contribution seeks to advance the field of hate speech analysis and foster a safer and more inclusive online space within the distinctive realm of Indian Assembly Elections.\n","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3665245","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

In the ever-evolving landscape of online discourse and political dialogue, the rise of hate speech poses a significant challenge to maintaining a respectful and inclusive digital environment. The context becomes particularly complex when considering the Hindi language—a low-resource language with limited available data. To address this pressing concern, we introduce the CHUNAV dataset—a collection of 11,457 Hindi tweets gathered during assembly elections in various states. CHUNAV is purpose-built for hate speech categorization and the identification of target groups. The dataset is a valuable resource for exploring hate speech within the distinctive socio-political context of Indian elections. The tweets within CHUNAV have been meticulously categorized into “Hate” and “Non-Hate” labels, and further subdivided to pinpoint the specific targets of hate speech, including “Individual”, “Organization”, and “Community” labels (as shown in Figure 1). Furthermore, this paper presents multiple benchmark models for hate speech detection, along with an innovative ensemble and oversampling-based method. The paper also delves into the results of topic modeling, all aimed at effectively addressing hate speech and target identification in the Hindi language. This contribution seeks to advance the field of hate speech analysis and foster a safer and more inclusive online space within the distinctive realm of Indian Assembly Elections.

查看原文本刊更多论文

CHUNAV：分析印度选举言论中的印地语仇恨言论和目标群体

在不断变化的网络言论和政治对话环境中，仇恨言论的兴起对维护尊重和包容的数字环境构成了重大挑战。如果考虑到印地语--一种可用数据有限的低资源语言，情况就会变得尤为复杂。为了解决这一迫切问题，我们引入了 CHUNAV 数据集--一个在各邦议会选举期间收集到的 11,457 条印地语推文的集合。CHUNAV 专门用于仇恨言论分类和目标群体识别。该数据集是在印度选举的独特社会政治背景下探索仇恨言论的宝贵资源。CHUNAV 中的推文被细致地分为 "仇恨 "和 "非仇恨 "两个标签，并进一步细分以确定仇恨言论的具体目标，包括 "个人"、"组织 "和 "社区 "标签（如图 1 所示）。此外，本文还介绍了用于仇恨言论检测的多个基准模型，以及一种创新的基于集合和超采样的方法。本文还深入探讨了主题建模的结果，所有这些都旨在有效解决印地语中的仇恨言论和目标识别问题。本文旨在推动仇恨言论分析领域的发展，并在印度议会选举这一独特的领域内营造一个更安全、更具包容性的网络空间。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Asian and Low-Resource Language Information Processing Computer Science-General Computer Science

CiteScore

3.60

自引率

15.00%

发文量

241

期刊介绍： The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to: -Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc. -Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc. -Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition. -Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc. -Machine Translation involving Asian or low-resource languages. -Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc. -Information Extraction and Filtering: including automatic abstraction, user profiling, etc. -Speech processing: including text-to-speech synthesis and automatic speech recognition. -Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc. -Cross-lingual information processing involving Asian or low-resource languages. -Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.