Homophobia and transphobia span identification in low-resource languages

Prasanna Kumar Kumaresan, Devendra Deepak Kayande, Ruba Priyadharshini, Paul Buitelaar, Bharathi Raja Chakravarthi
Journal: Natural Language Processing Journal, Volume 12, Article 100169
DOI: 10.1016/j.nlp.2025.100169
Published: 2025-06-24
URL: https://www.sciencedirect.com/science/article/pii/S2949719125000457
Citations: 0

Abstract

Online platforms have become prevalent because they promote free speech and group discussions. However, they also serve as venues for hate speech, which can negatively impact the psychological well-being of vulnerable people. This is especially true for members of the LGBTQ+ community, who are often the targets of homophobia and transphobia in online environments. Our study makes three main contributions: (1) we developed a new dataset with span-level annotations for homophobia and transphobia in Tamil, English, and Marathi; (2) we employed advanced language models using BERT-based architectures, Conditional Random Field (CRF), and Bidirectional Long Short-Term Memory (BiLSTM) layers to enhance span-level detection of harmful content; and (3) we conducted benchmarking to evaluate the effectiveness of monolingual and multilingual models in detecting subtle forms of hate speech. The annotated dataset, collected from real-world social media (YouTube) content, provides diverse language contexts and improves the representation of low-resource languages. The span-based detection approach enables models to capture subtle linguistic nuances, leading to more precise content moderation that accounts for cultural differences. The experimental results show that our models achieve effective span detection, which provides valuable information for creating inclusive moderation tools. Our research supports the development of AI systems that reduce the burden on moderators and improve the quality of online experiences for vulnerable members of the LGBTQ+ community.
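As a minimal illustration of the span-identification setup the abstract describes, sequence-labeling models such as BERT–BiLSTM–CRF typically emit one tag per token, which is then decoded into labeled spans. The sketch below assumes a standard BIO tagging scheme and hypothetical label names (`HOMOPHOBIA`); the paper's exact label inventory and decoding rules are not stated in the abstract.

```python
# Decode BIO tag sequences into labeled spans.
# BIO tagging is a common encoding for span identification; whether this
# dataset uses BIO exactly, and the label names, are assumptions here.

def bio_to_spans(tokens, tags):
    """Return (label, start, end_exclusive) token spans from parallel lists."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag closes any open span and opens a new one.
            if start is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and label == tag[2:]:
            # Continue the current span.
            continue
        else:
            # "O" or an inconsistent I- tag closes any open span.
            if start is not None:
                spans.append((label, start, i))
            start, label = None, None
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans

tokens = ["you", "people", "are", "disgusting", "freaks"]
tags = ["O", "O", "O", "B-HOMOPHOBIA", "I-HOMOPHOBIA"]
print(bio_to_spans(tokens, tags))  # → [('HOMOPHOBIA', 3, 5)]
```

The CRF layer in such an architecture constrains transitions (e.g. forbidding `I-X` directly after `O`), so a decoder like this can assume mostly well-formed tag sequences while still handling stray tags defensively.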