ALERT: A benchmark Bengali dataset for identifying and categorizing religiously aggressive texts

Impact factor: 1.4 | JCR Q3 | Multidisciplinary Sciences

Data in Brief | Pub Date: 2025-09-19 | DOI: 10.1016/j.dib.2025.112094

Suhana Binta Rashid, Bibhas Roy Chowdhury Piyas, Sadia Rahman, Bijoy Roy Chowdhury Preenon

Data in Brief, Volume 63, Article 112094. Full text: https://www.sciencedirect.com/science/article/pii/S2352340925008169

Citations: 0

Abstract


The widespread proliferation of religiously aggressive content on social media platforms poses a significant threat to societal harmony and communal solidarity. Such content often incites religious animosity, provokes violence, and disseminates life-threatening messages that intensify societal divisions and undermine social harmony. Despite significant advances in identifying such content in high-resource languages like English, resources for regional languages like Bengali remain scarce, which constrains the development of effective detection and prevention tools. To address this gap, we introduce ALERT (Analysis of Linguistic Extremism in Religious Texts), a newly developed Bengali dataset with English translations that contains 4027 annotated instances classified into four categories: hate speech (995), vandalism (909), atrocity (1117), and no aggression (1006). The dataset was sourced from several online platforms, including Facebook, YouTube, online news websites, blogs, and group chats. Each instance was annotated by two annotators drawn from a pool of four with diverse religious, ethnic, geographical, and academic backgrounds. Conflicts or disagreements between annotators were resolved through consultation with a domain expert. Preprocessing includes the removal of English words, duplicate entries, and alphanumeric characters to ensure data integrity. The dataset attains a Cohen's kappa score of 72%, which signifies strong inter-annotator agreement, and a Jaccard similarity score between 16% and 23%, which reflects the degree of overlap between classes. Moreover, experiments with various machine learning, deep learning, and transformer-based models yield promising results. ALERT serves as a benchmark dataset for religiously aggressive text classification that may contribute to the advancement of research in this field. The dataset is publicly accessible for research purposes to promote innovation and collaboration within the Bengali NLP community.
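
As a rough illustration of the preprocessing and agreement measures named in the abstract, the following Python sketch shows one plausible way to compute them. It is not the authors' code: the function names, the regular expression, and the reading of the Jaccard score as vocabulary overlap between two classes are all assumptions made for this example, and the abstract does not specify whether Bengali digits were also removed.

```python
import re
from collections import Counter

# Assumption: "English words and alphanumeric characters" means Latin
# letters and ASCII digits; Bengali-script characters are left untouched.
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9]+")


def clean_text(text):
    """Remove Latin letters and ASCII digits, then collapse whitespace."""
    return " ".join(LATIN_OR_DIGIT.sub(" ", text).split())


def deduplicate(texts):
    """Keep the first occurrence of each non-empty text, preserving order."""
    seen = set()
    unique = []
    for t in texts:
        if t and t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical toy labels, not taken from the ALERT dataset.
    annotator_1 = ["hate", "hate", "none", "none"]
    annotator_2 = ["hate", "none", "none", "none"]
    print(cohen_kappa(annotator_1, annotator_2))
    print(jaccard(["a", "b", "c"], ["b", "c", "d"]))
```

A kappa of 0.72 corresponds to the 72% figure quoted above. Under this reading, the Jaccard score would be computed once per pair of classes over their token vocabularies, which would explain why the abstract reports a range rather than a single value.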


Source journal: Data in Brief (Multidisciplinary Sciences)

CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days

Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.

Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.