Suhana Binta Rashid, Bibhas Roy Chowdhury Piyas, Sadia Rahman, Bijoy Roy Chowdhury Preenon
Data in Brief, volume 63, Article 112094. Published 2025-09-19. DOI: 10.1016/j.dib.2025.112094
https://www.sciencedirect.com/science/article/pii/S2352340925008169
ALERT: A benchmark Bengali dataset for identifying and categorizing religiously aggressive texts
The widespread proliferation of religiously aggressive content on social media platforms poses significant threats to societal harmony and communal solidarity. Such content often incites religious animosity, provokes violence, and disseminates life-threatening messages that intensify societal divisions. Despite significant advances in identifying such content in high-resource languages like English, resources remain scarce for regional languages like Bengali, which constrains the development of effective detection and prevention tools. To address this gap, we introduce ALERT (Analysis of Linguistic Extremism in Religious Texts), a newly developed Bengali dataset with English translations that comprises 4027 annotated instances classified into four categories: hate speech (995), vandalism (909), atrocity (1117), and no aggression (1006). The data were collected from several online platforms, including Facebook, YouTube, online news websites, blogs, and group chats. Each instance was annotated by two annotators drawn from a pool of four with diverse religious, ethnic, geographical, and academic backgrounds. Disagreements between annotators were resolved through consultation with a domain expert. Preprocessing removed English words, duplicates, and alphanumeric characters to ensure data integrity. The dataset attains a Cohen’s kappa score of 72%, which signifies strong inter-annotator agreement, and Jaccard similarity scores between 16% and 23%, which reflect the degree of overlap between classes. Moreover, experiments with various machine learning, deep learning, and transformer-based models yield promising results. ALERT serves as a benchmark dataset for religiously aggressive text classification that may contribute to the advancement of research in this field. The dataset is publicly accessible for research purposes to promote innovation and collaboration within the Bengali NLP community.
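For readers unfamiliar with the two statistics quoted in the abstract, the following is a minimal sketch of how Cohen's kappa and Jaccard similarity are commonly computed. It is not the authors' code: the function names, the toy labels, and the choice to compute Jaccard over sets of tokens are assumptions made for illustration. Only the class counts are taken from the abstract.

```python
from collections import Counter

# Class sizes as reported in the abstract.
CLASS_COUNTS = {
    "hate speech": 995,
    "vandalism": 909,
    "atrocity": 1117,
    "no aggression": 1006,
}
TOTAL_INSTANCES = sum(CLASS_COUNTS.values())


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A & B| / |A | B| between two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical toy annotations, not drawn from ALERT.
annotator_1 = ["hate", "hate", "vandalism", "atrocity", "none", "none"]
annotator_2 = ["hate", "vandalism", "vandalism", "atrocity", "none", "hate"]

if __name__ == "__main__":
    print("total instances:", TOTAL_INSTANCES)
    print("kappa:", round(cohens_kappa(annotator_1, annotator_2), 4))
    print("jaccard:", jaccard(["a", "b", "c"], ["b", "c", "d"]))
```

On this reading, a kappa near 1 means the annotators agree far more often than chance would predict, while a low Jaccard score between two classes means their vocabularies share few tokens.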
About the journal:
Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.