{"title":"TARGE: large language model-powered explainable hate speech detection.","authors":"Muhammad Haseeb Hashir, Memoona, Sung Won Kim","doi":"10.7717/peerj-cs.2911","DOIUrl":null,"url":null,"abstract":"<p><p>The proliferation of user-generated content on social networking sites has intensified the challenge of accurately and efficiently detecting inflammatory and discriminatory speech at scale. Traditional manual moderation methods are impractical due to the sheer volume and complexity of online discourse, necessitating automated solutions. However, existing deep learning models for hate speech detection typically function as black-box systems, providing binary classifications without interpretable insights into their decision-making processes. This opacity significantly limits their practical utility, particularly in nuanced content moderation tasks. To address this challenge, our research explores leveraging the advanced reasoning and knowledge integration capabilities of state-of-the-art language models, specifically Mistral-7B, to develop transparent hate speech detection systems. We introduce a novel framework wherein large language models (LLMs) generate explicit rationales by identifying and analyzing critical textual features indicative of hate speech. These rationales are subsequently integrated into specialized classifiers designed to perform explainable content moderation. We rigorously evaluate our methodology on multiple benchmark English-language social media datasets. Results demonstrate that incorporating LLM-generated explanations significantly enhances both the interpretability and accuracy of hate speech detection. This approach not only identifies problematic content effectively but also clearly articulates the analytical rationale behind each decision, fulfilling the critical demand for transparency in automated content moderation.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e2911"},"PeriodicalIF":3.5000,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12192871/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2911","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
The proliferation of user-generated content on social networking sites has intensified the challenge of accurately and efficiently detecting inflammatory and discriminatory speech at scale. Traditional manual moderation methods are impractical due to the sheer volume and complexity of online discourse, necessitating automated solutions. However, existing deep learning models for hate speech detection typically function as black-box systems, providing binary classifications without interpretable insights into their decision-making processes. This opacity significantly limits their practical utility, particularly in nuanced content moderation tasks. To address this challenge, our research explores leveraging the advanced reasoning and knowledge integration capabilities of state-of-the-art language models, specifically Mistral-7B, to develop transparent hate speech detection systems. We introduce a novel framework wherein large language models (LLMs) generate explicit rationales by identifying and analyzing critical textual features indicative of hate speech. These rationales are subsequently integrated into specialized classifiers designed to perform explainable content moderation. We rigorously evaluate our methodology on multiple benchmark English-language social media datasets. Results demonstrate that incorporating LLM-generated explanations significantly enhances both the interpretability and accuracy of hate speech detection. This approach not only identifies problematic content effectively but also clearly articulates the analytical rationale behind each decision, fulfilling the critical demand for transparency in automated content moderation.
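The rationale-then-classify pipeline described above (an LLM such as Mistral-7B produces an explicit rationale, which a downstream classifier then consumes alongside the original post) can be sketched roughly as follows. This is a minimal illustration using the Hugging Face transformers pipeline API; the checkpoint names, the prompt wording, and the way the rationale is concatenated into the classifier input are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of an explain-then-classify hate speech pipeline.
# Checkpoint names, prompt, and input format are illustrative assumptions.
from transformers import pipeline

# Step 1: an instruction-tuned LLM (the paper uses Mistral-7B) generates a
# short rationale naming the textual features that suggest hate speech.
rationale_generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed checkpoint
)

# Step 2: a fine-tuned classifier consumes the post together with the
# generated rationale and outputs the final decision.
classifier = pipeline(
    "text-classification",
    model="your-org/rationale-aware-hate-classifier",  # hypothetical checkpoint
)

def moderate(post: str) -> dict:
    prompt = (
        "Identify the words or phrases in the following post that indicate "
        f"hate speech, and briefly explain why.\n\nPost: {post}\n\nRationale:"
    )
    rationale = rationale_generator(prompt, max_new_tokens=128)[0]["generated_text"]
    # Concatenating the rationale with the post is one simple way to
    # "integrate" the LLM explanation into the classifier's input.
    decision = classifier(f"{post} [SEP] {rationale}")[0]
    return {
        "label": decision["label"],
        "score": decision["score"],
        "rationale": rationale,
    }

print(moderate("Example social media post goes here."))
```

In this sketch the explanation is exposed to the end user as the `rationale` field, which is what gives the moderation decision its transparency; other integration strategies (e.g., feeding the rationale through a separate encoder) would fit the same framework.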
About the journal:
PeerJ Computer Science is an open access journal covering all subject areas in computer science, backed by a prestigious advisory board and more than 300 academic editors.