Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation

ACM Journal on Computing and Sustainable Societies. Pub Date: 2023-10-12. DOI: 10.1145/3625679

Aman Khullar, Daniel Nkemelu, Cuong V. Nguyen, Michael L. Best


Citations: 0

Abstract


A growing body of work has focused on text classification methods for detecting the increasing amount of hate speech posted online. This progress has been limited to a select number of highly resourced languages, causing detection systems to either underperform or not exist in limited data contexts. This is largely caused by a lack of training data, which is expensive to collect and curate in these settings. In this work, we propose a data augmentation approach that addresses the lack of data for online hate speech detection in limited data contexts using synthetic data generation techniques. Given a handful of hate speech examples in a high-resource language such as English, we present three methods to synthesize new examples of hate speech data in a target language that retain the hate sentiment of the original examples but transfer the hate targets. We apply our approach to generate training data for hate speech classification tasks in Hindi and Vietnamese. Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain. This method can be adopted to bootstrap hate speech detection models from scratch in limited data contexts. As the growth of social media within these contexts continues to outstrip response efforts, this work furthers our capacities for detection, understanding, and response to hate speech. Disclaimer: This work contains terms that are offensive and hateful. These, however, cannot be avoided due to the nature of the work.

