Exploring Data Augmentation for Gender-Based Hate Speech Detection

Journal of Computer Science. Pub Date: 2023-10-01. DOI: 10.3844/jcssp.2023.1222.1230

Muhammad Amien Ibrahim, Samsul Arifin, Eko Setyo Purwanto


Citations: 0

Abstract

Social media moderation is crucial for establishing healthy online communities and protecting users from hate speech and offensive language. In many cases, hate speech targets a specific gender and may be expressed in many different languages on social media platforms such as Indonesian Twitter. However, data scarcity and class imbalance in gender-based hate speech datasets of Indonesian tweets have slowed the development and implementation of automatic social media moderation. Obtaining more samples can be costly in terms of the resources required to gather and annotate the data. This study examines data augmentation methods that increase the size of a textual dataset while maintaining the quality of the augmented data. Three augmentation strategies are explored: random insertion, back translation, and a sequential combination of back translation and random insertion. The study also examines whether the augmented data preserve their original labels. The results show that classification models trained on augmented data generated by random insertion outperform those trained with the other approaches. In terms of label preservation, all three augmentation approaches are shown to preserve labels sufficiently without compromising the meaning of the augmented data. The findings suggest that, by enlarging the dataset while preserving the original labels, data augmentation can help address data scarcity and dataset imbalance.
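The three augmentation strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which lexicon, translation system, or insertion rule was used. The sketch assumes one common formulation of random insertion (insert a synonym of a word already in the sentence at a random position), uses a hypothetical toy synonym table, and treats the two translation steps as caller-supplied functions (`to_pivot`, `from_pivot`) rather than calling any real translation API. All function names and labels are illustrative.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy synonym table, for illustration only. A real system would
# need an Indonesian lexicon or embedding neighbours; none is specified here.
SYNONYMS: Dict[str, List[str]] = {
    "bad": ["awful", "terrible"],
    "woman": ["lady"],
}

Translator = Callable[[str], str]


def random_insertion(text: str, n: int, synonyms: Dict[str, List[str]],
                     rng: random.Random) -> str:
    """Insert n synonyms of words already in `text` at random positions.

    Returns `text` unchanged when no word has a non-empty synonym entry.
    """
    tokens = text.split()
    candidates = [t.lower() for t in tokens if synonyms.get(t.lower())]
    if not candidates:
        return text
    for _ in range(n):
        source = rng.choice(candidates)
        new_word = rng.choice(synonyms[source])
        # randint is inclusive on both ends, so the word may also be appended.
        tokens.insert(rng.randint(0, len(tokens)), new_word)
    return " ".join(tokens)


def back_translate(text: str, to_pivot: Translator,
                   from_pivot: Translator) -> str:
    """Translate into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(text))


def sequential_augment(text: str, to_pivot: Translator, from_pivot: Translator,
                       n: int, synonyms: Dict[str, List[str]],
                       rng: random.Random) -> str:
    """Back translation first, then random insertion on the paraphrase."""
    paraphrase = back_translate(text, to_pivot, from_pivot)
    return random_insertion(paraphrase, n, synonyms, rng)


def augment_dataset(samples: List[Tuple[str, str]],
                    augment: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return the originals plus one augmented copy of each sample.

    Each augmented copy inherits the label of its source sample; whether
    that label still fits is what a label-preservation check has to verify.
    """
    return samples + [(augment(text), label) for text, label in samples]


if __name__ == "__main__":
    rng = random.Random(0)

    def identity(s: str) -> str:
        # Placeholder standing in for a real translation model.
        return s

    data = [("that woman is bad", "hate"), ("good morning", "not_hate")]
    augmented = augment_dataset(
        data,
        lambda s: sequential_augment(s, identity, identity, 1, SYNONYMS, rng),
    )
    for text, label in augmented:
        print(label, "|", text)
```

Because every augmented sample simply copies its source label, the sketch makes the label-preservation question concrete: an insertion or paraphrase that changes the meaning of a tweet would silently produce a mislabeled training example, which is why the study checks label preservation separately from classification performance.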


Source journal

Journal of Computer Science (Computer Science: Computer Networks and Communications)

CiteScore

1.70

Self-citation rate

0.00%


About the journal: Journal of Computer Science aims to publish research articles on the theoretical foundations of information and computation, and on practical techniques for their implementation and application in computer systems. JCS is updated twelve times a year and is a peer-reviewed journal covering the latest and most compelling research of the time.