Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

Q3 Social Sciences

South African Computer Journal. Pub Date: 2020-12-08. DOI: 10.18489/sacj.v32i2.847

O. Oriola, E. Kotzé


Citations: 3


Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, existing semi-supervised learning methods have been skewed towards small amounts of labelled data with small feature spaces. This paper therefore presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks is evaluated. The proposed technique, with an accuracy of 0.97 and an F1-score of 0.95, outperforms existing semi-supervised learning techniques. The learning curves show that the proposed technique uses the training data more efficiently than existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
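The labelling idea in the abstract (majority voting over different feature sets of labelled and unlabelled data clusters) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the clustering algorithm, the exact voting rule, or the handling of ties, so the function names `cluster_majority` and `vote_labels`, the strict-majority rule, and the toy data are all hypothetical. Cluster ids are assumed to have been computed beforehand, once per feature set, over labelled and unlabelled tweets together.

```python
from collections import Counter


def cluster_majority(cluster_ids, labels):
    """Map each cluster id to the most common label among its labelled members."""
    counts = {}
    for cid, lab in zip(cluster_ids, labels):
        if lab is not None:
            counts.setdefault(cid, Counter())[lab] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


def vote_labels(views, labels):
    """Propose labels for unlabelled items by voting across feature-set views.

    views:  one sequence of cluster ids per feature set, aligned with `labels`.
    labels: a known label per item, or None for an unlabelled item.
    Returns {index: label} for items where one label wins a strict
    majority of the views (an assumed rule; other rules are possible).
    """
    per_view = [cluster_majority(v, labels) for v in views]
    proposed = {}
    for i, lab in enumerate(labels):
        if lab is not None:
            continue  # already labelled, nothing to propose
        votes = Counter()
        for view, mapping in zip(views, per_view):
            if view[i] in mapping:  # skip clusters with no labelled members
                votes[mapping[view[i]]] += 1
        if votes:
            winner, n = votes.most_common(1)[0]
            if n * 2 > len(views):
                proposed[i] = winner
    return proposed


# Toy data: four labelled tweets and two unlabelled ones (None).
labels = ["abusive", "clean", "abusive", "clean", None, None]
# Hypothetical cluster ids from three feature sets.
views = [
    [0, 1, 0, 1, 0, 1],  # e.g. n-gram features
    [0, 1, 0, 1, 0, 0],  # e.g. embedding features
    [1, 0, 1, 0, 1, 0],  # e.g. a third feature set
]
proposed = vote_labels(views, labels)
print(proposed)
```

In this toy run, item 4 receives the same label from all three views, while item 5 is labelled by a two-to-one vote. Items without a strict majority are left unlabelled rather than guessed, which is one conservative design choice among several.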


Source journal

South African Computer Journal Social Sciences-Education

CiteScore: 1.30

Self-citation rate: 0.00%

Articles published:

Review time: 24 weeks

Journal description: The South African Computer Journal (SACJ) is a specialist ICT academic journal accredited by the South African Department of Higher Education and Training. SACJ publishes research articles, viewpoints and communications in English in Computer Science and Information Systems.