改进的半监督学习技术,用于自动检测Twitter上的南非辱骂性语言

Q3 Social Sciences
O. Oriola, E. Kotzé
{"title":"改进的半监督学习技术,用于自动检测Twitter上的南非辱骂性语言","authors":"O. Oriola, E. Kotzé","doi":"10.18489/sacj.v32i2.847","DOIUrl":null,"url":null,"abstract":"Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data, with small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on the majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks are evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier records the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.","PeriodicalId":55859,"journal":{"name":"South African Computer Journal","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2020-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter\",\"authors\":\"O. Oriola, E. Kotzé\",\"doi\":\"10.18489/sacj.v32i2.847\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data, with small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on the majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks are evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier records the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.\",\"PeriodicalId\":55859,\"journal\":{\"name\":\"South African Computer Journal\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"South African Computer Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18489/sacj.v32i2.847\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Social Sciences\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"South African Computer Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18489/sacj.v32i2.847","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Social Sciences","Score":null,"Total":0}
引用次数: 3

摘要

半监督学习是一种潜在的解决方案,可以在资源不足的辱骂性语言检测环境中改进训练数据,例如推特上的南非辱骂性语言测试。然而,现有的半监督学习方法已经偏向于具有小特征空间的少量标记数据。因此,本文提出了一种半监督学习技术,该技术通过基于对标记和未标记数据簇的不同特征集的多数投票,为未标记数据分配标签来改善训练数据的分布。该技术被应用于南非英语语料库,该语料库由标记和未标记的辱骂推文组成。将所提出的技术与最先进的基于句法和语义特征的自学习和主动学习技术进行了比较。用Logistic回归、支持向量机和神经网络对这些技术的性能进行了评估。所提出的技术的准确度和F1得分分别为0.97和0.95,优于现有的半监督学习技术。学习曲线表明,与现有技术相比,所提出的技术更有效地使用了训练数据。总体而言,使用逻辑回归分类器的n-gram句法特征记录了最高的性能。文章得出结论,所提出的半监督学习技术有效地检测到了推特上南非人的内隐和外显辱骂语言。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter
Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data, with small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on the majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks are evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier records the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
South African Computer Journal
South African Computer Journal Social Sciences-Education
CiteScore
1.30
自引率
0.00%
发文量
10
审稿时长
24 weeks
期刊介绍: The South African Computer Journal is specialist ICT academic journal, accredited by the South African Department of Higher Education and Training SACJ publishes research articles, viewpoints and communications in English in Computer Science and Information Systems.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信