Identification of hate speech and abusive language on Indonesian Twitter using the Word2vec, part of speech and emoji features

Muhammad Okky Ibrohim, Muhammad Akbar Setiadi, I. Budi
Published in: Proceedings of the 1st International Conference on Advanced Information Science and System, 2019-11-15
DOI: 10.1145/3373477.3373495 (https://doi.org/10.1145/3373477.3373495)
Citations: 17

Abstract

Freedom of speech on social media in Indonesia makes the spread of hate speech and abusive language inevitable. Left unaddressed, it can lead to social disharmony between individuals and communities. Identifying hate speech and abusive language in Indonesian-language tweets is challenging. Because semantic features such as word embeddings capture the meaning of a sentence, they can help in understanding tweets that contain hateful and abusive words. In this study, the word embedding (word2vec) feature and its combinations with part of speech and/or emoji features were used to identify hate speech and abusive language on Twitter in the Indonesian language. Combinations of unigram features with part of speech and/or emoji features were also evaluated. The classification algorithms used were Support Vector Machine, Random Forest Decision Tree, and Logistic Regression. The combination of unigram, part of speech, and emoji features obtained the highest accuracy, 79.85%, with an F-measure of 87.51%.
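The abstract does not say how the unigram, part of speech, and emoji feature groups are combined. One common approach, assumed here and not confirmed by the source, is to concatenate per-tweet count vectors from each group into a single feature vector. The sketch below illustrates that idea in plain Python. The POS lexicon, the emoji pattern, the example tweets, and all function names are hypothetical placeholders; a real system would use a trained Indonesian POS tagger and a complete emoji definition.

```python
import re
from collections import Counter

# Hypothetical toy POS lexicon (Indonesian word -> coarse tag), for illustration
# only. It stands in for a real POS tagger, which the abstract does not name.
TOY_POS_LEXICON = {
    "kamu": "PRON",
    "dia": "PRON",
    "bodoh": "ADJ",
    "baik": "ADJ",
    "sangat": "ADV",
    "makan": "VERB",
}
POS_TAGS = ["PRON", "ADJ", "ADV", "VERB", "OTHER"]

# Rough emoji matcher over a few common Unicode blocks (an assumption, not a
# complete definition of what counts as an emoji).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WORD_RE = re.compile(r"[a-z]+")


def tokenize(tweet):
    """Lowercase the tweet and keep runs of ASCII letters as tokens."""
    return WORD_RE.findall(tweet.lower())


def build_vocabulary(tweets):
    """Sorted unigram vocabulary over the given tweets."""
    return sorted({tok for tweet in tweets for tok in tokenize(tweet)})


def build_emoji_inventory(tweets):
    """Sorted list of distinct emoji characters seen in the given tweets."""
    return sorted({e for tweet in tweets for e in EMOJI_RE.findall(tweet)})


def extract_features(tweet, vocab, emojis):
    """Concatenate unigram counts, POS tag counts, and emoji counts."""
    tokens = tokenize(tweet)
    unigram_counts = Counter(tokens)
    # Words missing from the toy lexicon fall back to the OTHER tag.
    pos_counts = Counter(TOY_POS_LEXICON.get(tok, "OTHER") for tok in tokens)
    emoji_counts = Counter(EMOJI_RE.findall(tweet))
    return (
        [unigram_counts[w] for w in vocab]
        + [pos_counts[t] for t in POS_TAGS]
        + [emoji_counts[e] for e in emojis]
    )


# Made-up example tweets, not taken from the paper's dataset.
tweets = ["Kamu sangat bodoh 😡", "Dia baik 😊😊"]
vocab = build_vocabulary(tweets)
emojis = build_emoji_inventory(tweets)
feature_matrix = [extract_features(t, vocab, emojis) for t in tweets]
```

Each row of `feature_matrix` could then be passed to any of the classifiers named in the abstract (Support Vector Machine, Random Forest, or Logistic Regression). Replacing the unigram block with an averaged word2vec vector would give the embedding-based variant in the same way.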