Feature Extraction in Hierarchical Multi-Label Classification for Dangerous Speech Identification on Twitter Texts

D. Purwitasari, D. A. Navastara, Y. Findawati, Kresna Adhi Pramana, Agus Budi Raharjo
Published in: 2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE)
Publication date: 2023-02-16
DOI: 10.1109/ICCoSITE57641.2023.10127774 (https://doi.org/10.1109/ICCoSITE57641.2023.10127774)
Citations: 0

Abstract

Dangerous speech is a severe form of hate speech that causes harm such as violence, crime, social pressure, trauma, and despair, and that can lead to conflict between groups. Raw Twitter text must be preprocessed before it can serve as a training dataset for identifying hate speech, let alone dangerous speech; one reason is the way hate speech is expressed through mentions and hashtags. Because a raw Twitter post can vary in context and may or may not be hate speech, we treat the task as a hierarchical multi-label classification problem with three label types: hate speech status, issue, and danger level. The issues considered in this work are religion, ethnicity, and others. After preprocessing, the word-embedding step is combined with data under-sampling because the dataset is imbalanced. Additional stop-word dictionaries are also incorporated to handle language-specific vocabulary. To observe the effect of preprocessing on classification, we explore machine learning and deep learning methods such as SVM, Bi-LSTM, and BERT. After hyper-parameter tuning, we evaluate the models using subset accuracy and Hamming loss, chosen because the data are imbalanced, in addition to micro- and macro-averaged F1 scores.
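The four evaluation measures named in the abstract have standard definitions for multi-label data, and a small worked example may make the differences between them concrete. The sketch below is not the authors' implementation: the function names, the four-column label layout (hate speech, religion, ethnicity, dangerous), and the toy predictions are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Each sample is a binary indicator vector, one entry per label.
# Hypothetical column layout: [hate_speech, religion, ethnicity, dangerous]
LabelMatrix = Sequence[Sequence[int]]


def subset_accuracy(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Fraction of samples whose entire label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def hamming_loss(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: LabelMatrix, y_pred: LabelMatrix) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) over all label columns."""
    per_label: List[float] = []
    total_tp = total_fp = total_fn = 0
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append(_f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = _f1(total_tp, total_fp, total_fn)   # pools counts across labels
    macro = sum(per_label) / len(per_label)     # averages per-label F1 equally
    return micro, macro


# Toy data (invented for illustration only).
y_true = [[1, 1, 0, 1], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 1]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"subset accuracy = {subset_accuracy(y_true, y_pred):.3f}")  # 0.500
print(f"Hamming loss    = {hamming_loss(y_true, y_pred):.3f}")     # 0.125
print(f"micro-F1        = {micro:.3f}")                            # 0.857
print(f"macro-F1        = {macro:.3f}")                            # 0.667
```

In this toy case only two of sixteen label decisions are wrong, so the Hamming loss is low, yet half of the samples fail the exact-match test of subset accuracy. Macro-F1 falls below micro-F1 because the rare ethnicity label is missed entirely and macro averaging weights every label equally, which is why it is often reported for imbalanced label sets. In practice the same quantities are available in scikit-learn as `accuracy_score` (which computes subset accuracy for multi-label indicator input), `hamming_loss`, and `f1_score` with `average="micro"` or `average="macro"`.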