TOWARDS CURBING CYBER-BULLYING IN MALAYSIA BY AUTHOR IDENTIFICATION OF IBAN AND KADAZANDUSUN OSN TEXT USING DEEP LEARNING

JCR quartile: Q4 (Earth and Planetary Sciences)
Nursyahirah Tarmizi, Suhaila Saee, DayangSuryati Abang Ibrahim
Journal: ASEAN Engineering Journal. Published: 2023-05-31. DOI: 10.11113/aej.v13.19171 (https://doi.org/10.11113/aej.v13.19171)

Abstract

Online social networks (OSNs) are frequently used to carry out cybercrimes such as cyberbullying. As a developing country in Asia that keeps abreast of ICT advancement, Malaysia is no exception. Author Identification (AI) plays a vital role in social media forensic (SMF) investigation: it seeks to unveil the genuine identity of an offender by analysing the text written on an OSN by the candidate culprits. AI on OSN text faces several challenges, including limited text length and informal language full of internet jargon and grammatical errors, which degrade AI performance in SMF. Traditional AI systems designed for long text documents appear inadequate for analysing the writing style of short OSN texts. N-gram features have been shown to represent authors' writing style efficiently in short text. However, traditional representations of N-grams such as TF-IDF are sparse and make it difficult to capture the semantic information in the text. Moreover, most AI work has been done on English, while indigenous languages have received less attention. In East Malaysia, the dominant languages that transcend ethnic boundaries are Iban in Sarawak and KadazanDusun in Sabah, both of which are inherently under-resourced. This paper presents a proposed AI workflow for short OSN text in two under-resourced languages (U-RLs), using Iban and KadazanDusun tweets, to help curb cyberbullying in Malaysia. It compares TF-IDF (sparse) and state-of-the-art (SoA) embedding-based (dense) feature representations to observe which best represents the stylistic features of the authors' writing. Word, character, and part-of-speech (POS) N-grams were extracted as features. The representations were learned by different machine learning classifiers (Naïve Bayes, Random Forest, and SVM). A convolutional neural network (CNN), a SoA deep learning model for sentence classification, was tested against these traditional classifiers. Results were observed by combining different representation models and classifiers on three datasets (English, Iban, and KadazanDusun). The best result was achieved when the CNN learned embedding-based models with a combination of all features: KadazanDusun achieved the highest accuracy at 95.76%, followed by English at 95.02% and Iban at 94%.
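The sparse side of the comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the choice of character trigrams plus word unigrams and bigrams, the TF-IDF variant, and the sample texts are all assumptions made for illustration. POS N-grams are left out because they would require a POS tagger. The sketch shows why TF-IDF vectors over N-grams become sparse for short texts: each tweet activates only a small fraction of the shared vocabulary.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping lowercase word n-grams, joined with a space."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text):
    """Combine character trigrams with word unigrams and bigrams.

    The n-gram sizes are an illustrative choice, not taken from the paper.
    """
    feats = ["c3:" + g for g in char_ngrams(text.lower(), 3)]
    feats += ["w1:" + g for g in word_ngrams(text, 1)]
    feats += ["w2:" + g for g in word_ngrams(text, 2)]
    return feats


def tfidf_vectors(docs):
    """Return (vocabulary, list of sparse {feature: weight} dicts).

    Weight = tf * (ln(N / df) + 1), one common TF-IDF variant.
    """
    counts = [Counter(extract_features(d)) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts once per feature
    n_docs = len(docs)
    vocab = sorted(df)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({f: (k / total) * (math.log(n_docs / df[f]) + 1.0)
                        for f, k in c.items()})
    return vocab, vectors


def sparsity(vocab, vector):
    """Fraction of vocabulary entries that are zero for one document."""
    return 1.0 - len(vector) / len(vocab)


# Placeholder short texts, not data from the paper.
tweets = ["see you later lah", "c u l8r lah!!", "good morning everyone"]
vocab, vectors = tfidf_vectors(tweets)
for text, vec in zip(tweets, vectors):
    print(f"{text!r}: {len(vec)} non-zero of {len(vocab)} features, "
          f"sparsity {sparsity(vocab, vec):.2f}")
```

As more authors and tweets are added, the vocabulary grows while each tweet stays short, so the zero fraction rises further. This is the limitation that motivates the dense, embedding-based representations the paper compares against.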
Source journal: ASEAN Engineering Journal (Engineering, all). CiteScore: 0.60. Self-citation rate: 0.00%. Articles published: 75.