A Machine Learning Approach for the Classification of Methamphetamine Dealers on Twitter in Thailand.

Punnavich Khowrurk, R. Kongkachandra
Published in: 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)
Publication date: 2020-11-18
DOI: https://doi.org/10.1109/iSAI-NLP51646.2020.9376817
Citations: 2

Abstract

This research presents a method for classifying Twitter messages (tweets) related to methamphetamine into three classes: normal, seller, and buyer. Three models are evaluated: Multinomial Naive Bayes, Multi-Class LSTM, and Hierarchical LSTM. Each model is trained on both a balanced and an imbalanced dataset, and the training text is tokenized with four tokenizers: Tlex+, Lexto+, Attacut, and Deepcut. To study the effect on model performance, we train the models on each combination of dataset and tokenizer. The results show that all models can classify the messages into the three classes. On the balanced dataset, the most effective model is the Hierarchical LSTM with the Lexto+ tokenizer, which provides the highest accuracy. On the imbalanced dataset, the most effective model is the Multi-Class LSTM with the Lexto+ tokenizer; it gives the highest accuracy, but the Hierarchical LSTM gives a better F1-score in each class. The text classification models are built from Twitter messages, most of which contain Thai grammatical errors and heavy slang. We found that Lexto+ is the best tokenizer for building a model, although it differs little from the other tokenizers. The choice of dataset, in contrast, significantly affects model performance, and the balanced dataset is the best for building the model.
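The abstract distinguishes overall accuracy from the F1-score in each class, a distinction that matters most on an imbalanced dataset. The sketch below is only an illustration of that distinction, not the authors' evaluation code: the label counts and the two sets of predictions are hypothetical toy values, and the helper names `accuracy` and `per_class_f1` are assumptions introduced here.

```python
from collections import Counter

# The three classes named in the abstract.
CLASSES = ("normal", "seller", "buyer")


def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred, classes=CLASSES):
    """F1-score for each class: 2*TP / (2*TP + FP + FN)."""
    pairs = Counter(zip(y_true, y_pred))
    scores = {}
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical imbalanced toy labels: 16 normal, 2 seller, 2 buyer.
y_true = ["normal"] * 16 + ["seller"] * 2 + ["buyer"] * 2

# Hypothetical model A: labels every message "normal".
pred_a = ["normal"] * 20

# Hypothetical model B: finds every seller and buyer, but mislabels
# 5 normal messages (3 as seller, 2 as buyer).
pred_b = (["normal"] * 11 + ["seller"] * 3 + ["buyer"] * 2
          + ["seller"] * 2 + ["buyer"] * 2)

acc_a, acc_b = accuracy(y_true, pred_a), accuracy(y_true, pred_b)
f1_a, f1_b = per_class_f1(y_true, pred_a), per_class_f1(y_true, pred_b)

print("accuracy A:", acc_a, "B:", acc_b)
print("per-class F1 A:", f1_a)
print("per-class F1 B:", f1_b)
```

In this toy case model A has the higher accuracy while scoring zero F1 on the seller and buyer classes, which is why accuracy and per-class F1 are worth reporting together when classes are imbalanced.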