A Machine Learning Approach for the Classification of Methamphetamine Dealers on Twitter in Thailand
Authors: Punnavich Khowrurk, R. Kongkachandra
Published in: 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2020-11-18
DOI: 10.1109/iSAI-NLP51646.2020.9376817
This research presents a method for classifying Twitter messages (tweets) related to methamphetamine into three classes: normal, seller, and buyer. The models presented are Multinomial Naive Bayes, Multi-Class LSTM, and Hierarchical LSTM. The models are trained on both a balanced and an imbalanced dataset, and the training text is tokenized with four tokenizers: Tlex+, Lexto+, Attacut, and Deepcut. To study how these choices affect model performance, we run experiments with different datasets and tokenizers. The results show that all models can classify the messages into the three classes. On the balanced dataset, the most effective model is the Hierarchical LSTM with the Lexto+ tokenizer, which provides the highest accuracy. On the imbalanced dataset, the most effective model is the Multi-Class LSTM with the Lexto+ tokenizer: it gives the highest accuracy, although the Hierarchical LSTM achieves a better F1-score in each class. The text classification model is built from Twitter messages, most of which contain Thai grammatical errors and heavy slang usage. We found that Lexto+ is the best tokenizer for building a model, although it differs little from the other tokenizers. The choice of dataset, in contrast, significantly affects model performance, and the balanced dataset is the best one for building the model.
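The difference between overall accuracy and per-class F1-score on an imbalanced dataset can be illustrated with a small sketch. The labels and predictions below are invented for illustration only; they are not the paper's data or results, and the helper names (`accuracy`, `f1_per_class`, `macro_f1`) are hypothetical. The sketch assumes the usual definition F1 = 2TP / (2TP + FP + FN) for each class.

```python
from collections import Counter

CLASSES = ("normal", "seller", "buyer")


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred, classes=CLASSES):
    """Return {class: F1}, with F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(scores.values()) / len(scores)


# Invented, heavily imbalanced labels: 90 normal, 5 seller, 5 buyer.
y_true = ["normal"] * 90 + ["seller"] * 5 + ["buyer"] * 5

# Hypothetical model A: predicts the majority class for every tweet.
pred_a = ["normal"] * 100

# Hypothetical model B: mislabels 12 normal tweets as seller,
# but finds every seller and buyer tweet.
pred_b = ["normal"] * 78 + ["seller"] * 12 + ["seller"] * 5 + ["buyer"] * 5

for name, pred in (("A", pred_a), ("B", pred_b)):
    scores = f1_per_class(y_true, pred)
    print(name, "accuracy:", accuracy(y_true, pred))
    print(name, "per-class F1:", scores)
    print(name, "macro F1:", macro_f1(scores))
```

In this toy setup, model A has the higher accuracy because it always predicts the majority class, yet its F1-score for the seller and buyer classes is zero. Model B has lower accuracy but a higher F1-score on both minority classes, which is the kind of situation where accuracy alone can be misleading.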