Warunya Wunnasri, T. Theeramunkong, C. Haruechaiyasak
Title: Solving unbalanced data for Thai sentiment analysis
Published in: The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE)
Publication date: 2013-05-29
DOI: https://doi.org/10.1109/JCSSE.2013.6567345
Citation count: 5
Abstract
The microblogging service Twitter has grown dramatically among online users in Thailand. Communication on Twitter is lively and up to date, since users often express their feelings and sentiments in posts related to current or newly emerging topics. Sentiment analysis on Twitter faces language-related challenges, such as short messages and variation in word usage, and it also faces the problem of unbalanced classes: on Twitter, people tend to post complaints more often than praise. In this paper, we propose a sampling-based method to address data imbalance in Thai Twitter sentiment analysis. Three sampling methods, namely random sampling, largest complete-link sampling, and largest average-link sampling, are applied as a preprocessing step before a k-NN classifier. In the experiments, largest average-link sampling achieves the highest performance, with a macro-averaged F-measure of 0.57, compared with the unbalanced case.
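To make two of the ideas in the abstract concrete, the following is a minimal Python sketch of random sampling for class balancing and of the macro-averaged F-measure. It is not the authors' implementation. The abstract does not say whether sampling shrinks the majority class or grows the minority class, so undersampling to the smallest class size is an assumption here, and the function names `random_undersample` and `macro_f_measure` are hypothetical. The complete-link and average-link variants are not sketched because the abstract does not define them.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    Assumption: balancing is done by undersampling; the abstract does not
    specify the direction of sampling.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y in sorted(by_class):
        # rng.sample draws `target` distinct items without replacement.
        for x in rng.sample(by_class[y], target):
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: 8 negative posts and 2 positive posts (made-up labels).
    labels = ["neg"] * 8 + ["pos"] * 2
    samples = list(range(10))
    balanced_samples, balanced_labels = random_undersample(samples, labels)
    print(balanced_labels)
    print(macro_f_measure(["neg", "neg", "pos", "pos"],
                          ["neg", "neg", "neg", "pos"]))
```

Because the macro average weights every class equally, a classifier that ignores the minority class is penalized even when its overall accuracy is high, which is why this metric suits unbalanced sentiment data.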