{"title":"印尼语用户反馈的监督文本分类与重采样技术比较","authors":"Dhammajoti, J. Young, A. Rusli","doi":"10.1109/ICIC50835.2020.9288588","DOIUrl":null,"url":null,"abstract":"User feedback is one of the most important sources of information for improving the quality of software products. Our current research focuses on a software product that is often used in many universities, the E- Learning system. To reduce the effort of manually reading all submitted user feedback, building an automatic text classification using various machine learning approaches is a popular solution. However, there is often a challenge of imbalanced data that could jeopardize the ability of the machine to find the pattern and classify feedback correctly. Several techniques ranging from random resampling of data to artificially creating more data (e.g. SMOTE) have already been proposed for handling imbalanced data and show promising results in terms of performance. This paper aims to implement several numerical representations and implementing resampling techniques (to handling imbalanced data), which then are followed by evaluating some popular supervised machine learning classification algorithms, which are the Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Decision Tree. Finally, evaluating performance with and without using resampling techniques by macro-average F1 Scores. 
The results show generally the implementation of oversampling techniques leads to better performance, except in a few cases where under-sampling techniques perform better.","PeriodicalId":413610,"journal":{"name":"2020 Fifth International Conference on Informatics and Computing (ICIC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"A Comparison of Supervised Text Classification and Resampling Techniques for User Feedback in Bahasa Indonesia\",\"authors\":\"Dhammajoti, J. Young, A. Rusli\",\"doi\":\"10.1109/ICIC50835.2020.9288588\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"User feedback is one of the most important sources of information for improving the quality of software products. Our current research focuses on a software product that is often used in many universities, the E- Learning system. To reduce the effort of manually reading all submitted user feedback, building an automatic text classification using various machine learning approaches is a popular solution. However, there is often a challenge of imbalanced data that could jeopardize the ability of the machine to find the pattern and classify feedback correctly. Several techniques ranging from random resampling of data to artificially creating more data (e.g. SMOTE) have already been proposed for handling imbalanced data and show promising results in terms of performance. This paper aims to implement several numerical representations and implementing resampling techniques (to handling imbalanced data), which then are followed by evaluating some popular supervised machine learning classification algorithms, which are the Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Decision Tree. Finally, evaluating performance with and without using resampling techniques by macro-average F1 Scores. 
The results show generally the implementation of oversampling techniques leads to better performance, except in a few cases where under-sampling techniques perform better.\",\"PeriodicalId\":413610,\"journal\":{\"name\":\"2020 Fifth International Conference on Informatics and Computing (ICIC)\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 Fifth International Conference on Informatics and Computing (ICIC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICIC50835.2020.9288588\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 Fifth International Conference on Informatics and Computing (ICIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIC50835.2020.9288588","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Comparison of Supervised Text Classification and Resampling Techniques for User Feedback in Bahasa Indonesia
User feedback is one of the most important sources of information for improving the quality of software products. Our current research focuses on a software product widely used in universities: the e-learning system. To reduce the effort of manually reading every piece of submitted user feedback, building an automatic text classifier with machine learning is a popular solution. However, imbalanced data is a common challenge that can jeopardize the classifier's ability to find patterns and label feedback correctly. Several techniques, ranging from random resampling of the data to artificially synthesizing new examples (e.g. SMOTE), have been proposed for handling imbalanced data and show promising results. This paper implements several numerical text representations and resampling techniques for handling imbalanced data, then evaluates five popular supervised classification algorithms: Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Decision Tree. Performance with and without resampling is compared using macro-average F1 scores. The results show that oversampling techniques generally lead to better performance, except in a few cases where under-sampling performs better.
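The pipeline the abstract describes (numerical representation → resampling → supervised classifier → macro-average F1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the tiny Indonesian feedback set is invented for the example, TF-IDF stands in for the paper's representations, random oversampling stands in for SMOTE, scikit-learn's Logistic Regression stands in for the five compared classifiers, and evaluation is done on the training texts for brevity.

```python
import random
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical, imbalanced feedback set ("bug" vs. "request"); the study's
# real e-learning feedback data is not reproduced here.
texts = [
    "aplikasi error saat login", "halaman tidak bisa dibuka",
    "sistem crash ketika upload tugas", "video tidak bisa diputar",
    "tolong tambahkan fitur download materi", "mohon tambah notifikasi email",
]
labels = ["bug", "bug", "bug", "bug", "request", "request"]

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until all classes
    reach the majority-class count (random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == cls]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out

X_bal, y_bal = random_oversample(texts, labels)

# Numerical representation (TF-IDF) + one of the evaluated classifiers.
vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_bal), y_bal)

# Macro-average F1 weights each class equally, so minority-class
# performance is not drowned out by the majority class.
preds = clf.predict(vec.transform(texts))
macro_f1 = f1_score(labels, preds, average="macro")
print(round(macro_f1, 3))
```

In practice the held-out split would be stratified and only the training fold resampled, so that the macro-F1 on the test fold reflects genuine minority-class performance rather than duplicated examples.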