Classifying Crowdsourced Citizen Complaints through Data Mining: Accuracy Testing of k-Nearest Neighbors, Random Forest, Support Vector Machine, and AdaBoost

Impact Factor: 2.8 | JCR Q2 | Computer Science, Interdisciplinary Applications

Informatics | Pub Date: 2023-11-01 | DOI: 10.3390/informatics10040084

Evaristus D. Madyatmadja, Corinthias P. M. Sianipar, Cristofer Wijaya, David J. M. Sembiring


Citations: 0

Abstract

Crowdsourcing has gradually become an effective e-government process for gathering citizen complaints about the implementation of various public services. In practice, the collected complaints form a massive dataset, making it difficult for government officers to analyze the big data effectively. It is consequently vital to use data mining algorithms to classify the citizen complaint data for efficient follow-up actions. However, different classification algorithms produce varied classification accuracies. Thus, this study aimed to compare the accuracy of several classification algorithms on crowdsourced citizen complaint data. Taking the case of the LAKSA app in Tangerang City, Indonesia, this study included k-Nearest Neighbors, Random Forest, Support Vector Machine (SVM), and AdaBoost in the accuracy assessment. The data were taken from crowdsourced citizen complaints submitted to the LAKSA app, including those aggregated from official social media channels, from May 2021 to April 2022. The results showed SVM with a linear kernel to be the most accurate of the assessed algorithms (89.2%). In contrast, AdaBoost (base learner: Decision Trees) produced the lowest accuracy. Still, the accuracy levels of all algorithms varied with the amount of training data available for the actual classification categories. Overall, the assessments indicated that the accuracies of the algorithms did not differ significantly, with an overall variation of 4.3%. The AdaBoost-based classification, in particular, depended heavily on the choice of base learner. In terms of method and results, this study contributes to the e-government, data mining, and big data discourses. This research recommends that governments continuously conduct supervised training of classification algorithms on their crowdsourced citizen complaints to seek the highest accuracy possible, paving the way for smart and sustainable governance.
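The comparison described in the abstract comes down to scoring each classifier's predicted categories against the true ones, then looking at the gap between the best and worst scores and at how each category fares. The following is a minimal sketch of that bookkeeping only, not the authors' pipeline: the category names, the label lists, and the helper names (`accuracy`, `per_category_accuracy`, `accuracy_spread`) are all hypothetical, and the toy predictions are made up for illustration rather than taken from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of complaints whose predicted category matches the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_category_accuracy(y_true, y_pred):
    """Share of correctly classified complaints within each true category."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {category: hits[category] / n for category, n in totals.items()}


def accuracy_spread(scores):
    """Gap between the best and worst accuracy, in percentage points."""
    return (max(scores.values()) - min(scores.values())) * 100


# Hypothetical toy labels (NOT data from the study): 8 complaints, 3 categories,
# with "lighting" deliberately under-represented.
y_true = ["roads", "roads", "roads", "roads", "waste", "waste", "waste", "lighting"]

# Hypothetical predictions standing in for the output of four trained classifiers.
predictions = {
    "kNN": ["roads", "roads", "roads", "waste", "waste", "waste", "roads", "roads"],
    "Random Forest": ["roads", "roads", "roads", "roads", "waste", "waste", "roads", "roads"],
    "SVM (linear)": ["roads", "roads", "roads", "roads", "waste", "waste", "waste", "roads"],
    "AdaBoost": ["roads", "roads", "waste", "roads", "waste", "roads", "waste", "roads"],
}

scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.1%}")
print(f"best: {best}, spread: {accuracy_spread(scores):.1f} percentage points")
print(per_category_accuracy(y_true, predictions[best]))
```

In this toy setup even the best-scoring classifier misses the single "lighting" complaint, which illustrates how a category with few examples can drag down per-category accuracy while overall accuracy stays high.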


