Title: KOMPARASI METODE KOMBINASI SELEKSI FITUR DAN MACHINE LEARNING K-NEAREST NEIGHBOR PADA DATASET LABEL HOURS SOFTWARE EFFORT ESTIMATION (Comparison of Combined Feature Selection and k-Nearest Neighbor Machine Learning Methods on Hours-Labeled Software Effort Estimation Datasets)
Authors: I. Kurniawan, Ahmad Faiq Abror
Journal: Explore: Jurnal Sistem Informasi dan Telematika
Publication type: Journal Article
Published: 2019-10-02
DOI: 10.36448/jsit.v10i2.1314 (https://doi.org/10.36448/jsit.v10i2.1314)
Citations: 3
Abstract
Software Effort Estimation methods fall into two groups: non-Machine Learning (non-ML) and Machine Learning (ML) methods [1]. The k-Nearest Neighbor (k-NN) method cannot tolerate irrelevant features, which greatly reduce its accuracy. It also handles missing data poorly and suffers from feature-related problems such as irrelevant features, non-optimal feature weights, and duplicate features [2]. Software Effort Estimation datasets, in turn, still pose serious challenges in their characteristics, namely irrelevant features and the differing level of influence each feature has on the estimated effort [3]. This study compares k-NN on its own with combinations of a feature selection method and k-NN to determine which performs best. The results show that Forward Selection (FS) and Median Weighted Information Gain (Median-WIG), each combined with k-NN, can overcome the irrelevant-feature problem and improve accuracy, measured as a lower RMSE on the Software Effort Estimation datasets: 5,953 on the Albrecht dataset using Median-WIG k-NN, and 55,421 on the Miyazaki dataset and 123,081 on the Kemerer dataset using FS k-NN. Combining k-NN with feature selection therefore gives better estimates than k-NN alone. FS k-NN performs best overall, winning on two datasets (Miyazaki and Kemerer).
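To make the comparison concrete, the following is a minimal sketch of how forward selection can wrap a k-NN regressor and be scored by RMSE. It is not the authors' implementation. The abstract does not state the value of k, the distance measure, or the validation scheme, so Euclidean distance, k = 2, and leave-one-out validation are assumptions here. The function names and the eight-row toy dataset (one informative "size" feature, one irrelevant "noise" feature) are hypothetical and are not taken from the Albrecht, Miyazaki, or Kemerer data.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_X, train_y, query, k):
    """Predict effort as the mean target of the k nearest training rows."""
    pairs = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    nearest = pairs[:k]
    return sum(target for _, target in nearest) / len(nearest)

def loo_rmse(X, y, features, k=2):
    """Leave-one-out RMSE of k-NN using only the given feature indices."""
    rows = [[row[j] for j in features] for row in X]
    squared_error = 0.0
    for i in range(len(rows)):
        train_X = rows[:i] + rows[i + 1:]
        train_y = y[:i] + y[i + 1:]
        squared_error += (knn_predict(train_X, train_y, rows[i], k) - y[i]) ** 2
    return math.sqrt(squared_error / len(rows))

def forward_selection(X, y, k=2):
    """Greedily add the feature that lowers RMSE most; stop when none helps."""
    remaining = list(range(len(X[0])))
    selected, best = [], float("inf")
    while remaining:
        score, feature = min((loo_rmse(X, y, selected + [j], k), j) for j in remaining)
        if score >= best:
            break  # no candidate improves the current subset
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best

# Hypothetical toy data: column 0 ("size") drives effort, column 1 is noise.
X = [[4, 0], [8, 11], [1, 23], [6, 36], [2, 50], [7, 65], [3, 81], [5, 98]]
y = [40, 80, 10, 60, 20, 70, 30, 50]

selected, best = forward_selection(X, y, k=2)
baseline = loo_rmse(X, y, [0, 1], k=2)  # plain k-NN on all features
print("selected features:", selected)
print("RMSE with selection:", best)
print("RMSE with all features:", baseline)
```

On this toy data the search keeps only the informative feature, and plain k-NN on both columns has a higher RMSE because the large-scale noise column dominates the distance. This is the effect the abstract attributes to irrelevant features.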