Investigate the Impact of Resampling Techniques on Imbalanced Datasets: A Case Study in Plant Disease Prediction
A. Bhatia, A. Chug, A. Singh, Dinesh Singh
2021 Thirteenth International Conference on Contemporary Computing (IC3-2021), published 2021-08-05
DOI: 10.1145/3474124.3474164 (https://doi.org/10.1145/3474124.3474164)
Citation count: 0
Abstract
Plant disease prediction is drawing growing attention from scientists and agricultural experts. Predicting plant diseases with machine-learning algorithms is the foundation of early and efficient disease identification in plants. However, this area of agricultural science faces the challenge of imbalanced datasets. An imbalanced dataset can bias the results of machine learning models towards the majority class, i.e., the class containing the largest number of samples. This problem can be addressed with resampling techniques, which balance the dataset to improve the efficiency of machine learning models. Hence, the current study evaluates the impact of resampling techniques such as Importance Sampling, Random Over Sampling, Synthetic Minority Over-sampling Technique, and Random Under Sampling on imbalanced plant disease datasets, i.e., Tomato Powdery Mildew Disease and Soybean Large, using various machine-learning classifiers, i.e., Random Forest, Naïve Bayes, Multinomial Logistic Regression, and Bagged Classification and Regression Tree. The results show that, among all the resampling techniques, Random Over Sampling performed best on the Tomato Powdery Mildew Disease dataset, reaching 99.24% accuracy with the Random Forest classifier, whereas Synthetic Minority Over-sampling Technique performed best on the Soybean Large dataset, reaching 98.53% accuracy with the Bagged Classification and Regression Tree classifier.
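To make the two simplest resampling techniques named in the abstract concrete, the following is a minimal sketch of random over-sampling and random under-sampling in plain Python. It is not the authors' implementation: the function names `random_oversample` and `random_undersample`, the fixed seed, and the toy "healthy"/"diseased" data are assumptions introduced for illustration, and Importance Sampling and Synthetic Minority Over-sampling Technique are not shown.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Discard randomly chosen rows until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Keep a random subset, drawn without replacement.
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 8 majority-class rows and 2 minority-class rows.
rows = [[i] for i in range(10)]
labels = ["healthy"] * 8 + ["diseased"] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

In practice, resampling is usually applied only to the training split, so that the test data keeps the original class distribution.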