An Efficient Smote-based Model for Dyslexia Prediction

International Journal of Information Engineering and Electronic Business Pub Date : 2021-12-08 DOI:10.5815/ijieeb.2021.06.02

Vani Chakraborty, Meenatchi Sundaram


Citations: 3

Abstract

Dyslexia is a learning disability that makes it difficult for an individual to read, write, spell, and perform simple mathematical calculations. It affects almost 10% of the global population, and early detection is paramount for handling it effectively. There are many methods for detecting the risk of dyslexia, including assessment tools, handwriting recognition, expert psychological help, and eye-movement data recorded while reading. Another convenient way is to have an individual play a simple game that exercises phonological awareness, syllabic awareness, auditory discrimination, lexical awareness, visual working memory, and similar skills, and to record the observations. This work presents an effective way of predicting the risk of dyslexia with high accuracy and reliability. It applies various machine learning algorithms to a dataset available from the Kaggle repository. The dataset has an unequal distribution of positive and negative cases, so classification accuracy is compromised if the data is used directly. To reduce the imbalance, the work uses three resampling techniques: undersampling with the NearMiss algorithm, and oversampling with SMOTE and with ADASYN. After NearMiss undersampling, the SVC classifier gave the best accuracy, 81.63%; all other classifiers in the experiment produced accuracies in the range of 64% to 79.08%. After SMOTE oversampling, the classifiers produced very good results on accuracy, CV score, F1 score, and recall. Random Forest gave the maximum accuracy, 96.37%, closely followed by XGBoost and Gradient Boosting with an accuracy of 95.14%. Decision Tree, SVC, and AdaBoost achieved accuracies of 91.26%, 93.36%, and 93.48%, respectively. CV score, F1, and recall were also considerably high for all of these classifiers. After ADASYN oversampling, Random Forest gave the maximum accuracy, 96.25%. Of the two oversampling techniques, SMOTE produced slightly better evaluation metrics than ADASYN. The proposed system is highly reliable and can therefore be used effectively for detecting the risk of dyslexia.
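
To make the oversampling step concrete, here is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' code. The function name `smote_oversample`, the toy data, and the parameter values are assumptions made up for illustration, and a real experiment would more likely use a maintained library implementation than this hand-rolled version.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy imbalanced data (hypothetical, not the Kaggle dataset): 8 negatives, 3 positives
negatives = [[float(x), float(x % 3)] for x in range(8)]
positives = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Generate just enough synthetic positives to match the number of negatives
new_positives = smote_oversample(
    positives, n_new=len(negatives) - len(positives), k=2
)
balanced_positives = positives + new_positives
```

In an evaluation like the one described in the abstract, resampling of this kind should be applied only to the training split, so that held-out accuracy, F1, and recall are measured on real, unsynthesized cases.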



