An Efficient Smote-based Model for Dyslexia Prediction

Vani Chakraborty, Meenatchi Sundaram
International Journal of Information Engineering and Electronic Business. Published 2021-12-08. DOI: 10.5815/ijieeb.2021.06.02 (https://doi.org/10.5815/ijieeb.2021.06.02)
Citations: 3

Abstract

Dyslexia is a learning disability that makes it difficult for an individual to read, write, spell and do simple mathematical calculations. It affects almost 10% of the global population, and early detection is essential for managing it effectively. Several methods exist for detecting the risk of dyslexia, including assessment tools, handwriting recognition, expert psychological evaluation and analysis of eye-movement data recorded while reading. Another convenient approach is to have an individual play a simple game that exercises phonological awareness, syllabic awareness, auditory discrimination, lexical awareness, visual working memory and related skills, and to record the observations. This work presents a way of predicting the risk of dyslexia with high accuracy and reliability. It applies several machine learning algorithms to a dataset available from the Kaggle repository. The dataset has an unequal distribution of positive and negative cases, so classification accuracy is compromised if it is used directly. Three resampling techniques are therefore used to reduce the imbalance: undersampling with the NearMiss algorithm, and oversampling with SMOTE and with ADASYN. After NearMiss undersampling, the best accuracy was 81.63%, given by the SVC classifier; all other classifiers in the experiment produced accuracies in the range of 64% to 79.08%. After SMOTE oversampling, the classifiers produced very good results on accuracy, CV score, F1 score and recall. The maximum accuracy, 96.37%, was given by RandomForest, closely followed by XGBoost and GradientBoosting with an accuracy of 95.14%. Decision tree, SVC and AdaBoost achieved accuracies of 91.26%, 93.36% and 93.48% respectively. CV score, F1 and recall were also considerably high for all these classifiers. After ADASYN oversampling, RandomForest gave the maximum accuracy of 96.25%. Of the two oversampling techniques, SMOTE produced slightly better evaluation metrics than ADASYN. The proposed system is highly reliable and can therefore be used effectively for detecting the risk of dyslexia.
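
To make the oversampling step concrete, the following is a minimal sketch of the core idea behind SMOTE: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority-class neighbours. This is not the authors' code, and the abstract does not say which implementation they used. The function name `smote`, its parameters and the toy data are assumptions made for illustration only. The sketch also simplifies the published algorithm by picking base samples at random instead of generating a fixed number per minority sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper).

    Each synthetic point is interpolated between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than peers
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy imbalanced data (made-up values): 8 negative cases, 3 positive cases.
negatives = [(float(x), float(y)) for x in range(4) for y in range(2)]
positives = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]

# Generate enough synthetic positives to match the negative class size.
new_points = smote(positives, n_synthetic=len(negatives) - len(positives), k=2)
balanced_positives = positives + new_points
print(len(negatives), len(balanced_positives))
```

Because every synthetic point is an interpolation between two existing positive cases, the new samples stay inside the region already occupied by the minority class. In practice one would likely reach for a maintained implementation such as the `SMOTE`, `ADASYN` and `NearMiss` classes in the imbalanced-learn library, and apply resampling to the training split only so that the test data keep their original class distribution.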