Title: An attention-based loss function and synthetic minority oversampling technique for alleviating class imbalance in predicting diabetes
Authors: Santanu Roy, Reshma Rachel Cherish, Gifty Roy
Journal: Healthcare analytics (New York, N.Y.), volume 7, Article 100399
Publication date: 2025-06-01
Publication type: Journal Article
DOI: 10.1016/j.health.2025.100399
URL: https://www.sciencedirect.com/science/article/pii/S2772442525000188
Citations: 0
Abstract
Diabetes is a chronic disease characterized by elevated blood sugar (glucose) levels. This study proposes a novel attention-based loss function and a lightweight artificial neural network (ANN) called Diabetic Lite (DB-Lite) for diabetes prediction on the Pima Indian Diabetes Dataset (PIDD). We show that the Pima dataset poses several challenges: it is small and imbalanced, and many of its features are non-linearly correlated. The novelties of this work are as follows: (i) An attention-based binary cross entropy (ABCE) loss function is proposed for the first time to alleviate the statistical imbalance present in the Pima dataset. This ABCE loss is incorporated into the DB-Lite model, which is trained from scratch. (ii) A Swish activation function is used in the hidden layer of DB-Lite instead of the Rectified Linear Unit (ReLU) to handle the non-linear dependency between the features and the final outcome. (iii) The synthetic minority oversampling technique (SMOTE) is used as a pre-processing step to mitigate the class imbalance in the Pima dataset. (iv) An adaptive learning rate is used during training to speed up the convergence of the DB-Lite model. Our final proposed framework achieves 99.7% accuracy, 99.4% precision, 99.8% recall, and 99.6% F1 score in testing, which is the best result on the Pima dataset. Welch's t-test (a statistical hypothesis test) and 10-fold cross-validation are used to validate the proposed loss function.
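Three of the building blocks named in the abstract have standard textbook definitions, which can be sketched as follows. This is a minimal illustration in plain Python and not the authors' implementation: the function names (`swish`, `relu`, `smote_like_sample`, `welch_t`) and the neighbour count `k` are assumptions made here for illustration. The ABCE loss is not sketched because the abstract does not give its formula.

```python
import math
import random
from statistics import mean, variance


def swish(x: float) -> float:
    """Swish activation, x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def relu(x: float) -> float:
    """ReLU for comparison: zero for every negative input."""
    return max(0.0, x)


def smote_like_sample(minority, k=2, rng=None):
    """Core SMOTE step: interpolate between a random minority point
    and one of its k nearest minority-class neighbours.
    (Hypothetical helper; a real pipeline would use a tested library.)"""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment base -> neighbour
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))
```

Unlike ReLU, Swish returns small non-zero values for negative inputs, which is the usual motivation for preferring it in shallow networks. `welch_t` would typically be applied to per-fold scores from two competing configurations, for example the ten scores produced by 10-fold cross-validation with and without a given loss function.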