DNA sequence classification for diabetes mellitus using NuSVC and XGBoost: A comparative.

IF 2.6 | Region 3 (multidisciplinary journals) | Q1 MULTIDISCIPLINARY SCIENCES

PLoS ONE Pub Date : 2025-07-18 eCollection Date: 2025-01-01 DOI:10.1371/journal.pone.0328253

Said A Salloum, Khaled Mohammad Alomari, Ayham Salloum

PLoS ONE 20(7): e0328253. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12273983/pdf/

Citations: 0

Abstract

Diabetes Mellitus is a global health concern, characterized by high blood sugar levels over a prolonged period, leading to severe complications if left unmanaged. The early identification of individuals at risk is critical for effective intervention and treatment. Traditional diagnostic methods rely heavily on clinical symptoms and biochemical tests, which may not capture the underlying genetic predispositions. With the advent of genomics, DNA sequence analysis has emerged as a promising approach to uncover the genetic markers associated with Diabetes Mellitus. However, the challenge lies in accurately classifying DNA sequences to predict susceptibility to the disease, given the complex nature of genetic data. This study addresses this challenge by employing two advanced machine learning models, NuSVC (Nu-Support Vector Classification) and XGBoost (Extreme Gradient Boosting), to classify DNA sequences related to Diabetes Mellitus. The dataset, obtained from reputable sources like NCBI, was preprocessed using Natural Language Processing (NLP) techniques, where DNA sequences were treated as textual data and transformed into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency). To handle the class imbalance in the dataset, SMOTE (Synthetic Minority Over-sampling Technique) was applied. The models were trained and validated using 10-fold cross-validation. XGBoost was trained with up to 300 boosting rounds, and performance was evaluated using accuracy, precision, recall, F1-score, ROC-AUC, and log loss. The results demonstrate that XGBoost outperformed NuSVC across all metrics, achieving an accuracy of 98%, a log loss of 0.0650, and an AUC of 1.00, compared to NuSVC's accuracy of 87%, log loss of 0.2649, and AUC of 0.95. The superior performance of XGBoost indicates its robustness in handling complex genetic data and its potential utility in clinical applications for early diagnosis of Diabetes Mellitus. 
The findings of this study underscore the importance of advanced machine learning techniques in genomics and suggest that integrating such models into healthcare systems could significantly enhance predictive diagnostics.
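
The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code. The k-mer size (k=3), the smoothed TF-IDF variant, the random-partner interpolation used in place of full nearest-neighbour SMOTE, and the toy sequences are all assumptions made for illustration; in practice, libraries such as scikit-learn, imbalanced-learn, and XGBoost provide tested implementations of these steps.

```python
import math
import random
from collections import Counter


def kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mers ("words"). k=3 is an assumption."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(docs):
    """Return (vocabulary, matrix) of L2-normalised TF-IDF vectors.

    Uses the smoothed IDF ln((1 + n) / (1 + df)) + 1; other variants exist.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix


def smote_like(minority, n_new, rng):
    """Create synthetic minority samples by interpolating between two minority points.

    Simplification: real SMOTE interpolates towards one of the k nearest
    neighbours; here the partner is drawn at random from the minority class.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.random()
        synthetic.append([x + gap * (y - x) for x, y in zip(a, b)])
    return synthetic


def log_loss(y_true, p_pos, eps=1e-15):
    """Binary log loss: mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Toy sequences and labels, invented for illustration only.
sequences = ["ATGCGTAC", "ATGCGTTC", "GGGCCCAA", "GGGCCTAA", "ATGAGTAC"]
labels = [0, 0, 1, 1, 0]

# Treat each sequence as a "document" of k-mer words and vectorise it.
vocab, X = tfidf([kmers(s) for s in sequences])

# Balance the classes (3 vs 2) by adding one synthetic minority sample.
minority = [x for x, y in zip(X, labels) if y == 1]
rng = random.Random(0)
X_bal = X + smote_like(minority, n_new=1, rng=rng)
y_bal = labels + [1]
```

When oversampling is combined with cross-validation, it is normally applied only to the training portion of each fold, so that synthetic points derived from held-out samples do not leak into evaluation.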



Source journal

PLoS ONE (Biology)

CiteScore: 6.20
Self-citation rate: 5.40%
Articles published: 14242
Review time: 3.7 months

About the journal: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage