Linear B-cell epitope prediction for SARS and COVID-19 vaccine design: Integrating balanced ensemble learning models and resampling strategies.

IF 3.5 4区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
PeerJ Computer Science Pub Date : 2025-06-18 eCollection Date: 2025-01-01 DOI:10.7717/peerj-cs.2970
Fatih Gurcan
{"title":"Linear B-cell epitope prediction for SARS and COVID-19 vaccine design: Integrating balanced ensemble learning models and resampling strategies.","authors":"Fatih Gurcan","doi":"10.7717/peerj-cs.2970","DOIUrl":null,"url":null,"abstract":"<p><p>This study presents a comprehensive machine learning framework to enhance the prediction accuracy of B-cell epitopes and antibody recognition related to Severe Acute Respiratory Syndrome (SARS) and Coronavirus Disease 2019 (COVID-19). To address the issue of data imbalance, various resampling techniques were applied using three types of strategies: over-sampling, under-sampling, and hybrid-sampling. The implemented resampling methods were designed to improve class balance and enhance model training. The rebalanced datasets were then used in model building with ensemble classifiers employing Boosting, Bagging, and Balancing strategies. Hyperparameter optimization for the classifiers was conducted using GridSearchCV, while feature selection was performed with the recursive feature elimination (RFE) algorithm. Model performance was evaluated using seven different metrics: Accuracy, Precision, Recall, F1-score, receiver operating characteristic area under the curve (ROC AUC), precision recall area under the curve (PR AUC), and Matthews correlation coefficient (MCC). Furthermore, statistical significance analyses including paired t-test, Wilcoxon, and permutation tests confirmed the reliability of the model improvements. To interpret the model's predictive behavior, peptides with the highest confidence among correctly classified instances were identified as potential epitope candidates. The results indicate that the combination of Synthetic Minority Over-Sampling Technique-Edited Nearest Neighbors (SMOTE-ENN), and ExtraTrees yielded the best performance, achieving an ROC AUC score of 0.9899. The combination of Instance Hardness Threshold (IHT) and ExtraTrees followed closely with a score of 0.9799. These findings emphasize the effectiveness of integrating resampling models and balancing ensemble classifiers in improving the accuracy of B-cell epitope prediction and antibody recognition for SARS and COVID-19 infections. This study contributes to vaccine development efforts and the advancement of immunoinformatics research by identifying promising epitope candidates.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e2970"},"PeriodicalIF":3.5000,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12193457/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2970","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

This study presents a comprehensive machine learning framework to enhance the prediction accuracy of B-cell epitopes and antibody recognition related to Severe Acute Respiratory Syndrome (SARS) and Coronavirus Disease 2019 (COVID-19). To address the issue of data imbalance, various resampling techniques were applied using three types of strategies: over-sampling, under-sampling, and hybrid-sampling. The implemented resampling methods were designed to improve class balance and enhance model training. The rebalanced datasets were then used in model building with ensemble classifiers employing Boosting, Bagging, and Balancing strategies. Hyperparameter optimization for the classifiers was conducted using GridSearchCV, while feature selection was performed with the recursive feature elimination (RFE) algorithm. Model performance was evaluated using seven different metrics: Accuracy, Precision, Recall, F1-score, receiver operating characteristic area under the curve (ROC AUC), precision recall area under the curve (PR AUC), and Matthews correlation coefficient (MCC). Furthermore, statistical significance analyses including paired t-test, Wilcoxon, and permutation tests confirmed the reliability of the model improvements. To interpret the model's predictive behavior, peptides with the highest confidence among correctly classified instances were identified as potential epitope candidates. The results indicate that the combination of Synthetic Minority Over-Sampling Technique-Edited Nearest Neighbors (SMOTE-ENN), and ExtraTrees yielded the best performance, achieving an ROC AUC score of 0.9899. The combination of Instance Hardness Threshold (IHT) and ExtraTrees followed closely with a score of 0.9799. These findings emphasize the effectiveness of integrating resampling models and balancing ensemble classifiers in improving the accuracy of B-cell epitope prediction and antibody recognition for SARS and COVID-19 infections. This study contributes to vaccine development efforts and the advancement of immunoinformatics research by identifying promising epitope candidates.

线性b细胞表位预测用于SARS和COVID-19疫苗设计:整合平衡集成学习模型和重采样策略
本研究提出了一个全面的机器学习框架,以提高与严重急性呼吸系统综合征(SARS)和冠状病毒病2019 (COVID-19)相关的b细胞表位和抗体识别的预测准确性。为了解决数据不平衡的问题,使用了三种类型的策略应用了各种重采样技术:过采样、欠采样和混合采样。实现的重采样方法旨在改善类平衡和增强模型训练。然后将重新平衡的数据集与采用Boosting、Bagging和Balancing策略的集成分类器一起用于模型构建。使用GridSearchCV对分类器进行超参数优化,使用递归特征消除(RFE)算法进行特征选择。使用七个不同的指标来评估模型的性能:准确性、精密度、召回率、f1评分、受试者工作特征曲线下面积(ROC AUC)、精确召回面积(PR AUC)和马修斯相关系数(MCC)。此外,包括配对t检验、Wilcoxon检验和排列检验在内的统计显著性分析证实了模型改进的可靠性。为了解释模型的预测行为,在正确分类的实例中具有最高置信度的肽被确定为潜在的表位候选者。结果表明,合成少数过采样技术-编辑近邻(SMOTE-ENN)与ExtraTrees组合的效果最好,ROC AUC得分为0.9899。实例硬度阈值(IHT)与extratree组合次之,得分为0.9799。这些发现强调了整合重采样模型和平衡集合分类器在提高SARS和COVID-19感染的b细胞表位预测和抗体识别准确性方面的有效性。本研究通过确定有希望的候选表位,有助于疫苗开发工作和免疫信息学研究的进步。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
PeerJ Computer Science
PeerJ Computer Science Computer Science-General Computer Science
CiteScore
6.10
自引率
5.30%
发文量
332
审稿时长
10 weeks
期刊介绍: PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信