HEPAD: enhancing hemolytic peptide prediction with adaptive feature engineering and diverse sequence descriptors.

Impact factor 3.3; Zone 3, Biology; JCR Q2, Biochemical Research Methods

BMC Bioinformatics Pub Date : 2025-10-01 DOI:10.1186/s12859-025-06254-6

Sih-Han Chen, Jen-Chieh Yu, Yi-Hsiang Lin, Shao-Chun Kuo, Kuan Ni, Ching-Tai Chen

BMC Bioinformatics 26(1):234. Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12486866/pdf/


Abstract

Background: Peptides have emerged as promising therapeutic agents for drug development against cancer, immune disorders, hypertension, and microbial infections. Compared to traditional small-molecule drugs, peptide drugs offer high selectivity, low production cost, and fewer side effects. However, one main challenge hindering the adoption of peptide therapeutics is that some peptides are hemolytic: they disrupt erythrocyte membranes and shorten the life span of red blood cells. A computational model for hemolytic peptide identification would be a valuable tool for peptide drug discovery.

Results: In this study, we present HEPAD, a machine learning predictor that identifies hemolytic peptides using adaptive feature engineering and diverse sequence descriptors. Sequence descriptors were applied for feature encoding, generating a feature vector of nearly 4000 numeric values for each peptide. Next, an adaptive feature engineering method was proposed to produce a customized feature subset for a given dataset. The four datasets considered in this study were associated with 250, 350, 90, and 130 selected features. Five machine learning methods of different rationale were employed to perform cross-validation and independent tests. HEPAD yields Matthews correlation coefficients (MCCs) of 0.973, 0.643, and 0.609, respectively, on three independent datasets. The improvements in MCC over existing approaches range from 1.9 to 13.3% across the three independent tests. Moreover, data visualization reveals that the customized feature subsets can effectively separate hemolytic peptides from random peptides.
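For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how an MCC is computed for a binary classifier from confusion-matrix counts. It is not taken from the HEPAD source code; the function name `mcc` and the example labels are hypothetical and serve only to illustrate the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Assumed label convention (illustrative): 1 = hemolytic, 0 = non-hemolytic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(mcc(y_true, y_pred))  # 0.5
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, which is why it is often preferred over accuracy when the hemolytic and non-hemolytic classes are imbalanced.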

Conclusions: HEPAD offers efficient identification of potential hemolytic peptides, thereby expediting experimental procedures in drug discovery. The source code, datasets, and machine learning models are available at https://github.com/csh07/HEPAD .

Source journal: BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.