HEPAD: enhancing hemolytic peptide prediction with adaptive feature engineering and diverse sequence descriptors
Authors: Sih-Han Chen, Jen-Chieh Yu, Yi-Hsiang Lin, Shao-Chun Kuo, Kuan Ni, Ching-Tai Chen
Journal: BMC Bioinformatics 26(1):234, published 2025-10-01
DOI: https://doi.org/10.1186/s12859-025-06254-6
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12486866/pdf/
Abstract
Background: Peptides have emerged as promising therapeutic agents for drug development against cancer, immune disorders, hypertension, and microbial infections. Compared with traditional small-molecule drugs, peptide drugs offer high selectivity, low production cost, and fewer side effects. However, one main challenge hindering the adoption of peptide therapeutics is that some peptides are hemolytic: they disrupt erythrocyte membranes and shorten the life span of red blood cells. A computational model for hemolytic peptide identification would therefore be a valuable tool for peptide drug discovery.
Results: In this study, we present HEPAD, a machine learning predictor that identifies hemolytic peptides using adaptive feature engineering and diverse sequence descriptors. Sequence descriptors were applied for feature encoding, generating a feature vector of nearly 4000 numeric values for each peptide. Next, an adaptive feature engineering method was proposed to produce a customized feature subset for a given dataset. The four datasets considered in this study were associated with 250, 350, 90, and 130 selected features. Five machine learning methods based on different rationales were employed for cross-validation and independent tests. HEPAD yields Matthews correlation coefficients (MCCs) of 0.973, 0.643, and 0.609, respectively, on three independent datasets. The improvements in MCC over existing approaches range from 1.9% to 13.3% across the three independent tests. Moreover, data visualization reveals that the customized feature subsets can effectively separate hemolytic peptides from random peptides.
Conclusions: HEPAD offers efficient identification of potential hemolytic peptides, thereby expediting experimental procedures in drug discovery. The source code, datasets, and machine learning models are available at https://github.com/csh07/HEPAD .
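To make two ideas from the Results paragraph concrete, the following is a minimal sketch, not HEPAD's implementation. It shows one common sequence descriptor (amino acid composition) and how an MCC is computed from binary predictions. The abstract does not name the specific descriptors HEPAD uses, so the choice of amino acid composition, the function names `amino_acid_composition` and `mcc`, and the label convention (1 = hemolytic, 0 = non-hemolytic) are all assumptions made for illustration.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed order so every peptide
# maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide: str) -> list[float]:
    """Return the fraction of each standard residue in the peptide (20 values).

    Illustrative descriptor only; HEPAD's actual descriptors may differ.
    """
    counts = Counter(peptide.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("peptide contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mcc(y_true: list[int], y_pred: list[int]) -> float:
    """Matthews correlation coefficient for binary labels.

    Assumed convention: 1 = hemolytic, 0 = non-hemolytic.
    Returns 0.0 when the denominator is zero (a common convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy, made-up sequence and labels; not taken from the HEPAD datasets.
    features = amino_acid_composition("AAK")
    print(len(features), features[0])
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC ranges from -1 to 1 and stays informative when the two classes are imbalanced, which is one reason it is often reported for peptide classifiers. In practice, scikit-learn's `sklearn.metrics.matthews_corrcoef` can replace the hand-written `mcc` above.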
About the journal:
BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology.
BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.