iACP-DPNet: a dual-pooling causal dilated convolutional network for interpretable anticancer peptide identification.

IF 3.1 | Zone 4 (Biology) | JCR Q1 (Genetics & Heredity)

Functional & Integrative Genomics Pub Date : 2025-07-04 DOI:10.1007/s10142-025-01641-x

Zimeng Zhang, Xin Wang, Wenhui Shang

Functional & Integrative Genomics 25(1): 147. Publication type: Journal Article.

Citations: 0

Abstract

Anticancer peptides (ACPs) show promise in cancer therapy because of their safety, low side effects, and high target specificity. However, ACP discovery is slowed by costly, labor-intensive experimental validation, so the number of confirmed ACPs remains limited. Although various computational methods have been proposed, existing models commonly suffer from three limitations: reliance on small-scale datasets, a lack of interpretable feature-learning mechanisms, and insufficient generalization capability. To address these challenges, this study constructs a larger and more diverse dataset by consolidating data from existing literature and databases, and proposes a novel deep learning predictive model named iACP-DPNet. The model uses the protein language model ProtBert with positional encoding to convert protein sequences into feature vectors, then applies a two-step feature selection process via LightGBM and MIC. The selected features are processed by a causal dilated convolution network. A dual-pooling mechanism, which runs GlobalAveragePooling and attention pooling layers in parallel, is designed to help the model jointly capture local critical residues and global sequence context. Compared to traditional single-pooling models (e.g., ACP-MHCNN), this architecture significantly improves feature extraction capability. To understand the model's decision-making process, we employ t-SNE to visualize key steps, ISM to interpret sequence regions, and SHAP analysis to evaluate feature importance. These approaches significantly improve the model's interpretability. Under rigorous 10-fold cross-validation on the new dataset, the model achieves a specificity (Sp) of 96.1%, a sensitivity (Sn) of 92.91%, an accuracy (Acc) of 94.5%, and an MCC of 89.05%, significantly outperforming the existing state-of-the-art methods in comparative analyses.
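For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how Sn, Sp, Acc, and MCC are conventionally computed from a binary confusion matrix. It is not the authors' evaluation code: the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration. Note that the abstract reports MCC as a percentage, which corresponds to the coefficient below multiplied by 100.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts.

    tp/fn count positives (ACPs) predicted correctly/incorrectly;
    tn/fp count negatives (non-ACPs) predicted correctly/incorrectly.
    """
    total = tp + tn + fp + fn
    # Sensitivity: fraction of true ACPs that are recovered.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-ACPs that are correctly rejected.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all predictions that are correct.
    acc = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient, in [-1, 1]; defined as 0 when
    # any marginal count is zero (a common convention).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    for name, value in binary_metrics(tp=90, tn=95, fp=5, fn=10).items():
        print(f"{name}: {100 * value:.2f}%")
```

Because MCC uses all four cells of the confusion matrix, it is less sensitive to class imbalance than accuracy, which is one reason it is commonly reported alongside Sn and Sp.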
Furthermore, to assess its generalizability, we evaluated iACP-DPNet on an additional dataset, where it outperformed other current models. In conclusion, iACP-DPNet exhibits strong performance and generalizability, demonstrating the effectiveness of its design for ACP prediction. This work provides a robust and interpretable framework for advancing anticancer peptide discovery.

HIGHLIGHTS:

• We established a larger and more diverse dataset for ACP prediction, addressing the limitations of existing datasets and providing a robust foundation for model training and evaluation.
• A dual-pooling mechanism (GlobalAveragePooling and attention pooling) strengthens the model's capacity to learn diverse features, ultimately improving its predictive performance.
• We employed t-SNE visualization and ISM-based interpretability analysis to provide insight into the model's decision-making process, highlighting key regions and amino acids critical for ACP function.
• The iACP-DPNet model demonstrates strong generalizability across diverse datasets, making it a reliable tool for ACP prediction and potentially for other peptide-related tasks.
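The two architectural ideas named in the abstract, causal dilated convolution and dual pooling, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the abstract does not specify kernel sizes, dilation rates, the attention scoring function, or how the two pooled vectors are merged. The dot-product attention with a `query` vector and the concatenation in `dual_pool` are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k*dilation].

    Positions before the start of the sequence are treated as zero,
    so y[t] never depends on x[t+1], x[t+2], ... (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def global_average_pool(h):
    """Average each channel over all positions; h is a list of L vectors."""
    return [sum(col) / len(h) for col in zip(*h)]


def attention_pool(h, query):
    """Softmax-weighted sum over positions (assumed dot-product scoring)."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in h]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v[c] for w, v in zip(weights, h))
              for c in range(len(h[0]))]
    return pooled, weights


def dual_pool(h, query):
    """Run both pooling branches in parallel and concatenate (assumption)."""
    attended, _ = attention_pool(h, query)
    return global_average_pool(h) + attended


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    # Two hypothetical channels built from the same toy input signal.
    ch0 = causal_dilated_conv1d(x, [1.0, 1.0], dilation=2)
    ch1 = causal_dilated_conv1d(x, [1.0, -1.0], dilation=1)
    h = [list(pair) for pair in zip(ch0, ch1)]
    print(dual_pool(h, query=[1.0, 0.0]))
```

The average-pooling branch summarizes the whole sequence uniformly, while the attention branch can concentrate its weight on a few positions, which matches the abstract's stated goal of modeling global sequence context and local critical residues together.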


Source journal

Functional & Integrative Genomics (Biology: Genetics)

CiteScore: 3.50

Self-citation rate: 3.40%

Review time: 2 months

About the journal: Functional & Integrative Genomics is devoted to large-scale studies of genomes and their functions, including systems analyses of biological processes. The journal will provide the research community with an integrated platform where researchers can share, review and discuss their findings on important biological questions that will ultimately enable us to answer the fundamental question: How do genomes work?