Enhancing Enzyme Commission Number Prediction With Contrastive Learning and Agent Attention
Authors: Wendi Zhao, Qiaoling Han, Fan Yang, Yue Zhao
Journal: Proteins-Structure Function and Bioinformatics (impact factor 3.2; JCR Q2, Biochemistry & Molecular Biology)
Publication date: 2025-04-02
Publication type: Journal Article
DOI: https://doi.org/10.1002/prot.26822
Citations: 0
Abstract
The accurate prediction of enzyme function is crucial for elucidating disease mechanisms and identifying drug targets. Nevertheless, existing enzyme commission (EC) number prediction methods are limited by database coverage and the depth of sequence information mining, hindering the efficiency and precision of enzyme function annotation. Therefore, this study introduces ProteEC-CLA (Protein EC number prediction model with Contrastive Learning and Agent Attention). ProteEC-CLA utilizes contrastive learning to construct positive and negative sample pairs, which not only enhances sequence feature extraction but also improves the utilization of unlabeled data. This process helps the model learn the differences in sequence features, thereby enhancing its ability to predict enzyme function. Integrating the pre-trained protein language model ESM2, the model generates informative sequence embeddings for deep functional correlation analysis, significantly enhancing prediction accuracy. With the incorporation of the Agent Attention mechanism, ProteEC-CLA's ability to comprehensively capture local details and global features is enhanced, ensuring high-accuracy predictions on complex sequences. The results demonstrate that ProteEC-CLA performs exceptionally well on two independent and representative datasets. In the standard dataset, it achieves 98.92% accuracy at the EC4 level. In the more challenging clustered split dataset, ProteEC-CLA achieves 93.34% accuracy and an F1-score of 94.72%. With only enzyme sequences as input, ProteEC-CLA can accurately predict EC numbers up to the fourth level, significantly enhancing annotation efficiency and accuracy, which makes it a highly efficient and precise functional annotation tool for enzymology research and applications.
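The pairing step described in the abstract can be illustrated with a minimal sketch. It assumes that two sequences form a positive pair when their EC numbers agree down to a chosen level and a negative pair otherwise, and it scores pairs with a SupCon-style loss on cosine similarities. The abstract does not state the pairing rule, the loss, or the temperature, so `ec_prefix`, `make_pairs`, `contrastive_loss`, and the toy 2-D vectors (standing in for ESM2 sequence embeddings) are all hypothetical and not the authors' implementation.

```python
import math
from itertools import combinations


def ec_prefix(ec_number, level):
    """Keep the first `level` dot-separated fields of an EC number."""
    return ".".join(ec_number.split(".")[:level])


def make_pairs(labels, level=4):
    """Split all index pairs into positives (same EC prefix) and negatives."""
    positives, negatives = [], []
    for i, j in combinations(range(len(labels)), 2):
        same = ec_prefix(labels[i], level) == ec_prefix(labels[j], level)
        (positives if same else negatives).append((i, j))
    return positives, negatives


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(embeddings, labels, level=4, temperature=0.1):
    """SupCon-style loss averaged over anchors that have at least one positive."""
    keys = [ec_prefix(label, level) for label in labels]
    n = len(embeddings)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if keys[j] == keys[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        # Temperature-scaled similarities between the anchor and every other sample.
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature for j in others}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        # Mean negative log-probability of picking a positive among all others.
        losses.append(-sum(logits[j] - log_denominator for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D vectors standing in for sequence embeddings (hypothetical data).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered = ["1.1.1.1", "1.1.1.1", "2.7.11.1", "2.7.11.1"]  # classes match the geometry
scrambled = ["1.1.1.1", "2.7.11.1", "1.1.1.1", "2.7.11.1"]  # classes cut across it

print(ec_prefix("3.4.21.4", 2))  # prints 3.4
print(make_pairs(clustered))     # positives (0, 1) and (2, 3); the other four pairs are negatives
print(contrastive_loss(vectors, clustered) < contrastive_loss(vectors, scrambled))  # prints True
```

The loss is small when sequences sharing an EC number already sit close together and large when they do not, which is the signal a contrastive objective would use to pull same-function embeddings together. Lowering `level` from 4 toward 1 relaxes the definition of a positive pair from an exact EC match to a shared enzyme class.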
About the journal:
PROTEINS: Structure, Function, and Bioinformatics publishes original reports of significant experimental and analytic research in all areas of protein research: structure, function, computation, genetics, and design. The journal encourages reports that present new experimental or computational approaches for interpreting and understanding data from biophysical chemistry, structural studies of proteins and macromolecular assemblies, alterations of protein structure and function engineered through techniques of molecular biology and genetics, functional analyses under physiologic conditions, as well as the interactions of proteins with receptors, nucleic acids, or other specific ligands or substrates. Research in protein and peptide biochemistry directed toward synthesizing or characterizing molecules that simulate aspects of the activity of proteins, or that act as inhibitors of protein function, is also within the scope of PROTEINS. In addition to full-length reports, short communications (usually not more than 4 printed pages) and prediction reports are welcome. Reviews are typically by invitation; authors are encouraged to submit proposed topics for consideration.