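COLOR: A Compositional Linear Operation-Based Representation of Protein Sequences for Identification of Monomer Contributions to Properties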

IF 5.3 · Zone 2 (Chemistry) · JCR Q1: Chemistry, Medicinal

Journal of Chemical Information and Modeling · Pub Date: 2025-04-24 · DOI: 10.1021/acs.jcim.5c00205

Akash Pandey, Wei Chen, and Sinan Keten*

Volume 65, Issue 9, pages 4320–4333

Citations: 0

Abstract



The properties of biological materials like proteins and nucleic acids are largely determined by their primary sequence. Certain segments in the sequence strongly influence specific functions, but identifying these segments, or so-called motifs, is challenging due to the complexity of sequential data. While deep learning (DL) models can accurately capture sequence–property relationships, the degree of nonlinearity in these models limits the assessment of monomer contributions to a property, a critical step in identifying key motifs. Recent advances in explainable AI (XAI) offer attention- and gradient-based methods for estimating monomeric contributions. However, these methods are primarily applied to classification tasks, such as binding site identification, where they achieve limited accuracy (40–45%) and rely on qualitative evaluations. To address these limitations, we introduce a DL model with interpretable steps, enabling direct tracing of monomeric contributions. Inspired by the masking technique commonly used in vision and natural language processing domains, we propose a new metric $(I)$ for quantitative analysis on datasets mainly containing distinct properties of anticancer peptides (ACP), antimicrobial peptides (AMP), and collagen. Our model exhibits 22% higher explainability than the gradient- and attention-based state-of-the-art models, recognizes critical motifs (RRR, RRI, and RSS) that significantly destabilize ACPs, and identifies motifs in AMPs that are 50% more effective in converting non-AMPs to AMPs. These findings highlight the potential of our model in guiding mutation strategies for designing protein-based biomaterials.
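The abstract gives no implementation details, so the following is only a toy sketch of the general idea behind masking-based attribution: mask one residue at a time and record how much the predicted property changes. It is not the COLOR model or the paper's metric $(I)$. The additive scoring function, the weight table `TOY_WEIGHTS`, the mask token `X`, and the function names are all invented for illustration.

```python
# Toy illustration only: these weights are invented, not taken from the paper.
TOY_WEIGHTS = {"R": -0.5, "I": -0.25, "S": -0.125, "K": 0.25, "A": 0.0, "G": 0.125}
MASK = "X"  # hypothetical mask token; it has no weight, so it contributes 0.0


def score(seq):
    """Additive toy model: predicted property = sum of per-residue weights."""
    return sum(TOY_WEIGHTS.get(res, 0.0) for res in seq)


def masking_contributions(seq):
    """Contribution of position i = score(seq) - score(seq with position i masked)."""
    base = score(seq)
    contributions = []
    for i in range(len(seq)):
        masked = seq[:i] + MASK + seq[i + 1:]
        contributions.append(base - score(masked))
    return contributions


def top_motif(seq, k=3):
    """Return the k-residue window whose summed contribution is most negative."""
    contrib = masking_contributions(seq)
    windows = [(sum(contrib[i:i + k]), seq[i:i + k]) for i in range(len(seq) - k + 1)]
    return min(windows)[1]


if __name__ == "__main__":
    peptide = "GKRRIAK"  # made-up sequence
    print(masking_contributions(peptide))
    print(top_motif(peptide))
```

Because this toy model is purely additive, the masking difference at each position equals that residue's weight exactly. In a nonlinear model the two generally differ, which is the difficulty the abstract attributes to standard DL models.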


Source journal: Journal of Chemical Information and Modeling (Chemistry: Chemistry, Multidisciplinary)
CiteScore: 9.80
Self-citation rate: 10.70%
Articles published: 529
Review time: 1.4 months

About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.