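Prediction of Physical Characteristics of Disordered Proteins Using Molecular Simulation and Physics-Informed Multiple Machine Learning Strategies.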

IF 5.4 | Zone 2, Chemistry | Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY
Diego Linares Gonzalez, Shahana Ibrahim, Swarnadeep Seth, George Atia, Aniket Bhattacharya
Journal: Biomacromolecules. Publication date: 2025-10-14. Publication type: Journal Article. DOI: 10.1021/acs.biomac.5c01118 (https://doi.org/10.1021/acs.biomac.5c01118)
Citations: 0

Abstract

Prediction of Physical Characteristics of Disordered Proteins Using Molecular Simulation and Physics-Informed Multiple Machine Learning Strategies.

We introduce a novel hybrid machine learning (ML) framework to predict the radius of gyration and other conformational properties of intrinsically disordered proteins (IDPs). Our model integrates sequence information with physical features derived from a coarse-grained model validated by experimental data. Specifically, we combine hidden states from sequence-based models with 23 physical features projected into a shared latent space, and apply an attention mechanism that assigns weights to each residue to highlight the most informative regions of the sequence. This attention-guided fusion significantly improves predictive accuracy across multiple metrics, including mean absolute percentage error and mean squared error, while also enhancing confidence in the predictions. We trained and evaluated our models on Brownian dynamics (BD) simulation results for approximately 7000 IDPs from the MobiDB database (each with a disorder score above 99%). We find that sequence-based models consistently outperform feature-only models, with the GRU achieving the best performance among sequence-only approaches. Moreover, combining sequence and feature information further improves accuracy across all architectures, with the hybrid biGRU model delivering the best overall predictive performance. SHAP analysis reveals the relative importance of physical features, offering model explainability and guiding feature selection. Notably, using a small number of top features often reduces model complexity and improves generalization. Furthermore, an integrated gradient analysis reveals that, in addition to the length of the IDPs, three parameters play a key role in the ML predictions: the sequence charge decoration (SCD) and sequence hydropathy decoration (SHD) parameters, and the charge asymmetry parameter f*. Our framework provides a fast, interpretable, and scalable tool for predicting IDP behavior, enabling efficient initial screening prior to costly molecular simulations.
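
The attention-guided fusion and the two error metrics named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how the projected features are merged with the hidden states, so the element-wise addition below is an assumption, as are the function names (`fuse`, `attention_pool`, `mape`, `mse`), the single learned `score_vector`, and the toy dimensions (3 residues, 3 features instead of 23). In a real model the hidden states would come from a GRU or biGRU and the weights would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attention_pool(states, score_vector):
    """Score each residue state, softmax the scores, return the weighted sum."""
    scores = [sum(w * h for w, h in zip(score_vector, s)) for s in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(a * s[k] for a, s in zip(weights, states)) for k in range(dim)]
    return pooled, weights


def fuse(hidden_states, features, projection, score_vector):
    """Project the physical features into the latent space, add them to every
    per-residue hidden state (assumed fusion rule), then attention-pool."""
    projected = matvec(projection, features)
    fused = [[h + p for h, p in zip(state, projected)] for state in hidden_states]
    return attention_pool(fused, score_vector)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    hidden = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5]]    # 3 residues, latent dim 2
    features = [1.0, 0.5, -0.2]                       # toy stand-in for 23 features
    projection = [[0.2, 0.0, 0.1], [0.0, 0.3, 0.0]]   # maps 3 features -> 2 dims
    score_vector = [1.0, -1.0]                        # hypothetical attention weights

    pooled, weights = fuse(hidden, features, projection, score_vector)
    print("per-residue attention weights:", weights)
    print("pooled representation:", pooled)

    # Hypothetical radius-of-gyration targets and predictions.
    print("MAPE (%):", mape([20.0, 40.0], [22.0, 38.0]))
    print("MSE:", mse([20.0, 40.0], [22.0, 38.0]))
```

The per-residue weights returned by `fuse` are what make this kind of model inspectable: residues with larger weights contribute more to the pooled vector that a downstream regression head would map to the predicted property.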

Source journal
Biomacromolecules (Chemistry: Polymer Science)
CiteScore: 10.60
Self-citation rate: 4.80%
Articles published: 417
Review time: 1.6 months
About the journal: Biomacromolecules is a leading forum for the dissemination of cutting-edge research at the interface of polymer science and biology. Submissions to Biomacromolecules should contain strong elements of innovation in terms of macromolecular design, synthesis and characterization, or in the application of polymer materials to biology and medicine. Topics covered by Biomacromolecules include, but are not exclusively limited to: sustainable polymers, polymers based on natural and renewable resources, degradable polymers, polymer conjugates, polymeric drugs, polymers in biocatalysis, biomacromolecular assembly, biomimetic polymers, polymer-biomineral hybrids, biomimetic-polymer processing, polymer recycling, bioactive polymer surfaces, original polymer design for biomedical applications such as immunotherapy, drug delivery, gene delivery, antimicrobial applications, diagnostic imaging and biosensing, polymers in tissue engineering and regenerative medicine, polymeric scaffolds and hydrogels for cell culture and delivery.