Diego Linares Gonzalez, Shahana Ibrahim, Swarnadeep Seth, George Atia, Aniket Bhattacharya
Biomacromolecules, published 2025-10-14. DOI: 10.1021/acs.biomac.5c01118 (https://doi.org/10.1021/acs.biomac.5c01118)
Prediction of Physical Characteristics of Disordered Proteins Using Molecular Simulation and Physics-Informed Multiple Machine Learning Strategies.
We introduce a novel hybrid machine learning (ML) framework to predict the radius of gyration and other conformational properties of intrinsically disordered proteins (IDPs). Our model integrates sequence information with physical features derived from a coarse-grained model validated by experimental data. Specifically, we combine hidden states from sequence-based models with 23 physical features projected into a shared latent space, and apply an attention mechanism that assigns weights to each residue to highlight the most informative regions of the sequence. This attention-guided fusion significantly improves predictive accuracy across multiple metrics, including mean absolute percentage error and mean squared error, while also enhancing confidence in the predictions. We trained and evaluated our models on Brownian dynamics (BD) simulation results for approximately 7000 IDPs from the MobiDB database (each with a disorder score >99%). We find that sequence-based models consistently outperform feature-only models, with the GRU achieving the best performance among sequence-only approaches. Moreover, combining sequence and feature information further improves accuracy across all architectures, with the hybrid biGRU model delivering the best overall predictive performance. SHAP analysis reveals the relative importance of the physical features, offering model explainability and guiding feature selection. Notably, using a small number of top features often reduces model complexity and improves generalization. Furthermore, an integrated gradient analysis reveals that, in addition to the length of the IDPs, three parameters play a key role in the ML predictions: the sequence charge decoration (SCD) and sequence hydropathy decoration (SHD) parameters, and the charge asymmetry parameter f*. Our framework provides a fast, interpretable, and scalable tool for predicting IDP behavior, enabling efficient initial screening prior to costly molecular simulations.
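The attention-guided fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the abstract does not specify the form of the attention, so the additive scoring used here, the latent dimension, the final linear read-out, and all names (`init_params`, `attention_fusion`, `W_f`, `W_h`, `v`, `w_out`) are assumptions made for illustration. The per-residue hidden states, which in the paper come from a sequence model such as a GRU or biGRU, are taken as a given array, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def init_params(rng, hidden_dim, n_features):
    """Random, untrained weights (hypothetical shapes, for illustration only)."""
    return {
        "W_f": rng.normal(scale=0.1, size=(hidden_dim, n_features)),
        "b_f": np.zeros(hidden_dim),
        "W_h": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
        "v": rng.normal(scale=0.1, size=hidden_dim),
        "w_out": rng.normal(scale=0.1, size=2 * hidden_dim),
        "b_out": 0.0,
    }


def attention_fusion(hidden, features, params):
    """Fuse per-residue hidden states with a global physical-feature vector.

    hidden:   (L, H) array, one hidden state per residue (assumed given)
    features: (F,) array of physical features (F = 23 in the abstract)
    Returns a scalar prediction and the (L,) per-residue attention weights.
    """
    # Project the physical features into the same latent space as the hidden states.
    feat_latent = np.tanh(params["W_f"] @ features + params["b_f"])        # (H,)
    # Additive attention (an assumption): score each residue given the features.
    scores = np.tanh(hidden @ params["W_h"] + feat_latent) @ params["v"]   # (L,)
    weights = softmax(scores)                                              # (L,)
    # Attention-weighted summary of the sequence.
    context = weights @ hidden                                             # (H,)
    # Concatenate both views and apply a linear read-out (e.g. for radius of gyration).
    fused = np.concatenate([context, feat_latent])                         # (2H,)
    prediction = params["w_out"] @ fused + params["b_out"]
    return prediction, weights
```

For a sequence of 50 residues with a hidden size of 8, `attention_fusion(hidden, features, params)` would take `hidden` of shape `(50, 8)` and `features` of shape `(23,)`, and return one predicted value together with 50 non-negative weights that sum to one. Inspecting those weights is what would allow the most informative residues to be highlighted.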
About the journal:
Biomacromolecules is a leading forum for the dissemination of cutting-edge research at the interface of polymer science and biology. Submissions to Biomacromolecules should contain strong elements of innovation in terms of macromolecular design, synthesis and characterization, or in the application of polymer materials to biology and medicine.
Topics covered by Biomacromolecules include, but are not exclusively limited to: sustainable polymers, polymers based on natural and renewable resources, degradable polymers, polymer conjugates, polymeric drugs, polymers in biocatalysis, biomacromolecular assembly, biomimetic polymers, polymer-biomineral hybrids, biomimetic-polymer processing, polymer recycling, bioactive polymer surfaces, original polymer design for biomedical applications such as immunotherapy, drug delivery, gene delivery, antimicrobial applications, diagnostic imaging and biosensing, polymers in tissue engineering and regenerative medicine, polymeric scaffolds and hydrogels for cell culture and delivery.