arXiv - QuanBio - Biomolecules: Latest Papers

Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-12 DOI: arxiv-2407.09357
Alexia Jolicoeur-Martineau, Aristide Baratin, Kisoo Kwon, Boris Knyazev, Yan Zhang
Abstract: Generating novel molecules is challenging, with most representations leading to generative models producing many invalid molecules. Spanning Tree-based Graph Generation (STGG) is a promising approach to ensure the generation of valid molecules, outperforming state-of-the-art SMILES and graph diffusion models for unconditional generation. In the real world, we want to be able to generate molecules conditional on one or multiple desired properties rather than unconditionally. Thus, in this work, we extend STGG to multi-property-conditional generation. Our approach, STGG+, incorporates a modern Transformer architecture, random masking of properties during training (enabling conditioning on any subset of properties and classifier-free guidance), an auxiliary property-prediction loss (allowing the model to self-criticize molecules and select the best ones), and other improvements. We show that STGG+ achieves state-of-the-art performance on in-distribution and out-of-distribution conditional generation, and reward maximization.
Citations: 0
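The STGG+ abstract above mentions random masking of properties during training and classifier-free guidance at sampling time. The following is a minimal sketch of those two ideas in their generic textbook form, not the authors' implementation. The function names `mask_properties` and `guided_logits`, the use of `None` as the mask marker, and the guidance weight `w` are all assumptions made for illustration.

```python
import random


def mask_properties(props, keep_prob, rng):
    """Randomly hide properties; None marks a masked (unconditioned) slot.

    Hypothetical helper: the paper's actual masking scheme may differ.
    """
    return {
        name: (value if rng.random() < keep_prob else None)
        for name, value in props.items()
    }


def guided_logits(cond, uncond, w):
    """Generic classifier-free guidance.

    w = 0 returns the unconditional logits, w = 1 the conditional ones,
    and w > 1 extrapolates past the conditional logits.
    """
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Property names here are placeholders, not taken from the paper.
    print(mask_properties({"logP": 2.1, "QED": 0.8}, keep_prob=0.5, rng=rng))
    print(guided_logits([2.0, 0.0], [1.0, 1.0], w=2.0))
```

Because any subset of properties can be masked during training, the same model can serve as both the conditional and the unconditional predictor that `guided_logits` combines.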
An efficient algorithm to compute the minimum free energy of interacting nucleic acid strands
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-12 DOI: arxiv-2407.09676
Ahmed Shalaby, Damien Woods
Abstract: The information-encoding molecules RNA and DNA form a combinatorially large set of secondary structures through nucleic acid base pairing. Thermodynamic prediction algorithms predict favoured, or minimum free energy (MFE), secondary structures, and can assign an equilibrium probability to any structure via the partition function: a Boltzmann-weighted sum over the set of secondary structures. MFE is NP-hard in the presence of pseudoknots, base pairings that violate a restricted planarity condition. However, unpseudoknotted structures are amenable to dynamic programming: for a single DNA/RNA strand there are polynomial time algorithms for MFE and partition function. For multiple strands, the problem is more complicated due to entropic penalties. Dirks et al. [SICOMP Review; 2007] showed that for O(1) strands, with N bases, there is a polynomial time in N partition function algorithm, however their technique did not generalise to MFE which they left open. We give the first polynomial time (O(N^4)) algorithm for unpseudoknotted multiple (O(1)) strand MFE, answering the open problem from Dirks et al. The challenge lies in considering rotational symmetry of secondary structures, a feature not immediately amenable to dynamic programming algorithms. Our proof has two main technical contributions: First, a polynomial upper bound on the number of symmetric secondary structures to be considered when computing rotational symmetry penalties. Second, that bound is leveraged by a backtracking algorithm to find the MFE in an exponential space of contenders. Our MFE algorithm has the same asymptotic run time as Dirks et al.'s partition function algorithm, suggesting efficient handling of rotational symmetry, although higher space complexity. It also seems reasonably tight in the number of strands since Condon, Hajiaghayi & Thachuk [DNA27, 2021] have shown that unpseudoknotted MFE is NP-hard for O(N) strands.
Citations: 0
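The Shalaby and Woods abstract above describes the partition function as a Boltzmann-weighted sum over secondary structures. The sketch below shows only that definition, applied to an explicitly enumerated list of structure free energies. It is not the paper's dynamic programming or backtracking algorithm, which avoids such enumeration. The function names, the default temperature, and the kcal/mol units are assumptions for illustration.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def boltzmann_probabilities(free_energies, temperature_k=310.15):
    """Equilibrium probability of each structure from its free energy.

    Energies are shifted by their minimum before exponentiation for
    numerical stability; the shift cancels in the normalised result.
    """
    rt = GAS_CONSTANT * temperature_k
    g_min = min(free_energies)
    weights = [math.exp(-(g - g_min) / rt) for g in free_energies]
    partition = sum(weights)  # partition function, up to the constant shift
    return [w / partition for w in weights]


def mfe_index(free_energies):
    """Index of the minimum free energy (MFE) structure."""
    return min(range(len(free_energies)), key=free_energies.__getitem__)


if __name__ == "__main__":
    energies = [-3.0, -1.0, 0.0]  # made-up values in kcal/mol
    print(mfe_index(energies), boltzmann_probabilities(energies))
```

The MFE structure is always the single most probable one, but its probability can still be small when many structures have similar energies, which is why both quantities are of interest.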
Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-12 DOI: arxiv-2407.09274
Zhiyuan Chen, Tianhao Chen, Chenggang Xie, Yang Xue, Xiaonan Zhang, Jingbo Zhou, Xiaomin Fang
Abstract: Proteins are fundamental components of biological systems and can be represented through various modalities, including sequences, structures, and textual descriptions. Despite the advances in deep learning and scientific large language models (LLMs) for protein research, current methodologies predominantly focus on limited specialized tasks -- often predicting one protein modality from another. These approaches restrict the understanding and generation of multimodal protein data. In contrast, large multimodal models have demonstrated potential capabilities in generating any-to-any content like text, images, and videos, thus enriching user interactions across various domains. Integrating these multimodal model technologies into protein research offers significant promise by potentially transforming how proteins are studied. To this end, we introduce HelixProtX, a system built upon the large multimodal model, aiming to offer a comprehensive solution to protein research by supporting any-to-any protein modality generation. Unlike existing methods, it allows for the transformation of any input protein modality into any desired protein modality. The experimental results affirm the advanced capabilities of HelixProtX, not only in generating functional descriptions from amino acid sequences but also in executing critical tasks such as designing protein sequences and structures from textual descriptions. Preliminary findings indicate that HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models. By integrating multimodal large models into protein research, HelixProtX opens new avenues for understanding protein biology, thereby promising to accelerate scientific discovery.
Citations: 0
Token-Mol 1.0: Tokenized drug design with large language model
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-10 DOI: arxiv-2407.07930
Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou
Abstract: Significant interest has recently arisen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures, as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems, thereby enabling learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we proposed the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol exhibits superior proficiency in handling a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts.
Citations: 0
SPIN: SE(3)-Invariant Physics Informed Network for Binding Affinity Prediction
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-10 DOI: arxiv-2407.11057
Seungyeon Choi, Sangmin Seo, Sanghyun Park
Abstract: Accurate prediction of protein-ligand binding affinity is crucial for rapid and efficient drug development. Recently, the importance of predicting binding affinity has led to increased attention on research that models the three-dimensional structure of protein-ligand complexes using graph neural networks to predict binding affinity. However, traditional methods often fail to accurately model the complex's spatial information or rely solely on geometric features, neglecting the principles of protein-ligand binding. This can lead to overfitting, resulting in models that perform poorly on independent datasets and ultimately reducing their usefulness in real drug development. To address this issue, we propose SPIN, a model designed to achieve superior generalization by incorporating various inductive biases applicable to this task, beyond merely training on empirical data from datasets. For prediction, we defined two types of inductive biases: a geometric perspective that maintains consistent binding affinity predictions regardless of the complex's rotations and translations, and a physicochemical perspective that necessitates minimal binding free energy along their reaction coordinate for effective protein-ligand binding. These prior knowledge inputs enable SPIN to outperform comparative models in benchmark sets such as CASF-2016 and CSAR HiQ. Furthermore, we demonstrated the practicality of our model through virtual screening experiments and validated the reliability and potential of our proposed model based on experiments assessing its interpretability.
Citations: 0
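The SPIN abstract above asks for predictions that do not change when the complex is rotated or translated. One common way to get this property, shown here only as an illustration and not as the SPIN architecture, is to feed a model features that are themselves SE(3)-invariant, such as interatomic distances. The function names and the toy coordinates are assumptions.

```python
import math


def pairwise_distances(points):
    """All interatomic distances, an SE(3)-invariant description of a point set."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


def rotate_z_and_translate(points, angle, shift):
    """Apply a rigid motion: rotation about the z axis, then a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 1.0)]  # toy coordinates
    moved = rotate_z_and_translate(atoms, angle=0.7, shift=(3.0, -1.0, 2.0))
    before, after = pairwise_distances(atoms), pairwise_distances(moved)
    print(all(math.isclose(a, b) for a, b in zip(before, after)))
```

Any function computed purely from such distances inherits the invariance, so it cannot overfit to the arbitrary orientation in which a complex happens to be stored.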
HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-09 DOI: arxiv-2407.06703
Gian Marco Visani, Michael N. Pun, William Galvin, Eric Daniel, Kevin Borisiak, Utheri Wagura, Armita Nourmohammad
Abstract: Predicting the stability and fitness effects of amino acid mutations in proteins is a cornerstone of biological discovery and engineering. Various experimental techniques have been developed to measure mutational effects, providing us with extensive datasets across a diverse range of proteins. By training on these data, traditional computational modeling and more recent machine learning approaches have advanced significantly in predicting mutational effects. Here, we introduce HERMES, a 3D rotationally equivariant structure-based neural network model for mutational effect and stability prediction. Pre-trained to predict amino acid propensity from its surrounding 3D structure, HERMES can be fine-tuned for mutational effects using our open-source code. We present a suite of HERMES models, pre-trained with different strategies, and fine-tuned to predict the stability effect of mutations. Benchmarking against other models shows that HERMES often outperforms or matches their performance in predicting mutational effect on stability, binding, and fitness. HERMES offers versatile tools for evaluating mutational effects and can be fine-tuned for specific predictive objectives.
Citations: 0
Differential Effects of Sequence-Local versus Nonlocal Charge Patterns on Phase Separation and Conformational Dimensions of Polyampholytes as Model Intrinsically Disordered Proteins
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-09 DOI: arxiv-2407.07226
Tanmoy Pal, Jonas Wessén, Suman Das, Hue Sun Chan
Abstract: Conformational properties of intrinsically disordered proteins (IDPs) are governed by a sequence-ensemble relationship. To differentiate the impact of sequence-local versus sequence-nonlocal features of an IDP's charge pattern on its conformational dimensions and its phase-separation propensity, the charge "blockiness" $\kappa$ and the nonlocality-weighted sequence charge decoration (SCD) parameters are compared for their correlations with isolated-chain radii of gyration ($R_{\rm g}$s) and upper critical solution temperatures (UCSTs) of polyampholytes modeled by random phase approximation, field-theoretic simulation, and coarse-grained molecular dynamics. SCD is superior to $\kappa$ in predicting $R_{\rm g}$ because SCD accounts for effects of contact order, i.e., nonlocality, on dimensions of isolated chains. In contrast, $\kappa$ and SCD are comparably good, though nonideal, predictors of UCST because frequencies of interchain contacts in the multiple-chain condensed phase are less sensitive to sequence positions than frequencies of intrachain contacts of an isolated chain, as reflected by $\kappa$ correlating better with condensed-phase interaction energy than SCD.
Citations: 0
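The Pal et al. abstract above relies on the sequence charge decoration (SCD) parameter without stating its formula. The sketch below assumes the form in which SCD is commonly written in the literature, $\mathrm{SCD} = \frac{1}{N}\sum_{i<j} q_i q_j |i-j|^{1/2}$, where $q_i$ is the charge of residue $i$ and $N$ is the chain length. The charge table (K and R as +1, D and E as -1, all else neutral) and the function name are assumptions for illustration, and the paper may use a different charge assignment.

```python
# Assumed residue charges; histidine and termini are treated as neutral here.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def sequence_charge_decoration(sequence):
    """SCD = (1/N) * sum over i<j of q_i * q_j * |i - j|**0.5."""
    charges = [CHARGE.get(residue, 0) for residue in sequence]
    n = len(charges)
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += charges[i] * charges[j] * (j - i) ** 0.5
    return total / n


if __name__ == "__main__":
    for seq in ("EKEK", "EEKK"):
        print(seq, sequence_charge_decoration(seq))
```

The square-root weight on sequence separation is what makes SCD sensitive to nonlocal charge patterning: the blocky sequence `EEKK` gives a more negative value than the strictly alternating `EKEK`, even though both have the same composition.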
Dihedral Angle Adherence: Evaluating Protein Structure Predictions in the Absence of Experimental Data
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-09 DOI: arxiv-2407.18336
Musa Azeem, Homayoun Valafar
Abstract: Determining the 3D structures of proteins is essential in understanding their behavior in the cellular environment. Computational methods of predicting protein structures have advanced, but assessing prediction accuracy remains a challenge. The traditional method, RMSD, relies on experimentally determined structures and lacks insight into improvement areas of predictions. We propose an alternative: analyzing dihedral angles, bypassing the need for the reference structure of an evaluated protein. Our method segments proteins into amino acid subsequences and searches for matches, comparing dihedral angles across numerous proteins to compute a metric using Mahalanobis distance. Evaluated on many predictions, our approach correlates with RMSD and identifies areas for prediction enhancement. This method offers a promising route for accurate protein structure prediction assessment and improvement.
Citations: 0
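The Azeem and Valafar abstract above scores predicted dihedral angles against those observed in matching subsequences using Mahalanobis distance. The following is a minimal two-dimensional sketch of that distance for a single (phi, psi) pair. It does not reproduce the paper's subsequence search or its final metric, it ignores the periodicity of angles, and it assumes the reference set is not degenerate (nonzero covariance determinant). The function name and the sample values are hypothetical.

```python
import math


def mahalanobis_2d(point, reference):
    """Mahalanobis distance from a 2D point to the distribution of reference points."""
    n = len(reference)
    mean_x = sum(x for x, _ in reference) / n
    mean_y = sum(y for _, y in reference) / n
    # Sample covariance matrix entries (n - 1 in the denominator).
    sxx = sum((x - mean_x) ** 2 for x, _ in reference) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in reference) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in reference) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^2 = v^T S^-1 v with S^-1 = (1/det) * [[syy, -sxy], [-sxy, sxx]].
    d_squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d_squared)


if __name__ == "__main__":
    # Made-up (phi, psi) values in degrees for matched reference fragments.
    observed = [(-60.0, -45.0), (-65.0, -40.0), (-58.0, -47.0), (-63.0, -42.0)]
    print(mahalanobis_2d((-90.0, 120.0), observed))
```

Unlike a plain Euclidean difference, this distance is scaled by how much the reference angles vary, so a deviation counts for more where the observed angles are tightly clustered.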
Single-Sequence-Based Protein Secondary Structure Prediction using One-Hot and Chemical Encodings of Amino Acids
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-06 DOI: arxiv-2407.05173
Hoa Trinh, Satish Kumar Thittamaranahalli
Abstract: In protein secondary structure prediction, each amino acid in sequence is typically treated as a distinct category and represented by a one-hot vector. In this study, we developed two novel chemical representations for amino acids utilizing molecular fingerprints and the dimensionality reduction algorithm FastMap. We demonstrate that the two new chemical encodings can provide additional information about the interactions of amino acids in sequences that an LSTM-based model cannot capture with one-hot encoding alone. Compared to the latest LSTM-based model used in the single-sequence-based method SPOT-1D-Single, our ensemble model utilizing one-hot and chemical encodings achieves better accuracy across most test sets while requiring approximately nine times fewer trainable parameters for each encoding model. Our single-sequence-based method is valuable for its simplicity, lower resource requirements, and independence from external sequence data. It is beneficial when quick or preliminary predictions are needed or when data on homologous sequences is scarce.
Citations: 0
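The Trinh and Thittamaranahalli abstract above takes one-hot encoding of amino acids as its baseline representation. A minimal sketch of that baseline follows; the ordering of the 20 standard residues and the function name are assumptions, and the paper's chemical encodings based on molecular fingerprints and FastMap are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by letter
INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional 0/1 row per residue.

    Raises KeyError for characters outside the 20 standard residues.
    """
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for residue, row in zip("ACY", one_hot_encode("ACY")):
        print(residue, row)
```

Every pair of distinct residues is equally far apart under this encoding, which is the limitation the abstract's chemical encodings are meant to address.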
Reinforcement Learning for Sequence Design Leveraging Protein Language Models
arXiv - QuanBio - Biomolecules Pub Date : 2024-07-03 DOI: arxiv-2407.03154
Jithendaraa Subramanian, Shivakanth Sujit, Niloy Irtisam, Umong Sain, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Riashat Islam
Abstract: Protein sequence design, determined by amino acid sequences, is essential to protein engineering problems in drug discovery. Prior approaches have resorted to evolutionary strategies or Monte-Carlo methods for protein design, but often fail to exploit the structure of the combinatorial search space, to generalize to unseen sequences. In the context of discrete black box optimization over large search spaces, learning a mutation policy to generate novel sequences with reinforcement learning is appealing. Recent advances in protein language models (PLMs) trained on large corpora of protein sequences offer a potential solution to this problem by scoring proteins according to their biological plausibility (such as the TM-score). In this work, we propose to use PLMs as a reward function to generate new sequences. Yet the PLM can be computationally expensive to query due to its large size. To this end, we propose an alternative paradigm where optimization can be performed on scores from a smaller proxy model that is periodically finetuned, jointly while learning the mutation policy. We perform extensive experiments on various sequence lengths to benchmark RL-based approaches, and provide comprehensive evaluations along biological plausibility and diversity of the protein. Our experimental results include favorable evaluations of the proposed sequences, along with high diversity scores, demonstrating that RL is a strong candidate for biological sequence design. Finally, we provide a modular open source implementation that can be easily integrated in most RL training loops, with support for replacing the reward model with other PLMs, to spur further research in this domain. The code for all experiments is provided in the supplementary material.
Citations: 0