Fragment and Geometry Aware Tokenization of Molecules for Structure-Based Drug Design Using Language Models

arXiv - QuanBio - Biomolecules | Pub Date: 2024-08-19 | DOI: arxiv-2408.09730

Cong Fu, Xiner Li, Blake Olson, Heng Ji, Shuiwang Ji


Citations: 0

Abstract


Fragment and Geometry Aware Tokenization of Molecules for Structure-Based Drug Design Using Language Models

Structure-based drug design (SBDD) is crucial for developing specific and effective therapeutics against protein targets but remains challenging due to complex protein-ligand interactions and vast chemical space. Although language models (LMs) have excelled in natural language processing, their application in SBDD is underexplored. To bridge this gap, we introduce a method, known as Frag2Seq, to apply LMs to SBDD by generating molecules in a fragment-based manner in which fragments correspond to functional modules. We transform 3D molecules into fragment-informed sequences using SE(3)-equivariant molecule and fragment local frames, extracting SE(3)-invariant sequences that preserve geometric information of 3D fragments. Furthermore, we incorporate protein pocket embeddings obtained from a pre-trained inverse folding model into the LMs via cross-attention to capture protein-ligand interaction, enabling effective target-aware molecule generation. Benefiting from employing LMs with fragment-based generation and effective protein context encoding, our model achieves the best performance on binding vina score and chemical properties such as QED and Lipinski, which shows our model's efficacy in generating drug-like ligands with higher binding affinity against target proteins. Moreover, our method also exhibits higher sampling efficiency compared to atom-based autoregressive and diffusion baselines with at most ~300x speedup.
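The abstract states that SE(3)-equivariant local frames yield SE(3)-invariant sequences, but it does not say how the frames are built. The following is a minimal sketch of the general idea only, not the Frag2Seq procedure. It assumes a frame defined by the first three non-collinear atoms, and the helper names `local_frame`, `invariant_coords`, and `rigid_move` are hypothetical. Coordinates expressed in such a frame should stay the same when the whole molecule is rotated and translated.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def local_frame(p0, p1, p2):
    """Right-handed orthonormal frame from three non-collinear points.

    Assumption for illustration: the frame is anchored on the first three
    atoms; the paper's actual frame construction is not given in the abstract.
    """
    x = unit(sub(p1, p0))
    z = unit(cross(x, sub(p2, p0)))
    y = cross(z, x)
    return p0, (x, y, z)

def invariant_coords(points, frame):
    """Project each point onto the frame axes, relative to the frame origin."""
    origin, axes = frame
    return [tuple(dot(sub(p, origin), a) for a in axes) for p in points]

def rigid_move(p, angle_z, angle_x, shift):
    """Rotate about z, then about x, then translate (a proper rigid motion)."""
    x, y, z = p
    c, s = math.cos(angle_z), math.sin(angle_z)
    x, y = c * x - s * y, s * x + c * y
    c, s = math.cos(angle_x), math.sin(angle_x)
    y, z = c * y - s * z, s * y + c * z
    return (x + shift[0], y + shift[1], z + shift[2])

# Toy coordinates, not taken from any real molecule.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (1.0, 2.0, 1.0)]
before = invariant_coords(atoms, local_frame(*atoms[:3]))

moved = [rigid_move(p, 0.7, -1.1, (3.0, -2.0, 5.0)) for p in atoms]
after = invariant_coords(moved, local_frame(*moved[:3]))

# The two coordinate lists agree up to floating-point error.
max_diff = max(abs(a - b) for u, v in zip(before, after) for a, b in zip(u, v))
print(max_diff < 1e-9)
```

Because the frame moves with the molecule, the projected coordinates depend only on the molecule's internal geometry. A language model could then consume a discretized form of these values without needing to learn rotational or translational symmetry from data.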


Source journal

arXiv - QuanBio - Biomolecules

Self-citation rate

0.00%

Publication count