Cong Fu, Xiner Li, Blake Olson, Heng Ji, Shuiwang Ji
arXiv - QuanBio - Biomolecules, 2024-08-19. https://doi.org/arxiv-2408.09730
Fragment and Geometry Aware Tokenization of Molecules for Structure-Based Drug Design Using Language Models
Structure-based drug design (SBDD) is crucial for developing specific and
effective therapeutics against protein targets but remains challenging due to
complex protein-ligand interactions and vast chemical space. Although language
models (LMs) have excelled in natural language processing, their application in
SBDD is underexplored. To bridge this gap, we introduce Frag2Seq, a method
that applies LMs to SBDD by generating molecules fragment by fragment,
where fragments correspond to functional modules. We transform 3D
molecules into fragment-informed sequences using SE(3)-equivariant molecule and
fragment local frames, extracting SE(3)-invariant sequences that preserve
geometric information of 3D fragments. Furthermore, we incorporate protein
pocket embeddings obtained from a pre-trained inverse folding model into the
LMs via cross-attention to capture protein-ligand interaction, enabling
effective target-aware molecule generation. By combining LMs with
fragment-based generation and effective protein context encoding, our model
achieves the best performance on Vina binding score and on chemical properties
such as QED and Lipinski, demonstrating its efficacy in generating
drug-like ligands with higher binding affinity against target proteins.
Moreover, our method exhibits higher sampling efficiency than
atom-based autoregressive and diffusion baselines, with up to ~300x speedup.
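
The idea of extracting SE(3)-invariant coordinates from an SE(3)-equivariant local frame can be illustrated with a minimal sketch. This is not the Frag2Seq implementation: the abstract does not say how frames are chosen or how coordinates are turned into tokens, so the three-anchor Gram-Schmidt frame and the names `local_frame`, `invariant_coords`, and `random_rotation` below are assumptions made purely for illustration.

```python
import numpy as np


def local_frame(p0, p1, p2):
    """Build a right-handed orthonormal frame from three non-collinear points.

    Assumption for illustration: the frame is anchored at p0, with the first
    axis along p1 - p0 and the second obtained by Gram-Schmidt from p2 - p0.
    """
    e1 = p1 - p0
    e1 = e1 / np.linalg.norm(e1)
    v = p2 - p0
    e2 = v - np.dot(v, e1) * e1  # remove the component along e1
    e2 = e2 / np.linalg.norm(e2)
    e3 = np.cross(e1, e2)  # completes a right-handed frame
    return p0, np.stack([e1, e2, e3])  # rows are the frame axes


def invariant_coords(points, anchors=(0, 1, 2)):
    """Express all points in the local frame defined by three anchor points.

    The frame rotates and translates with the points (equivariant), so the
    coordinates measured in it do not change (invariant).
    """
    points = np.asarray(points, dtype=float)
    origin, axes = local_frame(*(points[i] for i in anchors))
    return (points - origin) @ axes.T


def random_rotation(rng):
    """Sample a proper rotation matrix (determinant +1) via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0  # flip one axis to turn a reflection into a rotation
    return q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(6, 3))  # hypothetical 3D positions
    rot, shift = random_rotation(rng), rng.normal(size=3)
    moved = pts @ rot.T + shift  # same geometry, different global pose
    print(np.allclose(invariant_coords(pts), invariant_coords(moved)))
```

Because the frame is built from the points themselves, any global rotation or translation of the molecule leaves the resulting coordinates unchanged, which is the property that lets a sequence model consume them without depending on the input pose.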