Generative Model for Small Molecules with Latent Space RL Fine-Tuning to Protein Targets
Ulrich A. Mbou Sob, Qiulin Li, Miguel Arbesú, Oliver Bent, Andries P. Smit, Arnu Pretorius
arXiv:2407.13780
Abstract
A specific challenge with deep learning approaches for molecule generation is generating both syntactically valid and chemically plausible molecular string representations. To address this, we propose a novel generative latent-variable transformer model for small molecules that leverages a recently proposed molecular string representation called SAFE. We introduce a modification to SAFE to reduce the number of invalid fragmented molecules generated during training and use this to train our model. Our experiments show that our model can generate novel molecules with a validity rate > 90% and a fragmentation rate < 1% by sampling from a latent space. By fine-tuning the model using reinforcement learning to improve molecular docking, we significantly increase the number of hit candidates for five specific protein targets compared to the pre-trained model, nearly doubling this number for certain targets. Additionally, our top 5% mean docking scores are comparable to the current state-of-the-art (SOTA), and we marginally outperform SOTA on three of the five targets.
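
The abstract reports a validity rate above 90% and a fragmentation rate below 1% for sampled molecules. The paper's exact metric definitions are not given here, but a common way to compute these quantities is with RDKit: validity as the fraction of generated strings that parse into a molecule, and fragmentation as the fraction of valid molecules that consist of more than one disconnected fragment. The sketch below illustrates this convention and is an assumption, not the authors' evaluation code; SAFE strings would typically be decoded to SMILES before such a check.

```python
# Minimal sketch (assumed metric definitions, not the authors' code):
# validity = parseable strings / all generated strings,
# fragmentation = multi-fragment molecules / valid molecules.
from rdkit import Chem


def validity_and_fragmentation(smiles_list):
    """Return (validity_rate, fragmentation_rate) for a list of SMILES strings."""
    if not smiles_list:
        return 0.0, 0.0
    n_valid = 0
    n_fragmented = 0
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # None for syntactically invalid strings
        if mol is None:
            continue
        n_valid += 1
        # More than one connected component means the output is a fragmented molecule.
        if len(Chem.GetMolFrags(mol)) > 1:
            n_fragmented += 1
    validity_rate = n_valid / len(smiles_list)
    fragmentation_rate = n_fragmented / n_valid if n_valid else 0.0
    return validity_rate, fragmentation_rate


if __name__ == "__main__":
    # Toy examples: one invalid string and one two-fragment molecule ("CC.O").
    samples = ["CCO", "c1ccccc1O", "C1CC1C(=O)N", "not_a_smiles", "CC.O"]
    print(validity_and_fragmentation(samples))
```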