{"title":"Small Molecule Optimization with Large Language Models","authors":"Philipp Guevorguian, Menua Bedrosian, Tigran Fahradyan, Gayane Chilingaryan, Hrant Khachatrian, Armen Aghajanyan","doi":"arxiv-2407.18897","DOIUrl":null,"url":null,"abstract":"Recent advancements in large language models have opened new possibilities\nfor generative molecular drug design. We present Chemlactica and Chemma, two\nlanguage models fine-tuned on a novel corpus of 110M molecules with computed\nproperties, totaling 40B tokens. These models demonstrate strong performance in\ngenerating molecules with specified properties and predicting new molecular\ncharacteristics from limited samples. We introduce a novel optimization\nalgorithm that leverages our language models to optimize molecules for\narbitrary properties given limited access to a black box oracle. Our approach\ncombines ideas from genetic algorithms, rejection sampling, and prompt\noptimization. It achieves state-of-the-art performance on multiple molecular\noptimization benchmarks, including an 8% improvement on Practical Molecular\nOptimization compared to previous methods. We publicly release the training\ncorpus, the language models and the optimization algorithm.","PeriodicalId":501266,"journal":{"name":"arXiv - QuanBio - Quantitative Methods","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Quantitative Methods","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.18897","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black-box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models, and the optimization algorithm.
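
To make the abstract's description concrete, here is a minimal, self-contained Python sketch of the general style of loop it describes: an LM-guided population is sampled, filtered against a budget-limited oracle, and the best molecules are fed back into the prompt. This is an illustration only, not the authors' released algorithm; `sample_from_lm`, the toy oracle, the threshold-based rejection step, and all parameters are assumptions made for the sake of a runnable example.

```python
import random

def sample_from_lm(prompt_molecules, n_samples):
    """Hypothetical stand-in for sampling SMILES strings from a language
    model conditioned on a prompt built from the current best molecules.
    A real implementation would query Chemlactica/Chemma; here we just
    mutate one character of a randomly chosen parent string."""
    alphabet = "CNOcno()=#123"
    samples = []
    for _ in range(n_samples):
        parent = random.choice(prompt_molecules)
        i = random.randrange(len(parent))
        samples.append(parent[:i] + random.choice(alphabet) + parent[i + 1:])
    return samples

def oracle(molecule):
    """Toy black-box property evaluator (here: heteroatom count). In the
    paper's setting this is the expensive oracle with a limited call budget."""
    return sum(1 for ch in molecule if ch in "NO")

def optimize(seed_pool, oracle_budget=200, batch_size=20, pool_size=10):
    pool = [(oracle(m), m) for m in seed_pool]   # scored population
    calls = len(pool)
    while calls < oracle_budget:
        # Prompt optimization (assumed form): condition generation on the
        # current top-scoring molecules.
        prompt_molecules = [m for _, m in sorted(pool, reverse=True)[:pool_size]]
        candidates = sample_from_lm(prompt_molecules, batch_size)
        # Rejection step (assumed form): discard candidates that do not
        # beat the worst member of the current pool.
        threshold = min(s for s, _ in pool)
        for cand in candidates:
            if calls >= oracle_budget:
                break
            score = oracle(cand)
            calls += 1
            if score > threshold:
                pool.append((score, cand))
        # Genetic-style selection: keep only the fittest individuals.
        pool = sorted(pool, reverse=True)[:pool_size]
    return max(pool)

best_score, best_molecule = optimize(["CCO", "CCN", "CCC"])
print(best_score, best_molecule)
```

The structure mirrors the three ingredients the abstract names: the top-k pool acts as a genetic population, the threshold filter plays the role of rejection sampling, and rebuilding the prompt from the best molecules each round is a simple form of prompt optimization.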