Llamol: a dynamic multi-conditional generative transformer for de novo molecular design

Impact factor: 7.1; CAS journal ranking: Zone 2 (Chemistry); JCR: Q1 (Chemistry, Multidisciplinary)
Niklas Dobberstein, Astrid Maass, Jan Hamaekers
Journal of Cheminformatics, vol. 16, no. 1. Published 2024-06-21. Journal article.
DOI: 10.1186/s13321-024-00863-8
Article page: https://link.springer.com/article/10.1186/s13321-024-00863-8
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00863-8
Citations: 0

Abstract

Generative models have demonstrated substantial promise in Natural Language Processing (NLP) and have found application in designing molecules, as seen in Generative Pre-trained Transformer (GPT) models. In our efforts to develop such a tool for exploring the organic chemical space in search of potentially electro-active compounds, we present Llamol, a single novel generative transformer model based on the Llama 2 architecture, which was trained on a 12.5M superset of organic compounds drawn from diverse public sources. To allow for maximum flexibility in usage and robustness in view of potentially incomplete data, we introduce Stochastic Context Learning (SCL) as a new training procedure. We demonstrate that the resulting model adeptly handles single- and multi-conditional organic molecule generation with up to four conditions, and more are possible. The model generates valid molecular structures in SMILES notation while flexibly incorporating up to three numerical properties and/or one token sequence into the generative process, as requested. The generated compounds are very satisfactory in all scenarios tested. In detail, we showcase the model's capability to utilize token sequences for conditioning, either individually or in combination with numerical properties, making Llamol a potent tool for de novo molecule design that is easily expandable with new properties.

Scientific contribution: We developed Llamol, a novel generative transformer model based on the Llama 2 architecture and trained on a diverse set of 12.5M organic compounds. It introduces Stochastic Context Learning (SCL) as a new training procedure, allowing flexible and robust generation of valid organic molecules under multiple conditions that can be combined in many ways, which makes it a potent tool for de novo molecular design.
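The abstract does not describe how Stochastic Context Learning works internally. One plausible reading, offered here purely as an assumption and not as the authors' implementation, is that each training sample sees a randomly chosen subset of its available conditions, so the model learns to generate with any combination of them and tolerates missing values. The sketch below illustrates that idea only; the function name `sample_context`, the property names `prop_a`, `prop_b`, `prop_c`, and the `keep_prob` parameter are all hypothetical.

```python
import random
from typing import Dict, Optional

# Hypothetical placeholders: the abstract does not name the three numerical properties.
NUMERIC_KEYS = ("prop_a", "prop_b", "prop_c")


def sample_context(
    numeric: Dict[str, Optional[float]],
    token_sequence: Optional[str],
    keep_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Randomly keep a subset of the conditions available for one sample.

    Conditions whose value is missing (None) are never included, which is
    how this sketch copes with incomplete data. This is an illustrative
    assumption, not the procedure from the paper.
    """
    if rng is None:
        rng = random.Random()
    context: Dict[str, object] = {}
    for key in NUMERIC_KEYS:
        value = numeric.get(key)
        # Keep each available numerical condition independently.
        if value is not None and rng.random() < keep_prob:
            context[key] = value
    # The optional token-sequence condition is kept or dropped the same way.
    if token_sequence is not None and rng.random() < keep_prob:
        context["token_sequence"] = token_sequence
    return context
```

Under this reading, a sample may be trained with no conditions, a single condition, or any mix of up to three numerical values and one token sequence, which matches the "up to four conditions" usage described in the abstract.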

Source journal
Journal of Cheminformatics
Categories: Chemistry, Multidisciplinary; Computer Science, Information Systems
CiteScore: 14.10
Self-citation rate: 7.00%
Articles published: 82
Review time: 3 months
Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.